Keeping the bots green
----------------------

Things that cause bots to go red
    Commits that break the build
    Commits that break tests
    Flaky tests
    Script bugs
    Flaky machines, OS bugs
        Kernel panics, overloaded system components
    Limited disk space on build.webkit.org (currently being addressed)


Improve build state feedback for committers
    Build/test cycle is too damn long (4 or 5 hour delay to see if you broke anything)
        Tester bots get behind
        Testers don't test the latest build
    Waterfall page doesn't show enough info
        Need a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
        If the bot is already red, it is hard to tell if your change made it worse
            Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing)
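    For illustration, a minimal sketch of that degree-of-breakage idea in Python; the function name and the "5 failing tests" cutoff are made up:

        # Hypothetical mapping from a builder's current failure count to a waterfall color.
        def breakage_color(num_failing_tests, build_broken=False):
            if build_broken:
                return "red"          # the build itself is broken
            if num_failing_tests == 0:
                return "green"
            if num_failing_tests <= 5:
                return "orange"       # only a few tests are failing
            return "red"              # lots of tests are failing

        if __name__ == "__main__":
            for failures in (0, 3, 42):
                print(failures, "->", breakage_color(failures))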


Gardening approaches
    Apple's
        2 people watching the bots concurrently every week (just during regular work hours)
        Builds hovering close to green, but still go red multiple times a day
    Chromium's
        3 sets of bots
            chromium.org - ToT Chromium with a fixed WebKit
                Very strict green policy, aggressively rolling out changes that make the bots red
                Mostly green with spots of red
            build.webkit.org - ToT WebKit with a fixed version of Chromium
                Ad hoc maintenance
                Red most of the time
            Canaries - ToT Chromium with ToT WebKit
                Rotating shifts of 1 person for 4 days at a time
                Around the clock coverage
                Most of the time spent updating baselines because the Chromium bots run the pixel tests
                Also red most of the time
                This is what the gardeners look at first
    Qt has Ossy
        Leading a group of 6 people
        They have found that it requires experience to identify which checkins caused which failures
    GTK
        Makes use of TestFailures + build.webkit.org/waterfall
        For a while they had one person per day gardening, but right now it is up to individual contributors


Can we share more tools?
    Ideally we should all use the same status page to determine if something broke
    Darin's ideal tool
        Identifies when something is broken that wasn't broken before
        Presents the suspect set of revisions
        Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
    Garden-O-Matic
        Built on top of code in webkitpy in the WebKit tree, designed to be used by any port
        Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
            FIXME: bugs for these?
        Client-side tool, runs a local webserver (a toy sketch of this shape follows this section)
        Allows you to browse failing tests and apply a rebaseline to your local tree with one click
    We should sit down and merge the changes made to buildbot for build.chromium.org and build.webkit.org
        Full Buildbot is not checked into WebKit, only some configuration files
        For build.webkit.org, changing the config files automatically restarts the master
        The Chromium infrastructure team tracks Buildbot ToT pretty closely
        For build.webkit.org, you can go back 25, 50, 100, etc. builds on the waterfall views
        FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
    Qt is working on a tool to run just the tests relevant to your change, based on code coverage (a rough sketch of the idea also follows this section)
        Still doesn't deal with flaky tests well
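    For illustration only, a toy Python stand-in for that client-side local-webserver shape; the port, endpoint, and data are made up, and this is not Garden-O-Matic's actual code:

        import json
        from http.server import BaseHTTPRequestHandler, HTTPServer

        FAILING_TESTS = ["fast/js/example.html"]   # placeholder data a real tool would fetch from the bots

        class GardeningHandler(BaseHTTPRequestHandler):
            # Serve the current list of failing tests; a real tool would also expose
            # an endpoint that a "rebaseline" button could hit to copy new results
            # into the local checkout.
            def do_GET(self):
                if self.path == "/failing":
                    body = json.dumps(FAILING_TESTS).encode()
                    self.send_response(200)
                    self.send_header("Content-Type", "application/json")
                    self.end_headers()
                    self.wfile.write(body)
                else:
                    self.send_response(404)
                    self.end_headers()

        if __name__ == "__main__":
            HTTPServer(("localhost", 8000), GardeningHandler).serve_forever()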
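
    A rough sketch of the coverage-based selection idea in Python; the coverage-map file and its format are assumptions, not part of the actual Qt tool:

        import json
        import subprocess

        # Source files touched by the working-tree change, according to git.
        def changed_files():
            out = subprocess.check_output(["git", "diff", "--name-only", "HEAD"]).decode()
            return set(line.strip() for line in out.splitlines() if line.strip())

        # Given a map from each test to the source files it exercised on a previous
        # full run, keep only the tests whose coverage overlaps the changed files.
        def select_tests(coverage_map_path="coverage_map.json"):
            with open(coverage_map_path) as f:
                coverage = json.load(f)   # e.g. {"fast/js/foo.html": ["Source/JavaScriptCore/runtime/JSObject.cpp"]}
            changed = changed_files()
            return sorted(test for test, files in coverage.items() if changed & set(files))

        if __name__ == "__main__":
            for test in select_tests():
                print(test)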


Infrastructure/waterfall improvements
    More EWS bots?
        FIXME: Get the Mac EWS bots running the tests
    Easier way to identify a commit that caused a test failure (a rough sketch of this flow follows this section)
    Automatic notifications of build/test failure
        The issue here is that when there are infrastructure problems, everything triggers a notification
        FIXME: We should bring back the automatic notifications from sheriffbot
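    For illustration, a sketch of that identify-and-notify flow in Python; the revision numbers, builder name, and committer map are placeholders:

        # Every revision that landed after the last green build, up to and including
        # the first red one, is a suspect.
        def suspect_revisions(last_green_rev, first_red_rev):
            return list(range(last_green_rev + 1, first_red_rev + 1))

        # Draft the message a gardener would send once the suspects are known.
        def draft_notification(builder, failing_tests, suspects, committer_by_rev):
            lines = ["%s went red; new failures: %s" % (builder, ", ".join(failing_tests)),
                     "Suspect revisions:"]
            for rev in suspects:
                lines.append("  r%d (%s)" % (rev, committer_by_rev.get(rev, "unknown committer")))
            return "\n".join(lines)

        if __name__ == "__main__":
            suspects = suspect_revisions(95000, 95004)
            print(draft_notification("Example Tester Bot",
                                     ["fast/js/example.html"],
                                     suspects,
                                     {95001: "alice", 95003: "bob"}))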


Distributing builds / testing (e.g. distcc)
    Chromium does some distributed compilation, which speeds up builds quite a lot
    Chromium sometimes splits up tests - runs half on one machine and half on another (a minimal sharding sketch follows this section)
    dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync
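    A minimal sketch of splitting a test list across machines; the round-robin split and shard count are illustrative, not how Chromium's infrastructure actually does it:

        # Shard shard_index (0-based) of num_shards gets every num_shards-th test.
        def shard(tests, num_shards, shard_index):
            return [test for i, test in enumerate(tests) if i % num_shards == shard_index]

        if __name__ == "__main__":
            tests = ["fast/css/a.html", "fast/css/b.html", "fast/js/c.html", "svg/d.svg"]
            for index in range(2):
                print("machine %d runs: %s" % (index, shard(tests, 2, index)))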


Pixel tests
    Could try to convert more of them to ref tests
        Garden-O-Matic will help with this
    Are pixel tests worth it?
        Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
        Neither Chromium nor Apple has run a cost/benefit analysis of them
        One gardener said that about 60% of the pixel test failures he saw when gardening pointed out real bugs
        For a lot of tests it is difficult for the gardener to tell whether a change is actually a regression (see the sketch after this section)
            This results in a lot of skipping
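    For illustration, a sketch of the kind of measurement that helps a gardener judge a pixel failure; it uses the Pillow library rather than the harness's own ImageDiff tool, and the file names and 0.1% cutoff are arbitrary:

        from PIL import Image, ImageChops

        # Percentage of pixels that differ between the actual output and the checked-in baseline.
        def percent_pixels_different(actual_path, expected_path):
            actual = Image.open(actual_path).convert("RGB")
            expected = Image.open(expected_path).convert("RGB")
            if actual.size != expected.size:
                return 100.0
            diff = ImageChops.difference(actual, expected)
            differing = sum(1 for pixel in diff.getdata() if pixel != (0, 0, 0))
            return 100.0 * differing / (actual.size[0] * actual.size[1])

        if __name__ == "__main__":
            pct = percent_pixels_different("test-actual.png", "test-expected.png")
            print("%.3f%% of pixels differ" % pct)
            print("worth a close look" if pct > 0.1 else "probably noise")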

Discussion of the No Commits Until It Is Green policy
    Would need to be actively enforced by tools
    Chromium has this policy today, but
        there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
        they have been able to isolate machine-related flakiness
        they have try jobs and unit tests, all required to be run before checking in
    Probably not feasible yet for WebKit anyway, due to tooling
        Need a way to grab new baselines for all ports
        Need EWS support for running tests for all ports


Discussion of causes of flaky tests
    1. Lots of tests don't work well in parallel due to reliance on system components that get overloaded
        dpranke may have something for this soon
    2. Interference from tests that ran earlier
        e.g. a setTimeout from one test that fires during the next test
        Seems like we could improve our tools for that
    3. Memory corruption
        dpranke is working on having NRWT restart DRT on directory boundaries, and has found that it does reduce flakiness
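    A sketch of the per-directory restart idea in Python; start_driver and run_test stand in for the real NRWT/DRT plumbing:

        import itertools
        import os

        # Give each test directory a fresh DumpRenderTree process so that memory
        # corruption caused by one directory's tests cannot bleed into the next.
        def run_tests_with_per_directory_driver(tests, start_driver, run_test):
            results = {}
            for directory, group in itertools.groupby(sorted(tests), key=os.path.dirname):
                driver = start_driver()       # fresh DRT for this directory
                try:
                    for test in group:
                        results[test] = run_test(driver, test)
                finally:
                    driver.stop()             # tear down before moving on
            return results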


Workflow improvements
    More emphasis on running tests before committing
    Easier ways to generate platform-specific test results
    Standard process
        If a test is failing, there should be one standard way of dealing with it
        We don't currently have a good understanding of what the best practice is (do you roll the change out? skip the test? land failing results? how long do you wait after notifying the committer?)


Skipped tests are technical debt
    More emphasis on unskipping?