Keeping the bots green
----------------------

Things that cause bots to go red
Commits that break the build
Commits that break tests
Flaky tests
Script bugs
Flaky machines, OS bugs
Kernel panics, overloaded system components
Limited disk space on build.webkit.org (currently being addressed)


Improve build state feedback for committers
The build/test cycle is too damn long (a 4- or 5-hour delay before you see whether you broke anything)
Tester bots get behind
Testers don't test the latest build
The waterfall page doesn't show enough info
The waterfall/console needs a lot of usability improvements to make it easier for developers to determine whether they broke things
If the bot is already red, it is hard to tell whether your change made it worse
Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing)
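The degree-of-breakage coloring could be as simple as a threshold on the failing-test count. A minimal sketch in Python; the thresholds and function name are invented for illustration, not part of any existing waterfall code:

```python
# Hypothetical sketch: map the number of failing tests on a builder to a
# waterfall cell color, so "a few failures" (orange) is distinguishable
# from "lots of failures" (red). The threshold value is made up.

def status_color(failing_tests, few_threshold=5):
    """Return a status color for a builder's latest test run."""
    if failing_tests == 0:
        return "green"
    if failing_tests <= few_threshold:
        return "orange"   # only a few tests failing
    return "red"          # lots of tests failing
```

A gardener scanning the waterfall could then triage red cells first and treat orange ones as likely single-patch breakage.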


Gardening approaches
Apple's
2 people watching the bots concurrently each week (just during regular work hours)
Builds hover close to green, but still go red multiple times a day
Chromium's
3 sets of bots
chromium.org - ToT Chromium with a fixed WebKit
Very strict green policy, aggressively reverting changes that make the bots red
Mostly green with spots of red
build.webkit.org - ToT WebKit with a fixed version of Chromium
Ad hoc maintenance
Red most of the time
Canaries - ToT Chromium with ToT WebKit
Rotating shifts of 1 person for 4 days at a time
Around-the-clock coverage
Most of the time is spent updating baselines, because the Chromium bots run the pixel tests
Also red most of the time
This is what the gardeners look at first
Qt has Ossy
Leading a group of 6 people
They have found that it requires experience to identify which checkins caused which failures
GTK
Makes use of TestFailures + build.webkit.org/waterfall
For a while they had one person gardening per day, but right now it is up to individual contributors


Can we share more tools?
Ideally we should all use the same status page to determine if something broke
Darin's ideal tool
Identifies when something is broken that wasn't broken before
Presents the suspect set of revisions
Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
Garden-O-Matic
Built on top of webkitpy code in the WebKit tree, designed to be usable by any port
Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
FIXME: bugs for these?
Client-side tool, runs a local webserver
Lets you browse failing tests and apply a rebaseline to your local tree with one click
We should sit down and merge the changes made to Buildbot for build.chromium.org and build.webkit.org
Full Buildbot is not checked into WebKit, only some configuration files
For build.webkit.org, changing config files automatically restarts the master
The Chromium infrastructure team tracks Buildbot ToT pretty closely
For build.webkit.org, you can go back 25, 50, 100, etc. builds on the waterfall views
FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
Qt is working on a tool that runs just the tests relevant to your change, based on code coverage
Still doesn't deal well with flaky tests


Infrastructure/waterfall improvements
More EWS bots?
FIXME: Get the Mac EWS bots running the tests
Easier way to identify the commit that caused a test failure
Automatic notifications of build/test failures
The issue here is that when there are infrastructure problems, everything generates a notification
FIXME: We should bring back the automatic notifications from sheriffbot


Distributing builds / testing (e.g. distcc)
Chromium does some distributed compilation, which speeds up builds quite a lot
Chromium sometimes splits up tests - runs half on one machine and half on another
dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync
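Splitting a suite across machines, as Chromium does, amounts to partitioning the test list into shards. A minimal sketch; the function name and round-robin policy are illustrative, not how Chromium's infrastructure actually assigns tests:

```python
# Hypothetical sketch of sharding a test list across N machines.
# Round-robin assignment keeps shard sizes within one test of each other.

def shard_tests(tests, num_shards):
    """Partition tests into num_shards lists, round-robin."""
    shards = [[] for _ in range(num_shards)]
    for i, test in enumerate(tests):
        shards[i % num_shards].append(test)
    return shards
```

The sync cost mentioned above comes from the other half of the problem: every shard's machine must be on the same build, and the results must be gathered back into one report.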


Pixel tests
Could try to convert more of them to ref tests
Garden-O-Matic will help with this
Are pixel tests worth it?
Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
Neither Chromium nor Apple has run a cost/benefit analysis of them
One gardener said that about 60% of the pixel test failures he saw while gardening pointed to real bugs
For a lot of the tests, it is difficult for the gardener to tell whether a change is actually a regression
This results in a lot of skipping

Discussion of the No Commits Until It Is Green policy
Would need to be actively enforced by tools
Chromium has this policy today, but
there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
they have been able to isolate machine-related flakiness
they have try jobs and unit tests, all required to be run before checking in
Probably not yet feasible for WebKit anyway, due to tooling
Need a way to grab new baselines for all ports
Need EWS support for tests for all ports
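Marking a test as flaky in test_expectations.txt means listing more than one acceptable outcome for it, so intermittent failures don't turn the bot red. An illustrative fragment, roughly in the format the Chromium port used at the time; the bug numbers, paths, and modifiers here are made up:

```
// Hypothetical test_expectations.txt entries (illustrative only).
// A test with multiple expected outcomes is treated as known-flaky.
BUGWK12345 WIN DEBUG : fast/dom/some-timer-test.html = PASS TIMEOUT
BUGCR67890 MAC : fast/canvas/some-pixel-test.html = PASS IMAGE
```

This is why the green policy is enforceable there: a genuinely new failure is distinguishable from an already-catalogued flaky one.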


Discussion of causes of flaky tests
1. Lots of tests don't work well in parallel, due to reliance on system components that get overloaded
dpranke may have something for this soon
2. State left over from tests that ran before
e.g. a setTimeout that fires during the next test
Seems like we could improve our tools for that
3. Memory corruption
dpranke is working on having NRWT restart DRT on directory boundaries, and has found that it does reduce flakiness
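The directory-boundary restart dpranke describes can be sketched as: group the (sorted) test list by directory, then run each group in a fresh DumpRenderTree process so corruption from one directory's tests cannot bleed into the next. This is a hedged sketch of the idea, not NRWT's actual implementation; the function names are invented and DRT is stood in for by a generic subprocess:

```python
# Hypothetical sketch: one DRT process per test directory.

import itertools
import os
import subprocess

def directory_groups(tests):
    """Group a sorted test list into consecutive runs sharing a directory."""
    return [list(group) for _, group in
            itertools.groupby(tests, key=os.path.dirname)]

def run_tests(tests, drt_command):
    for group in directory_groups(tests):
        # Fresh process per directory: leaked state and heap corruption
        # from the previous directory's tests dies with the old process.
        drt = subprocess.Popen(drt_command, stdin=subprocess.PIPE)
        # ... feed each test in `group` to DRT and collect results here ...
        drt.communicate()  # close stdin and wait before starting the next
```

The trade-off is process-startup overhead per directory, which the notes suggest is worth it for the reduction in flakiness.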


Workflow improvements
More emphasis on running tests before committing
Make it easier to generate platform-specific test results
Standard process
If a test is failing, there should be one standard way of dealing with it
We don't currently have a good understanding of what the best practice is (do you roll it out? skip it? land failing results? how long do you wait after notifying the committer?)


Skipped tests are technical debt
More emphasis on unskipping?