= Keeping the bots green =

== Things that cause bots to go red ==
 * Commits that break the build
 * Commits that break tests
 * Flaky tests
 * Script bugs
 * Flaky machines, OS bugs
   * Kernel panics, overloaded system components
   * Limited disk space on build.webkit.org (currently being addressed)

== Improve build state feedback for committers ==
 * Build/test cycle is too damn long (4 or 5 hour delay to see if you broke anything)
   * Tester bots get behind
   * Testers don't test the latest build
 * Waterfall page doesn't show enough info
   * Need a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
 * If the bot is already red, it is hard to tell if your change made it worse
   * Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing)

== Gardening approaches ==
 * Apple's
   * 2 people watching the bots concurrently every week (just during regular work hours)
   * Builds hover close to green, but still go red multiple times a day
 * Chromium's
   * 3 sets of bots
     * chromium.org - ToT Chromium with a fixed WebKit
       * Very strict green policy, aggressively rolling out changes that make the bots red
       * Mostly green with spots of red
     * build.webkit.org - ToT WebKit with a fixed version of Chromium
       * Ad hoc maintenance
       * Red most of the time
     * Canaries - ToT Chromium with ToT WebKit
       * Rotating shifts of 1 person for 4 days at a time
       * Around-the-clock coverage
       * Most of the time is spent updating baselines because the Chromium bots run the pixel tests
       * Also red most of the time
       * This is what the gardeners look at first
 * Qt has Ossy
   * Leading a group of 6 people
   * They have found that it requires experience to identify which check-ins caused which failures
 * GTK
   * Makes use of TestFailures + build.webkit.org/waterfall
   * For a while they had one person per day gardening, but right now it is up to individual contributors

== Can we share more tools? ==
 * Ideally we should all use the same status page to determine if something broke
 * Darin's ideal tool
   * Identifies when something is broken that wasn't broken before
   * Presents the suspect set of revisions
   * Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
 * Garden-O-Matic
   * Built on top of code in webkitpy in the WebKit tree, designed to be used by any port
   * Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
     * FIXME: bugs for these?
   * Client-side tool, runs a local webserver
   * Lets you browse failing tests; one click applies the rebaseline to your local tree
 * We should sit down and merge changes made to Buildbot for build.chromium.org and build.webkit.org
   * Full Buildbot is not checked into WebKit, only some configuration files
   * For build.webkit.org, changing config files automatically restarts master
   * The Chromium infrastructure team tracks Buildbot ToT pretty closely
   * For build.webkit.org, can go back 25, 50, 100, etc. on the waterfall views
   * FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
 * Qt is working on a tool to run just the tests relevant to your change, based on code coverage
   * Still doesn't deal with flaky tests well

== Infrastructure/waterfall improvements ==
 * More EWS bots?
   * FIXME: Get the Mac EWS bots running the tests
 * Easier way to identify a commit that caused a test failure
 * Automatic notifications of build/test failure
   * The issue here is that when there were infrastructure problems, everyone got a notification
   * FIXME: We should bring back the automatic notifications from sheriffbot

== Distributing builds / testing (e.g. distcc) ==
 * Chromium does some distributed compilation, which speeds up builds quite a lot
 * Chromium sometimes splits up tests - runs half on one machine and half on another
 * dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync

== Pixel tests ==
 * Could try to convert more to ref tests
   * Garden-O-Matic will help with this
 * Are pixel tests worth it?
   * Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
   * Neither Chromium nor Apple has run a cost / benefit analysis of them
   * One gardener said that about 60% of the pixel test failures he saw when gardening pointed out real bugs
   * For a lot of the tests, it is difficult for the gardener to tell whether a change is actually a regression
     * This results in a lot of skipping

== Discussion of the No Commits Until It Is Green policy ==
 * Would need to be actively enforced by tools
 * Chromium has this policy today, but:
   * there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
   * they have been able to isolate machine-related flakiness
   * they have try jobs and unit tests, all of which must be run before checking in
 * Probably not feasible yet for WebKit anyway due to tooling
   * Need a way to grab new baselines for all ports
   * Need EWS support for tests for all ports

== Discussion of causes of flaky tests ==
 1. Lots of tests don't work well in parallel due to reliance on system components that get overloaded
   * dpranke may have something for this soon
 2. State left over from tests that ran earlier
   * e.g. a setTimeout that fires during the next test
   * Seems like we could improve our tools for that
 3. Memory corruption
   * dpranke is working on having NRWT restart DRT on directory boundaries, and found that it does reduce flakiness

== Workflow improvements ==
 * More emphasis on running tests before committing
 * Easier to generate platform-specific test results
 * Standard process
   * If a test is failing, there should be one standard way of dealing with it
   * We don't currently have a good understanding of what the best practice is (do you roll it out? skip? land failing results? how long do you wait after notifying the committer?)

== Skipped tests are technical debt ==
 * More emphasis on unskipping?
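
== Sketch: suspect revisions for a new failure ==

The "Darin's ideal tool" bullets in the tools discussion (spot what is newly broken, then present the suspect set of revisions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `BuildResult` and `suspect_ranges` are hypothetical names, not part of Buildbot, webkitpy, or Garden-O-Matic, and the sketch assumes integer revision numbers and one list of results per bot. It also ignores flakiness, so a flaky test would be reported as a new failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildResult:
    """One finished bot run. Hypothetical record, not a Buildbot type."""
    revision: int
    failing_tests: frozenset


def suspect_ranges(results):
    """Map each test failing in the latest run to its suspect revision range.

    Returns {test: (first_suspect, last_suspect)}: the revisions after the
    last run in which the test passed, up to and including the first run of
    the current failure streak. A test that failed in every known run maps
    to (None, last_suspect), because no passing run bounds the range.
    """
    runs = sorted(results, key=lambda r: r.revision)
    if not runs:
        return {}
    ranges = {}
    for test in runs[-1].failing_tests:
        first_failing = runs[-1].revision
        last_passing = None
        # Walk from newest to oldest until the test is seen passing.
        for run in reversed(runs):
            if test in run.failing_tests:
                first_failing = run.revision
            else:
                last_passing = run.revision
                break
        first_suspect = None if last_passing is None else last_passing + 1
        ranges[test] = (first_suspect, first_failing)
    return ranges


if __name__ == "__main__":
    # Made-up revisions and test names; the bot skipped revisions between runs.
    runs = [
        BuildResult(100, frozenset()),
        BuildResult(103, frozenset({"fast/a.html"})),
        BuildResult(107, frozenset({"fast/a.html", "fast/b.html"})),
    ]
    for test, span in sorted(suspect_ranges(runs).items()):
        print(test, span)
```

Because tester bots fall behind and do not test every revision, each range usually covers several commits. In the made-up data above, `fast/a.html` maps to revisions 101 through 103 and `fast/b.html` to 104 through 107. The gardener would still have to pick the most likely culprit within the range.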