= Keeping the bots green =

== Things that cause bots to go red ==
 * Commits that break the build
 * Commits that break tests
 * Flaky tests
 * Script bugs
 * Flaky machines, OS bugs
   * Kernel panics, overloaded system components
   * Limited disk space on build.webkit.org (currently being addressed)


== Improve build state feedback for committers ==
 * Build/test cycle is too damn long (4 or 5 hour delay to see if you broke anything)
 * Tester bots get behind
 * Testers don't test the latest build
 * Waterfall page doesn’t show enough info
   * Need a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
 * If the bot is already red, it is hard to tell if your change made it worse
   * Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing)
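The degree-of-breakage coloring suggested above could be sketched roughly as follows. This is a minimal illustration only: the function name, the `orange_limit` threshold, and the choice to treat any build failure as red are assumptions, not anything the waterfall does today.

```python
def breakage_color(failing_tests, total_tests, build_ok=True, orange_limit=0.01):
    """Map one bot result to a status color.

    orange_limit is a hypothetical threshold: the fraction of failing
    tests at or below which the bot is shown as orange instead of red.
    """
    if not build_ok:
        # A broken build is always red, whatever the test counts say.
        return "red"
    if failing_tests == 0:
        return "green"
    if total_tests > 0 and failing_tests / total_tests <= orange_limit:
        # Only a few tests are failing.
        return "orange"
    # Lots of tests are failing.
    return "red"
```

With a scheme like this, a committer looking at an already-failing bot could see it move from orange to red if their change made things noticeably worse.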


== Gardening approaches ==
 * Apple’s
   * 2 people watching the bots concurrently every week (just during regular work hours)
   * Builds hover close to green, but still go red multiple times a day
 * Chromium’s
   * 3 sets of bots
     * chromium.org - ToT Chromium with a fixed WebKit
       * Very strict green policy, aggressively rolling out changes that make the bots red
       * Mostly green with spots of red
     * build.webkit.org - ToT WebKit with a fixed version of Chromium
       * Ad hoc maintenance
       * Red most of the time
     * Canaries - ToT Chromium with ToT WebKit
       * Rotating shifts of 1 person for 4 days at a time
       * Around the clock coverage
       * Most of the time spent updating baselines because the Chromium bots run the pixel tests
       * Also red most of the time
       * This is what the gardeners look at first
 * Qt has Ossy
   * Leading a group of 6 people
   * They have found that it requires experience to identify which checkins caused which failures
 * GTK
   * Makes use of TestFailures + build.webkit.org/waterfall
   * For a while they had one person per day gardening, but right now it is up to individual contributors


== Can we share more tools? ==
 * Ideally we should all use the same status page to determine if something broke
 * Darin's ideal tool
   * Identifies when something is broken that wasn't broken before
   * Presents the suspect set of revisions
   * Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
 * Garden-O-Matic
   * Built on top of code in webkitpy in the WebKit tree, designed to be used by any port
     * Currently only works for the Chromium port - needs adjustment for different result formats and URLs used by different ports
     * FIXME: bugs for these?
   * Client-side tool, runs a local webserver
   * Lets you browse failing tests and apply a rebaseline to your local tree with one click
 * We should sit down and merge changes made to buildbot for build.chromium.org and build.webkit.org
   * Full Buildbot is not checked into WebKit, only some configuration files
     * For build.webkit.org, changing config files automatically restarts master
     * The Chromium infrastructure team tracks Buildbot ToT pretty closely
   * For build.webkit.org, can go back 25, 50, 100, etc on the waterfall views
   * FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
 * Qt is working on a tool to run just the relevant tests to your change, based on code coverage
   * Still doesn't deal with flaky tests well
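The first two features of Darin's ideal tool (spotting something that is newly broken and presenting the suspect revisions) might look something like the sketch below. It assumes integer revision numbers and a per-bot history of `(revision, failing_tests)` pairs; the function name and data shape are hypothetical and do not correspond to any existing tool.

```python
def suspect_revisions(runs):
    """Find newly broken tests and the revisions that may have broken them.

    runs: list of (revision, failing_tests) in ascending revision order,
    one entry per completed run of a single bot (hypothetical format).
    Returns {test: (first_suspect, last_suspect)} for every test that
    fails in the latest run but passed in some earlier run.
    """
    if not runs:
        return {}
    latest_revision, latest_failures = runs[-1]
    suspects = {}
    for test in latest_failures:
        first_bad = latest_revision
        last_good = None
        # Walk backwards to find where the current streak of failures began.
        for revision, failing in reversed(runs):
            if test in failing:
                first_bad = revision
            else:
                last_good = revision
                break
        # Tests that fail in every known run are not "newly" broken.
        if last_good is not None:
            suspects[test] = (last_good + 1, first_bad)
    return suspects
```

Because bots do not test every revision, the result is a range rather than a single commit; the gardener still has to pick the most likely culprit from that range.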


== Infrastructure/waterfall improvements ==
 * More EWS bots?
   * FIXME: Get the Mac EWS bots running the tests
 * Easier way to identify a commit that caused a test failure
 * Automatic notifications of build/test failure
   * The issue here is that when there were infrastructure problems, everyone got a notification
   * FIXME: We should bring back the automatic notifications from sheriffbot


== Distributing builds / testing (e.g. distcc) ==
 * Chromium does some distributed compilation, speeds up builds quite a lot
 * Chromium sometimes splits up tests - runs half on one machine and half on another
   * dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync
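As a rough illustration of splitting a test run across machines, the sketch below deals a sorted test list out round-robin into a fixed number of shards. It is an assumption-laden example, not how the Chromium bots actually divide their tests; `shard_tests` is a hypothetical helper, and it makes no attempt to balance shards by running time.

```python
def shard_tests(tests, shard_count):
    """Split tests into shard_count lists, one per machine.

    Sorting first makes the split deterministic, so every machine can
    compute the same partition independently and run only its own shard.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    ordered = sorted(tests)
    # Deal tests out round-robin: shard i gets items i, i + n, i + 2n, ...
    return [ordered[i::shard_count] for i in range(shard_count)]
```

Each machine would run `shard_tests(all_tests, 2)[machine_index]`; collecting the results afterwards is the part that needs the bots to stay in sync.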


== Pixel tests ==
 * Could try to convert more to ref tests
 * Garden-O-Matic will help with this
 * Are pixel tests worth it?
   * Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
   * Neither Chromium nor Apple has run a cost / benefit analysis of them
   * One gardener said that about 60% of pixel test failures he saw when gardening have pointed out real bugs
   * It is difficult for the gardener to tell with a lot of the tests whether a change is actually a regression
     * This results in a lot of skipping

== Discussion of the No Commits Until It Is Green policy ==
 * Would need to be actively enforced by tools
 * Chromium has this policy today, but:
   * there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
   * they have been able to isolate machine-related flakiness
   * they have tryjobs and unit tests, all of which are required to run before checking in
 * Probably not feasible yet for WebKit anyway due to tooling
   * Need a way to grab new baselines for all ports
   * Need EWS support for tests for all ports


== Discussion of causes of flaky tests ==
 1. Lots of tests don't work well in parallel due to reliance on system components that get overloaded
   * dpranke may have something for this soon
 2. State left over from tests that ran earlier
   * e.g. a setTimeout callback that fires during the next test
   * Seems like we could improve our tools for that
 3. Memory corruption
 * dpranke is working on having NRWT restart DRT on directory boundaries, found that it does reduce flakiness
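The idea of restarting the test driver on directory boundaries can be sketched as below. This is not NRWT's implementation: `batches_by_directory`, `run_with_restarts`, and the `start_driver` callable (assumed to return an object with `run_test` and `stop` methods) are all hypothetical names used for illustration.

```python
import itertools
import posixpath


def batches_by_directory(tests):
    """Group test paths into (directory, [tests]) batches, sorted by directory."""
    ordered = sorted(tests, key=lambda t: (posixpath.dirname(t), t))
    return [(directory, list(group))
            for directory, group in itertools.groupby(ordered, key=posixpath.dirname)]


def run_with_restarts(tests, start_driver):
    """Run every test, starting a fresh driver for each directory.

    start_driver is a hypothetical callable returning an object with
    run_test(test) and stop() methods.
    """
    results = {}
    for _directory, batch in batches_by_directory(tests):
        # A new process per directory limits how far leaked state can travel.
        driver = start_driver()
        try:
            for test in batch:
                results[test] = driver.run_test(test)
        finally:
            driver.stop()
    return results
```

State leaked by one test (a pending timer, corrupted memory) can then only affect later tests in the same directory, which is the trade-off between isolation and the cost of restarting the driver for every test.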


== Workflow improvements ==
 * More emphasis on running tests before committing
 * Easier to generate platform-specific test results
 * Standard process
   * If a test is failing, there should be one standard way of dealing with it
   * We don't currently have a shared understanding of the best practice (do you roll it out? skip? land failing results? how long do you wait after notifying the committer?)

== Skipped tests are technical debt ==
 * More emphasis on unskipping?