== Things that cause bots to go red ==
* Commits that break the build
* Commits that break tests
* Flaky tests
* Script bugs
* Flaky machines, OS bugs
* Kernel panics, overloaded system components
* Limited disk space on build.webkit.org (currently being addressed)
== Improve build state feedback for committers ==
* The build/test cycle is too damn long (a 4 or 5 hour delay before you can see if you broke anything)
  * Tester bots get behind
  * Testers don't test the latest build
* The waterfall page doesn't show enough info
  * It needs a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
* If the bot is already red, it is hard to tell if your change made it worse
  * Perhaps we should add (back?) colors for the degree of breakage, e.g. orange means only a few tests are failing, red means lots of tests are failing (a rough sketch of this idea follows the list)
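A minimal sketch of that coloring idea, assuming the degree of breakage is just a failing-test count; breakage_color and the threshold of 10 are placeholders for illustration, not agreed-on values:

{{{#!python
def breakage_color(failing_test_count, few_failures_threshold=10):
    """Map a bot's failing-test count to a waterfall color."""
    if failing_test_count == 0:
        return "green"
    if failing_test_count <= few_failures_threshold:
        return "orange"  # only a few tests are failing
    return "red"         # lots of tests are failing

assert breakage_color(0) == "green"
assert breakage_color(3) == "orange"
assert breakage_color(250) == "red"
}}}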
== Gardening approaches ==
* Apple's
  * 2 people watching the bots concurrently every week (just during regular work hours)
  * Builds hover close to green, but still go red multiple times a day
* Chromium's: 3 sets of bots
  * chromium.org - ToT Chromium with a fixed WebKit
    * Very strict green policy, aggressively rolling out changes that make the bots red
    * Mostly green with spots of red
  * build.webkit.org - ToT WebKit with a fixed version of Chromium
    * Ad hoc maintenance
    * Red most of the time
  * Canaries - ToT Chromium with ToT WebKit
    * Rotating shifts of 1 person for 4 days at a time, giving around-the-clock coverage
    * Most of the time is spent updating baselines, because the Chromium bots run the pixel tests
    * Also red most of the time
    * This is what the gardeners look at first
* Qt has Ossy, leading a group of 6 people
  * They have found that it takes experience to identify which checkins caused which failures
* GTK
  * Makes use of TestFailures + build.webkit.org/waterfall
  * For a while they had one person per day gardening, but right now it is up to individual contributors
== Can we share more tools? ==
* Ideally we should all use the same status page to determine if something broke
* Darin's ideal tool:
  * Identifies when something is broken that wasn't broken before
  * Presents the suspect set of revisions
  * Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
* Garden-O-Matic
  * Built on top of code in webkitpy in the WebKit tree, designed to be usable by any port
  * Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
    * FIXME: bugs for these?
  * Client-side tool that runs a local webserver
  * Lets you browse failing tests, with one click to get the rebaseline applied to your local tree
* We should sit down and merge the changes made to Buildbot for build.chromium.org and build.webkit.org
  * Full Buildbot is not checked into WebKit, only some configuration files
  * For build.webkit.org, changing the config files automatically restarts the master
  * The Chromium infrastructure team tracks Buildbot ToT pretty closely
  * On build.webkit.org you can go back 25, 50, 100, etc. on the waterfall views
  * FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
* Qt is working on a tool to run just the tests relevant to your change, based on code coverage (a sketch of the idea follows this list)
  * It still doesn't deal well with flaky tests
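The coverage-based selection Qt is working on could look roughly like this; relevant_tests and the toy coverage map below are made up for illustration, not Qt's actual tool. The idea is to build a map from source files to the tests whose runs exercised them, then run only the tests touched by the changed files:

{{{#!python
def relevant_tests(changed_files, coverage_map):
    """coverage_map maps a source file to the set of tests that exercise it."""
    tests = set()
    for path in changed_files:
        tests |= coverage_map.get(path, set())
    return sorted(tests)

# Toy data; real coverage maps would come from instrumented test runs.
coverage_map = {
    "Source/WebCore/rendering/RenderTable.cpp": {
        "fast/table/border-collapsing.html",
        "tables/mozilla/core/row_span.html",
    },
    "Source/WebCore/dom/Document.cpp": {"fast/dom/title-text.html"},
}

print(relevant_tests(["Source/WebCore/rendering/RenderTable.cpp"], coverage_map))
}}}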
== Infrastructure/waterfall improvements ==
* More EWS bots?
  * FIXME: Get the Mac EWS bots running the tests
* An easier way to identify the commit that caused a test failure
* Automatic notifications of build/test failures (see the sketch after this list)
  * The issue here is that when there are infrastructure problems, everyone gets notified
  * FIXME: We should bring back the automatic notifications from sheriffbot
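A hedged sketch of what those automatic notifications could do, including keeping quiet toward committers during infrastructure problems; none of this is sheriffbot's real code, and the reasons, addresses, and revision numbers are hypothetical:

{{{#!python
INFRASTRUCTURE_REASONS = {"kernel panic", "out of disk space", "slave lost"}

def notify_on_red(reason, suspect_revisions, send_mail):
    """suspect_revisions is a list of (revision, committer_email) pairs."""
    if reason in INFRASTRUCTURE_REASONS:
        # No particular commit is implicated; don't spam every committer.
        send_mail(["bot-maintainers@example.org"],
                  "Infrastructure problem: %s" % reason)
        return
    if not suspect_revisions:
        return
    committers = sorted({email for _, email in suspect_revisions})
    send_mail(committers,
              "The bots went red (%s); one of r%d-r%d is suspect"
              % (reason, suspect_revisions[0][0], suspect_revisions[-1][0]))

# Example usage with a stub mailer:
notify_on_red("tests failed",
              [(95001, "alice@example.org"), (95002, "bob@example.org")],
              lambda to, subject: print(to, subject))
}}}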
== Distributing builds / testing (e.g. distcc) ==
* Chromium does some distributed compilation, which speeds up builds quite a lot
* Chromium sometimes splits up tests - runs half on one machine and half on another (a sketch follows this list)
* dpranke looked at doing a master / gather setup, but it might not be worth the effort needed to keep the bots in sync
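A minimal sketch of the test-splitting idea: each of N machines takes every Nth test from a stable ordering, so the shards are disjoint and together cover the whole suite. Purely illustrative:

{{{#!python
def shard(tests, shard_index, total_shards):
    """Return the slice of the sorted test list this machine should run."""
    return sorted(tests)[shard_index::total_shards]

tests = ["fast/dom/a.html", "fast/js/b.html", "fast/css/c.html", "fast/table/d.html"]
half_one = shard(tests, 0, 2)
half_two = shard(tests, 1, 2)
assert sorted(half_one + half_two) == sorted(tests)
}}}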
== Pixel tests ==
* Could try to convert more of them to ref tests (the sketch after this list illustrates the difference)
  * Garden-O-Matic will help with this
* Are pixel tests worth it?
  * Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
  * Neither Chromium nor Apple has run a cost / benefit analysis of them
  * One gardener said that about 60% of the pixel test failures he saw while gardening pointed out real bugs
  * It is difficult for the gardener to tell, for a lot of the tests, whether a change is actually a regression
    * This results in a lot of skipping
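A minimal sketch of the pixel test / ref test difference; render_page and read_baseline are hypothetical stand-ins for DumpRenderTree's pixel output and the checked-in image baselines, not real APIs:

{{{#!python
def passes_pixel_test(test_path, render_page, read_baseline):
    # Pixel test: compare the rendering against a stored per-platform image
    # that a gardener must update whenever rendering changes.
    return render_page(test_path) == read_baseline(test_path)

def passes_ref_test(test_path, render_page):
    # Ref test: render the test and its -expected.html reference with the
    # same engine on the same machine and compare the two renderings.
    reference = test_path.replace(".html", "-expected.html")
    return render_page(test_path) == render_page(reference)
}}}

The maintenance win is that a ref test has no per-platform image to regenerate: only a real divergence between the test and its reference makes it fail.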
== Discussion of the No Commits Until It Is Green policy ==
* Would need to be actively enforced by tools
* Chromium has this policy today, but:
  * there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
  * they have been able to isolate machine-related flakiness
  * they have tryjobs and unit tests, all required to be run before checking in
* Probably not feasible yet for WebKit anyway, due to tooling:
  * Need a way to grab new baselines for all ports
  * Need EWS support for tests for all ports
== Discussion of causes of flaky tests ==
1. Lots of tests don't work well in parallel, due to reliance on system components that get overloaded
  * dpranke may have something for this soon
2. Interference from tests that ran earlier
  * e.g. a setTimeout that gets applied to the next test
  * Seems like we could improve our tools for that
3. Memory corruption
  * dpranke is working on having NRWT restart DRT on directory boundaries, and has found that it does reduce flakiness (sketched after this list)
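A hedged sketch of that directory-boundary restart idea, not webkitpy's actual code; DRT_COMMAND is a placeholder command line, and the point is only that memory corruption from one directory's tests can't bleed into the next:

{{{#!python
import os
import subprocess

DRT_COMMAND = ["DumpRenderTree", "-"]  # placeholder, not the real invocation

def run_with_directory_restarts(tests):
    current_dir = None
    drt = None
    for test in sorted(tests):
        test_dir = os.path.dirname(test)
        if test_dir != current_dir:
            if drt is not None:
                drt.stdin.close()
                drt.wait()  # retire the old process at the boundary
            drt = subprocess.Popen(DRT_COMMAND, stdin=subprocess.PIPE)
            current_dir = test_dir
        drt.stdin.write((test + "\n").encode())
        drt.stdin.flush()
    if drt is not None:
        drt.stdin.close()
        drt.wait()
}}}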
== Workflow improvements ==
* More emphasis on running tests before committing
* Make it easier to generate platform-specific test results
* Standard process
  * If a test is failing, there should be one standard way of dealing with it
  * We don't currently have a good understanding of what the best practice is (do you roll it out? skip it? land failing results? how long do you wait after notifying the committer?)
== Skipped tests are technical debt ==
* More emphasis on unskipping?