Changes between Version 1 and Version 2 of April 2012 Keeping the bots green


Timestamp: Apr 26, 2012, 7:08:15 AM
Author: aroben@webkit.org
Comment: Fix indentation, add real headings

= Keeping the bots green =

== Things that cause bots to go red ==
 * Commits that break the build
 * Commits that break tests
 * Flaky tests
 * Script bugs
 * Flaky machines, OS bugs
   * Kernel panics, overloaded system components
   * Limited disk space on build.webkit.org (currently being addressed)

== Improve build state feedback for committers ==
 * Build/test cycle is too damn long (a 4 or 5 hour delay to see if you broke anything)
 * Tester bots get behind
 * Testers don't test the latest build
 * Waterfall page doesn’t show enough info
   * Need a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
 * If the bot is already red, it is hard to tell if your change made it worse
   * Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing); see the sketch after this list
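
A minimal sketch of the degree-of-breakage coloring idea; the thresholds and the function name are made up for illustration:

{{{#!python
def breakage_color(failing_tests, few_threshold=5):
    """Map a bot's failing-test count to a waterfall color.

    The threshold is invented for illustration; a real patch would need
    agreement on what counts as "a few" failures.
    """
    if failing_tests == 0:
        return "green"
    if failing_tests <= few_threshold:
        return "orange"   # only a few tests failing
    return "red"          # lots of tests failing
}}}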

== Gardening approaches ==
 * Apple’s
   * 2 people watching the bots concurrently every week (just during regular work hours)
   * Builds hovering close to green, but still go red multiple times a day
 * Chromium’s
   * 3 sets of bots
     * chromium.org - ToT Chromium with a fixed WebKit
       * Very strict green policy, aggressively rolling out changes that make the bots red
       * Mostly green with spots of red
     * build.webkit.org - ToT WebKit with a fixed version of Chromium
       * Ad hoc maintenance
       * Red most of the time
     * Canaries - ToT Chromium with ToT WebKit
       * Rotating shifts of 1 person for 4 days at a time
       * Around-the-clock coverage
       * Most of the time is spent updating baselines because the Chromium bots run the pixel tests
       * Also red most of the time
       * This is what the gardeners look at first
 * Qt has Ossy
   * Leading a group of 6 people
   * They have found that it requires experience to identify which check-ins caused which failures
 * GTK
   * Makes use of TestFailures + build.webkit.org/waterfall
   * For a while they had one person per day gardening, but right now it is up to individual contributors

== Can we share more tools? ==
 * Ideally we should all use the same status page to determine if something broke
 * Darin's ideal tool
   * Identifies when something is broken that wasn't broken before
   * Presents the suspect set of revisions
   * Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
 * Garden-O-Matic
   * Built on top of code in webkitpy in the WebKit tree, designed to be used by any port
     * Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
     * FIXME: bugs for these?
   * Client-side tool, runs a local webserver
   * Lets you browse failing tests, with one click to get the rebaseline applied to your local tree
 * We should sit down and merge the changes made to buildbot for build.chromium.org and build.webkit.org
   * Full Buildbot is not checked into WebKit, only some configuration files
     * For build.webkit.org, changing config files automatically restarts the master
     * The Chromium infrastructure team tracks Buildbot ToT pretty closely
   * For build.webkit.org, you can go back 25, 50, 100, etc. on the waterfall views
   * FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added to build.webkit.org?
 * Qt is working on a tool to run just the tests relevant to your change, based on code coverage (see the sketch after this list)
   * Still doesn't deal with flaky tests well
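
A minimal sketch of coverage-based test selection along the lines of the Qt tool above; the coverage-map file and its format are assumptions, not an existing WebKit artifact:

{{{#!python
import json

def tests_relevant_to_change(changed_files, coverage_map_path="coverage-map.json"):
    """Return the tests whose coverage touches any changed file.

    coverage-map.json is assumed to map each source file to the list of
    layout tests that executed code in it during an instrumented run.
    """
    with open(coverage_map_path) as f:
        source_to_tests = json.load(f)
    relevant = set()
    for path in changed_files:
        relevant.update(source_to_tests.get(path, []))
    return sorted(relevant)

# Example: run only the tests that exercise the files touched by a patch.
# print(tests_relevant_to_change(["Source/WebCore/dom/Element.cpp"]))
}}}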

== Infrastructure/waterfall improvements ==
 * More EWS bots?
   * FIXME: Get the Mac EWS bots running the tests
 * Easier way to identify a commit that caused a test failure (see the sketch after this list)
 * Automatic notifications of build/test failure
   * The issue here is that when there were infrastructure problems, everyone got a notification
   * FIXME: We should bring back the automatic notifications from sheriffbot
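
A minimal sketch of narrowing a breakage down to a suspect revision range from build history; the input format is invented for illustration:

{{{#!python
def suspect_range(builds):
    """builds: list of (revision, passed) tuples in increasing revision order.

    Returns (last_green_revision, first_red_revision); every commit that
    landed in that window is a suspect for the breakage.
    """
    last_green = None
    for revision, passed in builds:
        if passed:
            last_green = revision
        else:
            return (last_green, revision)
    return None  # no red build; nothing to blame

# Example: everything after r103 up through r105 is suspect.
# print(suspect_range([(101, True), (103, True), (105, False)]))  # (103, 105)
}}}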

== Distributing builds / testing (e.g. distcc) ==
 * Chromium does some distributed compilation, which speeds up builds quite a lot
 * Chromium sometimes splits up tests - runs half on one machine and half on another (see the sketch after this list)
   * dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync
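
A minimal sketch of the test-splitting approach; the round-robin policy and interface are illustrative, not Chromium's actual sharding code:

{{{#!python
def shard_tests(tests, shard_index, total_shards):
    """Deterministically assign tests to shards by round-robin."""
    return [t for i, t in enumerate(tests) if i % total_shards == shard_index]

# Example: two machines each take half of the suite.
tests = ["fast/dom/a.html", "fast/dom/b.html", "fast/css/c.html", "fast/css/d.html"]
machine0 = shard_tests(tests, 0, 2)  # a.html, c.html
machine1 = shard_tests(tests, 1, 2)  # b.html, d.html
}}}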

== Pixel tests ==
 * Could try to convert more to ref tests
 * Garden-O-Matic will help with this
 * Are pixel tests worth it?
   * Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
   * Neither Chromium nor Apple has run a cost/benefit analysis of them
   * One gardener said that about 60% of the pixel test failures he saw while gardening pointed out real bugs
   * It is difficult for the gardener to tell, for a lot of the tests, whether a change is actually a regression (see the fuzzy-comparison sketch after this list)
     * This results in a lot of skipping
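
A minimal sketch of why pixel comparisons are hard to judge: an exact compare flags every antialiasing or font change, so a tolerance is needed. This is only an illustration over raw RGBA buffers, not WebKit's ImageDiff:

{{{#!python
def pixel_mismatch_percent(actual, expected, per_channel_tolerance=0):
    """Compare two same-sized RGBA byte buffers; return % of differing pixels."""
    assert actual and len(actual) == len(expected) and len(actual) % 4 == 0
    differing = 0
    total = len(actual) // 4
    for i in range(0, len(actual), 4):
        if any(abs(actual[i + c] - expected[i + c]) > per_channel_tolerance
               for c in range(4)):
            differing += 1
    return 100.0 * differing / total

# With tolerance 0, a one-bit rendering difference fails the test; a small
# tolerance absorbs platform noise at the cost of masking tiny regressions.
}}}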

== Discussion of the No Commits Until It Is Green policy ==
 * Would need to be actively enforced by tools (see the sketch after this list)
 * Chromium has this policy today, but:
   * there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
   * they have been able to isolate machine-related flakiness
   * they have try jobs and unit tests, all required to be run before checking in
 * Probably not feasible yet for WebKit anyway due to tooling
   * Need a way to grab new baselines for all ports
   * Need EWS support for tests for all ports
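
A minimal sketch of tool enforcement: a presubmit hook that refuses to land while the tree is closed. The status URL and response format are hypothetical, not an existing WebKit service:

{{{#!python
import json
import urllib.request

STATUS_URL = "https://example.org/tree-status.json"  # hypothetical endpoint

def tree_is_open():
    """Return True if the status service says the tree is green/open."""
    with urllib.request.urlopen(STATUS_URL) as response:
        status = json.load(response)
    return status.get("state") == "open"

def presubmit_check():
    if not tree_is_open():
        raise SystemExit("Tree is red/closed - fix or roll out the breakage "
                         "before landing new changes.")
}}}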

== Discussion of causes of flaky tests ==
 1. Lots of tests don't work well in parallel due to reliance on system components that get overloaded
   * dpranke may have something for this soon
 2. State leaking from tests that ran before
   * e.g. a setTimeout that gets applied to the next test
   * Seems like we could improve our tools for that
 3. Memory corruption
 * dpranke is working on having NRWT restart DRT on directory boundaries, and found that it does reduce flakiness (see the sketch after this list)
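
A minimal sketch of the restart-on-directory-boundary idea; the driver command line and protocol are simplified placeholders for how NRWT actually drives DRT:

{{{#!python
import os
import subprocess

def run_tests_restarting_on_directory_change(tests, driver_cmd=["DumpRenderTree", "-"]):
    """Feed tests to the driver, respawning it whenever the directory changes,
    so memory corruption and leaked state don't bleed across directories."""
    driver = None
    current_dir = None
    for test in tests:
        test_dir = os.path.dirname(test)
        if test_dir != current_dir:
            if driver:
                driver.stdin.close()
                driver.wait()  # retire the old driver at the boundary
            driver = subprocess.Popen(driver_cmd, stdin=subprocess.PIPE, text=True)
            current_dir = test_dir
        driver.stdin.write(test + "\n")  # hand the next test to the driver
        driver.stdin.flush()
    if driver:
        driver.stdin.close()
        driver.wait()
}}}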

== Workflow improvements ==
 * More emphasis on running tests before committing
 * Make it easier to generate platform-specific test results
 * Standard process
   * If a test is failing, there should be one standard way of dealing with it
   * We don't currently have a good understanding of what the best practice is (do you roll the change out? skip the test? land failing results? how long do you wait after notifying the committer?)

== Skipped tests are technical debt ==
 * More emphasis on unskipping?