Keeping the bots green (April 2012)
-----------------------------------

Things that cause bots to go red
        Commits that break the build
        Commits that break tests
        Flaky tests
        Script bugs
        Flaky machines, OS bugs
                Kernel panics, overloaded system components
                Limited disk space on build.webkit.org (currently being addressed)


Improve build state feedback for committers
        The build/test cycle is too damn long (a 4 to 5 hour delay before you can see whether you broke anything)
        Tester bots get behind
        Testers don't test the latest build
        The waterfall page doesn't show enough information
                It needs a lot of usability improvements to make it easier for developers to use the waterfall/console to determine whether they broke things
        If the bot is already red, it is hard to tell whether your change made it worse
                Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing); a rough sketch of such a mapping follows below

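As a rough, hypothetical sketch (in Python) of what a degree-of-breakage coloring rule could look like: the thresholds and the BotResult fields below are invented for illustration and are not part of the buildbot configuration.

        # Hypothetical sketch only: map a bot's latest result to a waterfall
        # color by degree of breakage. Thresholds and fields are made up.
        from dataclasses import dataclass

        @dataclass
        class BotResult:
            build_broken: bool    # did the compile step fail?
            failing_tests: int    # number of tests currently failing

        def cell_color(result: BotResult, few_failures: int = 5) -> str:
            if result.build_broken:
                return "red"       # the build itself is broken
            if result.failing_tests == 0:
                return "green"     # everything passes
            if result.failing_tests <= few_failures:
                return "orange"    # only a few tests failing
            return "red"           # lots of tests failing

        print(cell_color(BotResult(build_broken=False, failing_tests=3)))  # orange

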
Gardening approaches
        Apple's
                2 people watch the bots concurrently every week (just during regular work hours)
                Builds hover close to green, but still go red multiple times a day
        Chromium's
                3 sets of bots
                        build.chromium.org - ToT Chromium with a fixed WebKit
                                Very strict green policy, aggressively rolling out changes that make the bots red
                                Mostly green with spots of red
                        build.webkit.org - ToT WebKit with a fixed version of Chromium
                                Ad hoc maintenance
                                Red most of the time
                        Canaries - ToT Chromium with ToT WebKit
                                Rotating shifts of 1 person for 4 days at a time
                                Around-the-clock coverage
                                Most of the time is spent updating baselines because the Chromium bots run the pixel tests
                                Also red most of the time
                                This is what the gardeners look at first
        Qt has Ossy
                He leads a group of 6 people
                They have found that it requires experience to identify which checkins caused which failures
        GTK
                Makes use of TestFailures + build.webkit.org/waterfall
                For a while they had one person per day gardening, but right now it is up to individual contributors


Can we share more tools?
        Ideally we should all use the same status page to determine whether something broke
        Darin's ideal tool (sketched at the end of this section)
                Identifies when something is broken that wasn't broken before
                Presents the suspect set of revisions
                Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
        Garden-O-Matic
                Built on top of webkitpy code in the WebKit tree, designed to be usable by any port
                        Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by other ports
                        FIXME: file bugs for these?
                Client-side tool that runs a local webserver
                Lets you browse failing tests and apply a rebaseline to your local tree with one click
        We should sit down and merge the changes made to buildbot for build.chromium.org and build.webkit.org
                Full Buildbot is not checked into WebKit, only some configuration files
                        For build.webkit.org, changing the config files automatically restarts the master
                        The Chromium infrastructure team tracks Buildbot ToT pretty closely
                On build.webkit.org you can go back 25, 50, 100, etc. builds on the waterfall views
                FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added to build.webkit.org?
        Qt is working on a tool to run just the tests relevant to your change, based on code coverage
                Still doesn't deal well with flaky tests

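As a rough illustration of the workflow Darin described (this is not an existing WebKit tool), the Python sketch below diffs the failing tests of two builds, derives the suspect revision range, and "notifies" by printing; the data structures, revision numbers, and committer addresses are all invented.

        # Illustrative sketch of the ideal gardening workflow: spot tests that
        # are newly broken, list the suspect revisions, and offer a quick way
        # to notify the people involved.
        def new_failures(previous_failing, current_failing):
            """Tests that fail now but did not fail in the previous build."""
            return sorted(set(current_failing) - set(previous_failing))

        def suspect_revisions(last_green_revision, first_red_revision):
            """Every revision landed between the last green build and the first red one."""
            return list(range(last_green_revision + 1, first_red_revision + 1))

        def notify(test, revisions, committers_by_revision):
            # Stand-in for emailing or filing a bug: just print who to contact.
            for revision in revisions:
                committer = committers_by_revision.get(revision, "unknown")
                print("r%d (%s) may have broken %s" % (revision, committer, test))

        previous_failing = ["fast/js/old-flake.html"]
        current_failing = ["fast/js/old-flake.html", "fast/dom/new-failure.html"]
        committers = {101: "alice@webkit.org", 102: "bob@webkit.org"}

        for test in new_failures(previous_failing, current_failing):
            notify(test, suspect_revisions(100, 102), committers)

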
Infrastructure/waterfall improvements
        More EWS bots?
                FIXME: Get the Mac EWS bots running the tests
        Easier way to identify the commit that caused a test failure
        Automatic notifications of build/test failures
                The issue here is that when there are infrastructure problems, everything triggers a notification
                FIXME: We should bring back the automatic notifications from sheriffbot


Distributing builds / testing (e.g. distcc)
        Chromium does some distributed compilation, which speeds up builds quite a lot
        Chromium sometimes splits up tests - runs half on one machine and half on another (a sketch of such sharding follows below)
                dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync

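The sketch below shows one simple way a test list could be split across machines by hashing test names into shards; it is illustrative only and is not the sharding code that Chromium or NRWT actually uses.

        # Illustrative sketch of splitting tests across machines. Hashing the
        # test name keeps the assignment stable as tests are added or removed.
        import hashlib

        def shard_tests(tests, shard_index, total_shards):
            """Return the subset of tests that the given shard should run."""
            def shard_of(test_name):
                digest = hashlib.md5(test_name.encode("utf-8")).hexdigest()
                return int(digest, 16) % total_shards
            return [t for t in tests if shard_of(t) == shard_index]

        tests = ["fast/dom/a.html", "fast/js/b.html", "css2.1/c.html", "svg/d.html"]
        print(shard_tests(tests, shard_index=0, total_shards=2))
        print(shard_tests(tests, shard_index=1, total_shards=2))

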
Pixel tests
        Could try to convert more of them to ref tests
        Garden-O-Matic will help with this
        Are pixel tests worth it?
                Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
                Neither Chromium nor Apple has run a cost/benefit analysis of them
                One gardener said that about 60% of the pixel test failures he saw while gardening pointed to real bugs
                For a lot of the tests it is difficult for the gardener to tell whether a change is actually a regression
                        This results in a lot of skipping

Discussion of the No Commits Until It Is Green policy
        Would need to be actively enforced by tools
        Chromium has this policy today, but
                there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
                they have been able to isolate machine-related flakiness
                they have try jobs and unit tests, all required to be run before checking in
        Probably not feasible for WebKit yet due to tooling
                Need a way to grab new baselines for all ports
                Need EWS support for running the tests for all ports


Discussion of causes of flaky tests
        1. Lots of tests don't work well in parallel because they rely on system components that get overloaded
                dpranke may have something for this soon
        2. State leaking from tests that ran earlier
                e.g. a setTimeout from one test that fires during the next test
                Seems like we could improve our tools for that
        3. Memory corruption
        dpranke is working on having NRWT restart DRT on directory boundaries, and has found that it does reduce flakiness (see the sketch below)

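A simplified sketch of the restart-on-directory-boundary idea is below; it is not the actual NRWT implementation, and the Driver class is a stand-in for the code that drives a DumpRenderTree process.

        # Illustrative sketch: restart the test driver whenever the run crosses
        # a directory boundary, so leaked state cannot spill across directories.
        import os

        class Driver(object):
            """Stand-in for a wrapper around a DumpRenderTree (DRT) process."""
            def start(self):
                print("starting DRT")
            def stop(self):
                print("stopping DRT")
            def run_test(self, test):
                print("running", test)

        def run_tests(tests):
            driver = Driver()
            driver.start()
            current_dir = None
            for test in tests:
                test_dir = os.path.dirname(test)
                if current_dir is not None and test_dir != current_dir:
                    # Crossing a directory boundary: restart the driver so that
                    # stray timers, caches, or memory corruption from earlier
                    # tests cannot affect the next directory's tests.
                    driver.stop()
                    driver.start()
                current_dir = test_dir
                driver.run_test(test)
            driver.stop()

        run_tests(["fast/js/a.html", "fast/js/b.html", "fast/dom/c.html"])

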
Workflow improvements
        More emphasis on running tests before committing
        Make it easier to generate platform-specific test results
        Standard process
                If a test is failing, there should be one standard way of dealing with it
                We don't currently have a good understanding of what the best practice is (do you roll the change out? skip the test? land failing results? how long do you wait after notifying the committer?)


Skipped tests are technical debt
        More emphasis on unskipping?