Keeping the bots green (April 2012)

Things that cause bots to go red

  • Commits that break the build
  • Commits that break tests
  • Flakey tests
  • Script bugs
  • Flakey machines, OS bugs
    • Kernel panics, overloaded system components
    • Limited disk space on build.webkit.org (currently being addressed)

Improve build state feedback for committers

  • Build/test cycle is too damn long (4 or 5 hour delay to see if you broke anything)
  • Tester bots get behind
  • Testers don't test the latest build
  • Waterfall page doesn’t show enough info
    • Need a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
  • If the bot is already red, it is hard to tell if your change made it worse
    • Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing); see the sketch below
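
A rough sketch of what degree-of-breakage coloring could look like. The thresholds and the failing_tests/total_tests inputs are made up for illustration; a real implementation would read them from the build results.

    def breakage_color(failing_tests, total_tests):
        """Map a failing-test count to a bot color (thresholds are illustrative)."""
        if failing_tests == 0:
            return "green"
        # Hypothetical cutoff: a handful of failures shows as orange, anything worse as red.
        if failing_tests <= 5 or failing_tests < 0.01 * total_tests:
            return "orange"
        return "red"

    print(breakage_color(3, 30000))    # orange - only a few tests failing
    print(breakage_color(400, 30000))  # red - lots of tests failing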

Gardening approaches

  • Apple’s
    • 2 people watching the bots concurrently every week (just during regular work hours)
    • Builds hovering close to green, but still go red multiple times a day
  • Chromium’s
    • 3 sets of bots
      • chromium.org - ToT Chromium with a fixed WebKit
        • Very strict green policy, aggressively rolling out changes that make the bots red
        • Mostly green with spots of red
      • build.webkit.org - ToT WebKit with a fixed version of Chromium
        • Ad hoc maintenance
        • Red most of the time
      • Canaries - ToT Chromium with ToT WebKit
        • Rotating shifts of 1 person for 4 days at a time
        • Around the clock coverage
        • Most of the time spent updating baselines because the Chromium bots run the pixel tests
        • Also red most of the time
        • This is what the gardeners look at first
  • Qt has Ossy
    • Leading a group of 6 people
    • They have found that it requires experience to identify which checkins caused which failures
  • GTK
    • Makes use of TestFailures + build.webkit.org/waterfall
    • For a while they had one person per day gardening, but right now it is up to individual contributors

Can we share more tools?

  • Ideally we should all use the same status page to determine if something broke
  • Darin's ideal tool
    • Identifies when something is broken that wasn't broken before
    • Presents the suspect set of revisions
    • Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people (a rough sketch of this workflow follows the list)
  • Garden-O-Matic
    • Built on top of code in webkitpy in the WebKit tree, designed to be used by any port
      • Currently only works for the Chromium port - needs adjustment for different result formats and URLs used by different ports
      • FIXME: bugs for these?
    • Client-side tool, runs a local webserver
    • Lets you browse failing tests and apply a rebaseline to your local tree with one click
  • We should sit down and merge changes made to buildbot for build.chromium.org and build.webkit.org
    • Full Buildbot is not checked into WebKit, only some configuration files
      • For build.webkit.org, changing config files automatically restarts master
      • The Chromium infrastructure team tracks Buildbot ToT pretty closely
    • For build.webkit.org, you can go back 25, 50, 100, etc. builds on the waterfall views
    • FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
  • Qt is working on a tool to run just the tests relevant to your change, based on code coverage
    • Still doesn't deal with flakey tests well
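
A rough sketch of the workflow Darin described above: every revision between the last green build and the first red build is suspect, and once the gardener picks one, a notification is drafted. The function names and message format here are hypothetical; a real tool would pull its data from the buildbot feeds.

    def suspect_revisions(last_green_revision, first_red_revision):
        """Everything landed after the last green build, up to the first red build, is suspect."""
        return list(range(last_green_revision + 1, first_red_revision + 1))

    def draft_notification(revision, committer_email, new_failures):
        """Compose the message a gardener would send once a suspect revision is chosen."""
        tests = "\n  ".join(sorted(new_failures))
        return ("To: %s\nSubject: r%d may have broken the bots\n\n"
                "These tests started failing in the range containing r%d:\n  %s\n"
                % (committer_email, revision, revision, tests))

    print(suspect_revisions(114200, 114205))
    print(draft_notification(114203, "committer@webkit.org", {"fast/js/foo.html"}))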

Infrastructure/waterfall improvements

  • More EWS bots?
    • FIXME: Get the Mac EWS bots running the tests
  • Easier way to identify a commit that caused a test failure
  • Automatic notifications of build/test failure
    • The issue here is that when there are infrastructure problems, everyone gets notified (see the filtering sketch after this list)
    • FIXME: We should bring back the automatic notifications from sheriffbot
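
One way to keep infrastructure problems from notifying everyone would be to only send mail when a build got far enough to run tests and report new failures. The build dictionary and field names below are invented for illustration, not the real buildbot schema.

    def should_notify(build):
        """Suppress failure mail when the failure looks like an infrastructure problem."""
        if build["result"] != "failure":
            return False
        # Failures in setup steps (lost slave, full disk, svn errors) are treated as
        # infrastructure noise for the gardener to handle, rather than mailing committers.
        if build["failed_step"] in {"update", "checkout", "clean-build-dir"}:
            return False
        return bool(build["new_failing_tests"])

    build = {"result": "failure", "failed_step": "layout-test",
             "new_failing_tests": ["fast/forms/select.html"]}
    print(should_notify(build))  # True - a real regression, worth a notification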

Distributing builds / testing (e.g. distcc)

  • Chromium does some distributed compilation, which speeds up builds quite a lot
  • Chromium sometimes splits up tests - runs half on one machine and half on another (see the sharding sketch after this list)
    • dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync
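
A minimal sketch of splitting a test run across machines as described above. The round-robin assignment is illustrative rather than NRWT's actual scheme, and a real setup still needs the gather step to merge results from each machine.

    def shard(tests, shard_index, total_shards):
        """Deterministically assign every Nth test to shard N (illustrative only)."""
        return [t for i, t in enumerate(sorted(tests)) if i % total_shards == shard_index]

    tests = ["fast/js/a.html", "fast/js/b.html", "fast/css/c.html", "fast/dom/d.html"]
    print(shard(tests, 0, 2))  # half of the tests run on machine 0
    print(shard(tests, 1, 2))  # the other half run on machine 1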

Pixel tests

  • Could try to convert more pixel tests to ref tests (see the sketch after this list)
  • Garden-O-Matic will help with this
  • Are pixel tests worth it?
    • Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
    • Neither Chromium nor Apple has run a cost / benefit analysis of them
    • One gardener said that about 60% of the pixel test failures he saw while gardening pointed to real bugs
    • For many of the tests, it is difficult for the gardener to tell whether a change is actually a regression
      • This results in a lot of skipping
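
The appeal of ref tests, mentioned at the top of this list, is that there is no per-port pixel baseline to maintain: the test and its reference page are rendered by the same build on the same machine and simply compared. A conceptual sketch, where render is a hypothetical stand-in for whatever the port's test driver produces:

    def ref_test_passes(render, test_path, reference_path):
        """A ref test passes when the test renders identically to its reference page."""
        return render(test_path) == render(reference_path)

    # Fake renderings stand in for real DumpRenderTree output; WebKit pairs a test
    # with a reference named *-expected.html.
    fake_renderings = {"fast/borders/rect.html": b"pixels-1",
                       "fast/borders/rect-expected.html": b"pixels-1"}
    print(ref_test_passes(fake_renderings.get,
                          "fast/borders/rect.html",
                          "fast/borders/rect-expected.html"))  # True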

Discussion of the No Commits Until It Is Green policy

  • Would need to be actively enforced by tools
  • Chromium has this policy today but
    • there aren't that many flakey tests (most are already marked as flakey in test_expectations.txt)
    • they have been able to isolate machine-related flakiness
    • they have tryjobs and unit tests, all of which are required to run before checking in
  • Probably not yet feasible for WebKit anyway, due to tooling
    • Need a way to grab new baselines for all ports
    • Need EWS support for tests for all ports

Discussion of causes of flakey tests

  1. Lots of tests don't work well in parallel due to reliance on system components that get overloaded
    • dpranke may have something for this soon
  2. Interference from tests that ran earlier
    • e.g. a setTimeout from one test that fires during the next test
    • Seems like we could improve our tools for that
  3. Memory corruption
    • dpranke is working on having NRWT restart DRT on directory boundaries, and found that it does reduce flakiness (see the sketch after this list)
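
A rough sketch of the DRT-restart idea in item 3: restart the driver whenever the test directory changes so that leaked state from earlier tests cannot poison later ones. The Driver class is a made-up stand-in, not NRWT's real driver interface.

    import os

    class Driver(object):
        """Made-up stand-in for NRWT's DumpRenderTree driver."""
        def start(self): print("starting DRT")
        def stop(self): print("stopping DRT")
        def run(self, test): print("running " + test)

    def run_tests_restarting_per_directory(tests):
        driver, current_dir = Driver(), None
        driver.start()
        for test in tests:
            test_dir = os.path.dirname(test)
            if current_dir is not None and test_dir != current_dir:
                # Directory boundary: restart so earlier tests can't affect later ones.
                driver.stop()
                driver.start()
            current_dir = test_dir
            driver.run(test)
        driver.stop()

    run_tests_restarting_per_directory(["fast/js/a.html", "fast/js/b.html", "fast/css/c.html"])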

Workflow improvements

  • More emphasis on running tests before committing
  • Easier to generate platform-specific test results
  • Standard process
    • If a test is failing, there should be one standard way of dealing with it
    • We don't currently have a good understanding of what is the best practice (do you roll it out? skip? land failing results? how long do you wait after notifying the committer?)

Skipped tests are technical debt

  • More emphasis on unskipping?