
Notes for the keeping the bots green session

Keeping the bots green


Things that cause bots to go red

- Commits that break the build
- Commits that break tests
- Flaky tests
- Script bugs
- Flaky machines, OS bugs
  - Kernel panics, overloaded system components
  - Limited disk space on build.webkit.org (currently being addressed)

Improve build state feedback for committers

- Build/test cycle is too damn long (4 or 5 hour delay to see if you broke anything)
- Tester bots get behind
- Testers don't test the latest build
- Waterfall page doesn't show enough info

Need a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things

If the bot is already red, it is hard to tell if your change made it worse

Perhaps we should add (back?) colors for the degree of breakage (e.g. orange means only a few tests are failing, red means lots of tests are failing)
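A minimal sketch of such a severity-to-color mapping; the threshold and color names here are invented for illustration, not an existing scheme:

    def breakage_color(num_failing_tests):
        if num_failing_tests == 0:
            return "green"
        if num_failing_tests <= 5:   # only a few tests failing
            return "orange"
        return "red"                 # lots of tests failing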

Gardening approaches

Apple’s

- 2 people watching the bots concurrently every week (just during regular work hours)
- Builds hover close to green, but still go red multiple times a day

Chromium’s

3 sets of bots:

- chromium.org - ToT Chromium with a fixed WebKit
  - Very strict green policy, aggressively rolling out changes that make the bots red
  - Mostly green with spots of red
- build.webkit.org - ToT WebKit with a fixed version of Chromium
  - Ad hoc maintenance
  - Red most of the time
- Canaries - ToT Chromium with ToT WebKit
  - Rotating shifts of 1 person for 4 days at a time
  - Around-the-clock coverage
  - Most of the time is spent updating baselines, because the Chromium bots run the pixel tests
  - Also red most of the time
  - This is what the gardeners look at first

Qt has Ossy

- Leads a group of 6 people
- They have found that it requires experience to identify which check-ins caused which failures

GTK

- Makes use of TestFailures + build.webkit.org/waterfall
- For a while they had one person per day gardening, but right now it is up to individual contributors

Can we share more tools?

Ideally we should all use the same status page to determine if something broke.

Darin's ideal tool

- Identifies when something is broken that wasn't broken before
- Presents the suspect set of revisions
- Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
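A minimal sketch of that workflow; TestRun and every field and helper name below are invented purely for illustration:

    from collections import namedtuple

    # Hypothetical data model: a test run at a known revision.
    TestRun = namedtuple("TestRun", ["revision", "failing_tests"])

    def new_failures(previous, current):
        """Tests failing now that passed on the previous run."""
        return sorted(set(current.failing_tests) - set(previous.failing_tests))

    def suspect_revisions(previous, current):
        """Every revision landed between the two runs is a suspect."""
        return list(range(previous.revision + 1, current.revision + 1))

    def draft_notification(test, revision, author):
        """Quick way to notify the likely culprit once identified."""
        return "To: %s\nSubject: r%d may have broken %s" % (author, revision, test)

    previous = TestRun(100, {"fast/css/old-failure.html"})
    current = TestRun(104, {"fast/css/old-failure.html", "fast/dom/new-failure.html"})
    print(new_failures(previous, current))       # ['fast/dom/new-failure.html']
    print(suspect_revisions(previous, current))  # [101, 102, 103, 104]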

Garden-O-Matic

Built on top of code in webkitpy in the WebKit tree, designed to be used by any port

- Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
- FIXME: bugs for these?

- Client-side tool, runs a local webserver
- Lets you browse failing tests, with one click to get the rebaseline applied to your local tree
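A minimal sketch of that client-side pattern (not Garden-O-Matic's actual code; the endpoints, port, and data source here are invented): a local webserver that a browser UI can poll for failing tests and poke to trigger a rebaseline.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Would really come from the bots' published results.
    FAILING_TESTS = ["fast/dom/example.html"]

    class GardeningHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/failures":
                body = json.dumps(FAILING_TESTS).encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

        def do_POST(self):
            # A real tool would fetch the bot's actual output for the test
            # and copy it into the local checkout as the new expected result.
            if self.path.startswith("/rebaseline"):
                self.send_response(200)
                self.end_headers()
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8127), GardeningHandler).serve_forever()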

We should sit down and merge changes made to buildbot for build.chromium.org and build.webkit.org

Full Buildbot is not checked into WebKit, only some configuration files

- For build.webkit.org, changing the config files automatically restarts the master
- The Chromium infrastructure team tracks Buildbot ToT pretty closely
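For reference, a hedged sketch of what such checked-in configuration looks like, assuming the 0.8-era Buildbot API; the builder and slave names are invented, while build-webkit and run-webkit-tests are the real WebKit scripts:

    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory
    from buildbot.steps.shell import ShellCommand

    f = BuildFactory()
    f.addStep(ShellCommand(name="compile",
                           command=["Tools/Scripts/build-webkit"]))
    f.addStep(ShellCommand(name="layout-test",
                           command=["Tools/Scripts/run-webkit-tests"]))

    c = BuildmasterConfig = {}
    c["builders"] = [BuilderConfig(name="Example Mac Release",
                                   slavenames=["example-bot-1"],
                                   factory=f)]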

- For build.webkit.org, you can go back 25, 50, 100, etc. on the waterfall views
- FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?

Qt is working on a tool to run just the tests relevant to your change, based on code coverage

Still doesn't deal with flaky tests well
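A sketch of the coverage-based selection idea, with an invented coverage-map format (a real tool would build the map from instrumented test runs):

    # coverage_map: {test name: set of source files the test executes}.
    def tests_relevant_to_change(changed_files, coverage_map):
        changed = set(changed_files)
        return sorted(test for test, covered in coverage_map.items()
                      if covered & changed)

    coverage_map = {
        "fast/dom/foo.html": {"Source/WebCore/dom/Document.cpp"},
        "fast/css/bar.html": {"Source/WebCore/css/StyleResolver.cpp"},
    }
    print(tests_relevant_to_change(["Source/WebCore/dom/Document.cpp"],
                                   coverage_map))  # ['fast/dom/foo.html']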

Infrastructure/waterfall improvements

More EWS bots?

FIXME: Get the Mac EWS bots running the tests

- Easier way to identify a commit that caused a test failure
- Automatic notifications of build/test failure

- The issue here is that when there are infrastructure problems, everything gets a notification
- FIXME: We should bring back the automatic notifications from sheriffbot
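A sketch of sheriffbot-style notification with a guard for exactly that problem (the threshold and helper names are invented): if most builders go red at once, blame the infrastructure rather than every committer.

    INFRA_FAILURE_RATIO = 0.5  # invented threshold

    def send_email(to, subject):
        print("To: %s\nSubject: %s" % (to, subject))  # stand-in for real mail

    def notify_failures(failing_builders, all_builders, suspect_authors, gardener):
        if len(failing_builders) > INFRA_FAILURE_RATIO * len(all_builders):
            # Widespread redness is probably an infrastructure problem;
            # notify one gardener instead of spamming every committer.
            send_email(gardener, "Possible infrastructure problem: %d/%d builders red"
                       % (len(failing_builders), len(all_builders)))
            return
        for author in suspect_authors:
            send_email(author, "Your recent commit may have broken: %s"
                       % ", ".join(failing_builders))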

Distributing builds / testing (e.g. distcc)

- Chromium does some distributed compilation, speeds up builds quite a lot
- Chromium sometimes splits up tests - runs half on one machine and half on another
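A minimal sketch of that kind of test splitting (the helper is invented; sorting first keeps each shard deterministic):

    def shard(tests, shard_index, total_shards):
        return [t for i, t in enumerate(sorted(tests))
                if i % total_shards == shard_index]

    tests = ["fast/a.html", "fast/b.html", "fast/c.html", "fast/d.html"]
    machine_one = shard(tests, 0, 2)  # runs half the tests
    machine_two = shard(tests, 1, 2)  # runs the other half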

dpranke looked at doing a master / gather, but it might not be worth the effort needed to keep the bots in sync

Pixel tests

- Could try to convert more to ref tests
- Garden-O-Matic will help with this
- Are pixel tests worth it?

- Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
- Neither Chromium nor Apple has run a cost / benefit analysis of them
- One gardener said that about 60% of the pixel test failures he saw when gardening pointed out real bugs
- It is difficult for the gardener to tell, for a lot of the tests, whether a change is actually a regression

This results in a lot of skipping

Discussion of the No Commits Until It Is Green policy

- Would need to be actively enforced by tools
- Chromium has this policy today, but:

  - there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt; rough syntax example below)
  - they have been able to isolate machine-related flakiness
  - they have try jobs and unit tests, all required to be run before checking in
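From memory, the old Chromium-style test_expectations.txt syntax looked roughly like the following (treat the exact syntax as approximate; both entries are invented). Listing more than one expected outcome is what marks a test as flaky:

    BUGWK12345 MAC : fast/dom/example-flaky.html = PASS TEXT
    BUGCR67890 WIN DEBUG : fast/js/example-timeout.html = TIMEOUT PASS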

Probably not feasible yet for WebKit anyway, due to tooling

- Need a way to grab new baselines for all ports
- Need EWS support for tests for all ports

Discussion of causes of flaky tests

  1. Lots of tests don't work well in parallel due to reliance on system components that get overloaded

dpranke may have something for this soon

  2. Interference from tests that ran before

- e.g. a setTimeout that gets applied to the next test
- Seems like we could improve our tools for that

  3. Memory corruption

- dpranke is working on having NRWT restart DRT on directory boundaries, and found that it does reduce flakiness
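A minimal sketch of that approach, with an invented start_driver interface standing in for the real NRWT driver:

    import os

    def run_tests(tests, start_driver):
        """Restart the DumpRenderTree process at each directory boundary so
        memory corruption cannot leak from one directory's tests into the next."""
        driver, current_dir = None, None
        for test in sorted(tests):
            test_dir = os.path.dirname(test)
            if test_dir != current_dir:      # crossed a directory boundary
                if driver:
                    driver.stop()
                driver = start_driver()      # fresh DRT process
                current_dir = test_dir
            driver.run_test(test)
        if driver:
            driver.stop()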

Workflow improvements

- More emphasis on running tests before committing
- Easier to generate platform-specific test results
- Standard process

- If a test is failing, there should be one standard way of dealing with it
- We don't currently have a good understanding of what the best practice is (do you roll it out? skip it? land failing results? how long do you wait after notifying the committer?)

Skipped tests are technical debt

More emphasis on unskipping?