== Things that cause bots to go red ==
* Commits that break the build
* Commits that break tests
* Flaky tests
* Script bugs
* Flaky machines, OS bugs
* Kernel panics, overloaded system components
* Limited disk space on build.webkit.org (currently being addressed)
== Improve build state feedback for committers ==
* The build/test cycle is too damn long (a 4 or 5 hour delay before you can see if you broke anything)
  * Tester bots get behind
  * Testers don't test the latest build
* The waterfall page doesn't show enough info
  * It needs a lot of usability improvements to make it easier for developers to use the waterfall/console to determine if they broke things
* If the bot is already red, it is hard to tell if your change made it worse
  * Perhaps we should add (back?) colors for the degree of breakage, e.g. orange means only a few tests are failing, red means lots of tests are failing (a rough sketch of this idea follows the list)
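A minimal sketch of that coloring idea, assuming the degree of breakage is just a failing-test count; breakage_color and the threshold of 10 are placeholders for illustration, not agreed-on values:

{{{#!python
def breakage_color(failing_test_count, few_failures_threshold=10):
    """Map a bot's failing-test count to a waterfall color."""
    if failing_test_count == 0:
        return "green"
    if failing_test_count <= few_failures_threshold:
        return "orange"  # only a few tests are failing
    return "red"         # lots of tests are failing

assert breakage_color(0) == "green"
assert breakage_color(3) == "orange"
assert breakage_color(250) == "red"
}}}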
== Gardening approaches ==
* Apple's
  * 2 people watching the bots concurrently every week (just during regular work hours)
  * Builds hover close to green, but still go red multiple times a day
* Chromium's: 3 sets of bots
  * chromium.org - ToT Chromium with a fixed WebKit
    * Very strict green policy, aggressively rolling out changes that make the bots red
    * Mostly green with spots of red
  * build.webkit.org - ToT WebKit with a fixed version of Chromium
    * Ad hoc maintenance
    * Red most of the time
  * Canaries - ToT Chromium with ToT WebKit
    * Rotating shifts of 1 person for 4 days at a time, giving around-the-clock coverage
    * Most of the time is spent updating baselines, because the Chromium bots run the pixel tests
    * Also red most of the time
    * This is what the gardeners look at first
* Qt has Ossy, leading a group of 6 people
  * They have found that it takes experience to identify which checkins caused which failures
* GTK
  * Makes use of TestFailures + build.webkit.org/waterfall
  * For a while they had one person per day gardening, but right now it is up to individual contributors
== Can we share more tools? ==
* Ideally we should all use the same status page to determine if something broke
* Darin's ideal tool:
  * Identifies when something is broken that wasn't broken before
  * Presents the suspect set of revisions
  * Once the gardener determines which revision is most likely the cause, provides a quick way to notify the relevant people
* Garden-O-Matic
  * Built on top of code in webkitpy in the WebKit tree, designed to be usable by any port
  * Currently only works for the Chromium port - needs adjustment for the different result formats and URLs used by different ports
    * FIXME: bugs for these?
  * Client-side tool that runs a local webserver
  * Lets you browse failing tests, with one click to get the rebaseline applied to your local tree
* We should sit down and merge the changes made to Buildbot for build.chromium.org and build.webkit.org
  * Full Buildbot is not checked into WebKit, only some configuration files
  * For build.webkit.org, changing the config files automatically restarts the master
  * The Chromium infrastructure team tracks Buildbot ToT pretty closely
  * On build.webkit.org you can go back 25, 50, 100, etc. on the waterfall views
  * FIXME: Can we get the improvements that Ojan made to the console view for build.chromium.org added for build.webkit.org?
* Qt is working on a tool to run just the tests relevant to your change, based on code coverage (a sketch of the idea follows this list)
  * It still doesn't deal well with flaky tests
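The coverage-based selection Qt is working on could look roughly like this; relevant_tests and the toy coverage map below are made up for illustration, not Qt's actual tool. The idea is to build a map from source files to the tests whose runs exercised them, then run only the tests touched by the changed files:

{{{#!python
def relevant_tests(changed_files, coverage_map):
    """coverage_map maps a source file to the set of tests that exercise it."""
    tests = set()
    for path in changed_files:
        tests |= coverage_map.get(path, set())
    return sorted(tests)

# Toy data; real coverage maps would come from instrumented test runs.
coverage_map = {
    "Source/WebCore/rendering/RenderTable.cpp": {
        "fast/table/border-collapsing.html",
        "tables/mozilla/core/row_span.html",
    },
    "Source/WebCore/dom/Document.cpp": {"fast/dom/title-text.html"},
}

print(relevant_tests(["Source/WebCore/rendering/RenderTable.cpp"], coverage_map))
}}}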
== Infrastructure/waterfall improvements ==
* More EWS bots?
  * FIXME: Get the Mac EWS bots running the tests
* An easier way to identify the commit that caused a test failure
* Automatic notifications of build/test failures (see the sketch after this list)
  * The issue here is that when there are infrastructure problems, everyone gets notified
  * FIXME: We should bring back the automatic notifications from sheriffbot
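A hedged sketch of what those automatic notifications could do, including keeping quiet toward committers during infrastructure problems; none of this is sheriffbot's real code, and the reasons, addresses, and revision numbers are hypothetical:

{{{#!python
INFRASTRUCTURE_REASONS = {"kernel panic", "out of disk space", "slave lost"}

def notify_on_red(reason, suspect_revisions, send_mail):
    """suspect_revisions is a list of (revision, committer_email) pairs."""
    if reason in INFRASTRUCTURE_REASONS:
        # No particular commit is implicated; don't spam every committer.
        send_mail(["bot-maintainers@example.org"],
                  "Infrastructure problem: %s" % reason)
        return
    if not suspect_revisions:
        return
    committers = sorted({email for _, email in suspect_revisions})
    send_mail(committers,
              "The bots went red (%s); one of r%d-r%d is suspect"
              % (reason, suspect_revisions[0][0], suspect_revisions[-1][0]))

# Example usage with a stub mailer:
notify_on_red("tests failed",
              [(95001, "alice@example.org"), (95002, "bob@example.org")],
              lambda to, subject: print(to, subject))
}}}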
== Distributing builds / testing (e.g. distcc) ==
* Chromium does some distributed compilation, which speeds up builds quite a lot
* Chromium sometimes splits up tests - runs half on one machine and half on another (a sketch follows this list)
* dpranke looked at doing a master / gather setup, but it might not be worth the effort needed to keep the bots in sync
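A minimal sketch of the test-splitting idea: each of N machines takes every Nth test from a stable ordering, so the shards are disjoint and together cover the whole suite. Purely illustrative:

{{{#!python
def shard(tests, shard_index, total_shards):
    """Return the slice of the sorted test list this machine should run."""
    return sorted(tests)[shard_index::total_shards]

tests = ["fast/dom/a.html", "fast/js/b.html", "fast/css/c.html", "fast/table/d.html"]
half_one = shard(tests, 0, 2)
half_two = shard(tests, 1, 2)
assert sorted(half_one + half_two) == sorted(tests)
}}}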
== Pixel tests ==
* Could try to convert more of them to ref tests (the sketch after this list illustrates the difference)
  * Garden-O-Matic will help with this
* Are pixel tests worth it?
  * Every time Apple has tried to start running the pixel tests, it has been a struggle to maintain them
  * Neither Chromium nor Apple has run a cost / benefit analysis of them
  * One gardener said that about 60% of the pixel test failures he saw while gardening pointed out real bugs
  * It is difficult for the gardener to tell, for a lot of the tests, whether a change is actually a regression
    * This results in a lot of skipping
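A minimal sketch of the pixel test / ref test difference; render_page and read_baseline are hypothetical stand-ins for DumpRenderTree's pixel output and the checked-in image baselines, not real APIs:

{{{#!python
def passes_pixel_test(test_path, render_page, read_baseline):
    # Pixel test: compare the rendering against a stored per-platform image
    # that a gardener must update whenever rendering changes.
    return render_page(test_path) == read_baseline(test_path)

def passes_ref_test(test_path, render_page):
    # Ref test: render the test and its -expected.html reference with the
    # same engine on the same machine and compare the two renderings.
    reference = test_path.replace(".html", "-expected.html")
    return render_page(test_path) == render_page(reference)
}}}

The maintenance win is that a ref test has no per-platform image to regenerate: only a real divergence between the test and its reference makes it fail.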
== Discussion of the No Commits Until It Is Green policy ==
* Would need to be actively enforced by tools
* Chromium has this policy today, but:
  * there aren't that many flaky tests (most are already marked as flaky in test_expectations.txt)
  * they have been able to isolate machine-related flakiness
  * they have tryjobs and unit tests, all required to be run before checking in
* Probably not feasible yet for WebKit anyway, due to tooling:
  * Need a way to grab new baselines for all ports
  * Need EWS support for tests for all ports
== Discussion of causes of flaky tests ==
1. Lots of tests don't work well in parallel, due to reliance on system components that get overloaded
  * dpranke may have something for this soon
2. Interference from tests that ran earlier
  * e.g. a setTimeout that gets applied to the next test
  * Seems like we could improve our tools for that
3. Memory corruption
  * dpranke is working on having NRWT restart DRT on directory boundaries, and has found that it does reduce flakiness (sketched after this list)
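A hedged sketch of that directory-boundary restart idea, not webkitpy's actual code; DRT_COMMAND is a placeholder command line, and the point is only that memory corruption from one directory's tests can't bleed into the next:

{{{#!python
import os
import subprocess

DRT_COMMAND = ["DumpRenderTree", "-"]  # placeholder, not the real invocation

def run_with_directory_restarts(tests):
    current_dir = None
    drt = None
    for test in sorted(tests):
        test_dir = os.path.dirname(test)
        if test_dir != current_dir:
            if drt is not None:
                drt.stdin.close()
                drt.wait()  # retire the old process at the boundary
            drt = subprocess.Popen(DRT_COMMAND, stdin=subprocess.PIPE)
            current_dir = test_dir
        drt.stdin.write((test + "\n").encode())
        drt.stdin.flush()
    if drt is not None:
        drt.stdin.close()
        drt.wait()
}}}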
== Workflow improvements ==
* More emphasis on running tests before committing
* Make it easier to generate platform-specific test results
* Standard process
  * If a test is failing, there should be one standard way of dealing with it
  * We don't currently have a good understanding of what the best practice is (do you roll it out? skip it? land failing results? how long do you wait after notifying the committer?)
== Skipped tests are technical debt ==
* More emphasis on unskipping?