AI's "overnight" solution for our flaky tests took two weeks to adopt

Recently I stopped a group of flaky tests from running in CI. 60% of CI runs were failing because of this group, which was unsustainable. Three weeks later I was able to restore that group to CI, with a resulting failure rate of 0% on main¹. Our “non-flaky” tests now give more false positives than the (previously) flaky group.

This is not really a post about tests, though. It’s about AI’s contribution (a lot) and what it took to make that contribution usable (also a lot).

The hardest problem

Developers on this project had been quarantining tests with a :flaky label for several years. The strategy was to quarantine a small group which could be expected to fail randomly but could also be re-run easily and separately from the full suite. Apart from the flakiness, the test suite is comprehensive and gives us high confidence that if we merge something after tests pass, it works.

Over the years, several developers had each spent a week trying to reduce flakiness, and every attempt failed. In our defense, the flaky tests centred around interactive pages using Stimulus or Hotwire, and online discussion of this topic is a combination of ideas we tried already, plus someone saying: “I tried a lot, it doesn’t work, I think there’s a bug”.

The most promising angle was adopting Playwright, which did improve some things but also left us with some tests that failed permanently and needed to be skipped. There’s a dissatisfying way in which this is better than tests that only fail some of the time.

The problem started to look more and more like a trap set for enthusiastic developers. As a manager I always had to urge caution: “sure, you can see some approaches that could help, but bear in mind the last five times anyone tried they found very promising angles that didn’t change the stats in GitHub at all”. Developers whom I trust were seriously recommending deleting the entire group.

Opus “solved it” overnight

One night, Opus 4.6 running in Claude Code solved “the problem” by running the flaky test group hundreds of times and analyzing failures. There was some prompting to help Claude avoid premature conclusions and be aware that the problems could not be reproduced without repetition, plus a markdown file where it would record progress. Otherwise, no special magic.
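As a rough illustration of that repeat-and-tally loop (not the actual setup, which isn't shown in this post), here is a minimal Ruby sketch. The method name `tally_failures` and the shape of its report are hypothetical. The runner is passed in as a block so the loop can be tried without a real test suite.

```ruby
# Hypothetical sketch: repeat a test run and tally which examples fail.
# The block stands in for one run of the flaky group and returns an array
# of failed example names (empty when the run passed).
def tally_failures(runs)
  counts = Hash.new(0)
  failed_runs = 0

  runs.times do |i|
    failures = yield(i)
    failed_runs += 1 unless failures.empty?
    failures.each { |name| counts[name] += 1 }
  end

  # Most frequent offenders first, so the worst example is easy to spot.
  by_example = counts.sort_by { |_name, count| -count }.to_h
  { runs: runs, failed_runs: failed_runs, by_example: by_example }
end

# Stand-in runner for demonstration: "spec A" fails on every fifth run.
report = tally_failures(20) { |i| (i % 5).zero? ? ["spec A"] : [] }
puts "#{report[:failed_runs]} of #{report[:runs]} runs failed"
# prints "4 of 20 runs failed"
```

In a real project the block would shell out to the test runner and parse its output, for example with something like `bundle exec rspec --tag flaky`. That command is an assumption about how the group is selected, not something taken from the project.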

I could see Claude’s progress over time because it needed to run the flaky group in larger and larger batches. At first, five times was sufficient because the errors it found occurred 20% of the time. As those were fixed, I had to tell it to use batches of ten, fifty, and then one hundred. Finally, it reached a point where zero errors were found.
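The growing batch sizes follow from simple arithmetic. Assuming each run fails independently with the same probability p (a simplification), the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The sketch below computes that, along with the smallest batch that reaches a chosen detection probability. The 95% target is an illustrative choice of mine, not a figure from the project.

```ruby
# Chance of seeing at least one failure in `runs` independent runs,
# given a per-run failure rate.
def detection_probability(failure_rate, runs)
  1.0 - (1.0 - failure_rate)**runs
end

# Smallest batch size whose detection probability reaches `target`.
def runs_needed(failure_rate, target: 0.95)
  (Math.log(1.0 - target) / Math.log(1.0 - failure_rate)).ceil
end

puts detection_probability(0.20, 5).round(2) # prints 0.67

[0.20, 0.05, 0.01].each do |rate|
  puts "failure rate #{rate}: #{runs_needed(rate)} runs"
end
# failure rate 0.2: 14 runs
# failure rate 0.05: 59 runs
# failure rate 0.01: 299 runs
```

Under these assumptions, a batch of five shows a 20% failure about two times in three, while a failure that occurs 1% of the time needs batches in the hundreds before a clean result means much.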

A “nice” thing about needing such large batches is that I could leave Claude alone for hours at a time while my normal evening continued. Flaky specs may be a problem uniquely suited to coding agents in that way. There’s not even much token use: it just kicks off a long run and surfaces for an internal conversation, then kicks off the next batch.

Two weeks to make the results useful

This isn’t a post about test failure strategy, so I’ll spare you details of what was flaky and what fixes applied. Instead I’ll try to communicate some of the meta concerns I had with the resulting code changes.

Given a test that looked something like this:

create objects
visit page
click A
click B
expect expression 1 to be true
click C
expect expression 2 to be true

Unchecked, Claude would have turned it into something like this:

create objects in a slightly different way that makes no difference
visit page
explicit sleep
unnecessary scoping to a specific section of the page
  click A
end of unnecessary scoping
click B, with 3 second wait passed as option arg
a clever improvement that should have been on line 3
expect expression 1 to be true
click C
an improvement that worked in other tests but was irrelevant here
expect expression 2 to be true

Ultimately the changes added up to a good improvement, usually because of one crucial addition per test (in our fictional example, line 8) that was on the wrong line and hidden in a mountain of garbage (lines 3, 4, 6, 7, 11).

It took two weeks to:

separate coincidence from real results
remove the things that didn’t make a difference
apply good practice to the important differences
unify slight variations on the same changes
generalise to other parts of the test suite
make sensible commits

Some of this work was just a matter of applying good practice (e.g. any explicit sleep call is immediately suspect), and other times it was sending Claude back to hundreds of test runs to prove that something it had added made no difference.
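To illustrate why an explicit sleep is suspect: a fixed pause is a guess, either too short (still flaky) or longer than needed (slow). Waiting on a condition avoids the guess. The helper below is a hypothetical, plain-Ruby sketch of that idea, not the API of any browser-testing library and not code from our suite.

```ruby
# Hypothetical helper: poll a condition until it holds or the timeout
# expires, instead of sleeping for a fixed interval and hoping.
def wait_until(timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Simulated asynchronous page update that completes after 0.2 seconds.
ready = false
updater = Thread.new do
  sleep 0.2
  ready = true
end

# A fixed `sleep 0.1` here would be too short, and `sleep 1` wastes time.
# Polling returns as soon as the condition becomes true.
puts wait_until(timeout: 2.0) { ready } # prints true
updater.join
```

Browser-testing tools generally provide waiting assertions that behave along these lines, which is part of why a bare sleep in a test usually signals a missing condition rather than a real fix.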

Conclusion: processing my reactions

I see in myself three reactions.

1. Hooray, I’m still useful as a programmer!

I think that, without lots of experience working with Rails and RSpec, it would have been impossible to move from what Claude initially suggested towards something sustainable². The exact amount of experience necessary is uncertain, but I have more than ten years. It took a lot to move beyond the optimism and false positives, and it would have taken more if I didn’t already have a reasonable gut instinct about these things.

2. Boy, AI is awful! Why bother with it if it takes so long to use the results?

I would absolutely use (and recommend) Claude for analysing flaky tests again. I think it would be a mistake not to do so. Accurately running long processes with tiny changes in between multi-hour waits is not a strength for humans.

In addition, Claude did reason through code running in parallel processes in a way that no human had managed for years. That particular part of our code is complex, but has not had active work for years, meaning that no human has good context. Claude probably caught up in 10 minutes.

An interesting aside here is that I find Claude does much better work when it has tests to help it reason about application code. The tests were flaky, but they were still a good record of what the code was supposed to do.

3. Why keep going for two weeks after AI clearly fixed the problem I care about in one evening?

I could have taken the win, ignored the cruft, and gained two weeks. If I had, I would have lost those two weeks and more later on. Humans and AI agents would cargo cult the new (anti) patterns, falsely claiming victory over any future flakiness, and making it harder to identify the real problems.

As with all programming, eventually “tidy first, then do the work” ends up being faster than “just do the work”. There’s no escaping the tidying if I want good results; the question is whether I do it at a predictable time and pace or when there’s an emergency (like no-one being able to deploy any code because CI keeps failing).

That includes tidying up after AI.

¹ Commits on main are a proxy for “code that should pass tests”, as opposed to work-in-progress commits, which also go through CI and fail tests for real reasons.
² This was Opus 4.6, but nothing I’ve seen of later versions of Opus gives me confidence that humans are less necessary here.