7 Steps To Fix Flaky Tests

Undo Bytes
4 min read · Feb 14, 2022


91% of software developers admit to having defects which remain unresolved because they cannot reproduce the issue.

[See analyst research report]

Slowed down by a growing backlog of test failures?

CI/CD and test automation help with delivering quality software at speed, but they bring with them a growing backlog of failing tests. Test suites are plagued by hundreds (perhaps thousands) of failing tests, and trying to bring these under control can feel overwhelming.

Ignoring these failing tests means leaving bugs to spread across your software and infect larger and larger areas of it. And putting in more hours just to keep pace is simply ignoring the deteriorating health of your product.

Intermittently failing tests (aka flaky tests) are major contributors to a growing backlog of failing tests. They are hard to reproduce, so they remain unresolved; and some of these test failures contain defects that end up escaping into production and negatively impacting customers.

The impact is not just external, though. Seeing a lot of test failures is demoralizing for highly skilled engineers who want to write high-quality code which doesn’t break all the time. Playing whack-a-mole with your software is not what your best engineers sign up for. They care about producing new features rather than spending all their time fixing broken parts of your software. If the engineering leadership is not taking this issue seriously, your best engineers will leave (sometimes heading straight to your competitors).

7 Steps To Fix Flaky Tests

1. Alert the rest of the business and gain buy-in to get the problem fixed

Once you’ve realized the problem of flaky tests is real and that it comes with significant costs — increased engineering costs, impediment to delivery speed, staff burnout and turnover, morale issues — it’s time to alert the rest of the business.

Ensure that management at all levels is aware of the need for change. Make noise in planning meetings to highlight the problem and its consequences to the engineering team, the business, and customers.

2. Introduce a culture of zero tolerance to test failures

There is no such thing as an “acceptable” test failure. It doesn’t matter whether it’s caused by infrastructure, test, or product code. Like an infection, the longer you leave it, the worse it is going to get, and the harder (and the more expensive!) it will be to get on top of it. This is the message to communicate to the engineering organization through code quality seminars, lunch & learn talks, and other internal communication channels.

3. Set up infrastructure you can trust

Don’t skimp on your testing hardware. If possible, rent and use scalable, cloud-based systems. If you can’t rent, then make sure you buy big enough. If the infrastructure is not rock solid, you cannot achieve reliable testing; and if it’s under-provisioned, it’s definitely not going to be rock solid.

4. Organize your tests into modules and suites

If you haven’t already done so, you’ll need to set up a rapid-feedback test suite for developers and a larger integration test suite for release-readiness. Tests need to be associated with modules so that engineers can be sure which tests they are responsible for fixing, and why.

It is also advisable to quarantine your flaky tests to prevent disruption to your CI/CD pipeline: i.e. move them out of your delivery pipeline and into a dedicated ‘sporadic tests’ suite that holds all sporadically failing tests.
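
How you quarantine tests depends on your test framework. As a minimal sketch, assuming pytest and a custom marker named ‘sporadic’ (both are illustrative assumptions, not a prescription), it could look like this:

```python
# pytest.ini -- register the custom marker so pytest doesn't warn about it:
# [pytest]
# markers =
#     sporadic: known intermittently failing tests, quarantined from the delivery pipeline

import pytest

@pytest.mark.sporadic  # quarantined: tracked and re-run separately, does not block delivery
def test_order_submission_retries():
    ...

def test_order_submission_happy_path():
    ...
```

The delivery pipeline then runs pytest -m "not sporadic", while a separate scheduled job runs pytest -m sporadic, so quarantined tests stay visible and keep getting fixed rather than quietly forgotten.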

5. Agree a set of best practices for writing reliable tests

List a set of criteria that are often signs that a test is unreliable. For example, hard-coded timeouts/sleeps: if a test relies on something else happening during a sleep, then sometimes the sleep won’t be long enough (and if you make every sleep very long, the suite will take too long to run).
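
As a concrete illustration, here is a minimal sketch of replacing a fixed sleep with a bounded polling wait (the wait_until helper and the job object in the comments are hypothetical, purely for illustration):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll condition() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Flaky pattern: a fixed sleep, hoping the background job has finished by then.
#     time.sleep(5)
#     assert job.is_done()
#
# More reliable: wait for the actual condition, with an explicit upper bound.
#     assert wait_until(job.is_done, timeout=30)
```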

And if you’re not already doing it, it’s best practice to introduce code-reviewing tests to make sure they are reliable.

6. Give engineers the space, resources, and mandate to fix flaky tests

Set aside developer resources to address known flakiness:

  • Give them the space and time to debug flaky tests
  • Empower them to quickly and efficiently diagnose the root cause of flakiness, by giving them the right tools for the job

This isn’t easy. In almost all development organizations, developers are a scarce resource and there is lots of pressure to deliver short-term features. But the reality is that there are no easy answers — you have to invest to achieve the goal. The returns on that investment, though, will be manifold, and they may well come sooner than you think.

Top tip: using LiveRecorder, engineers can automatically run their intermittently failing tests, in a loop under recording, until they fail. Once they fail, engineers have everything they need inside the recording to quickly diagnose the problem. All they have to do is debug the recording.
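
If you want to reproduce the “run it in a loop until it fails” part without any particular tooling, a minimal sketch in Python could look like the following (the test command is a placeholder; if you use a record/replay tool such as LiveRecorder, you would wrap the invocation so the failing run is captured):

```python
import subprocess
import sys

# Placeholder: substitute your own test invocation here.
TEST_CMD = ["pytest", "tests/test_checkout.py::test_flaky_case"]

MAX_RUNS = 500  # give up eventually so the job doesn't loop forever

for run in range(1, MAX_RUNS + 1):
    result = subprocess.run(TEST_CMD)
    if result.returncode != 0:
        print(f"Failed on run {run}: keep this run's artifacts/recording for debugging.")
        sys.exit(1)

print(f"No failure in {MAX_RUNS} runs.")
```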

7. Prevent new regressions

Once flakiness is under control, treat every new test failure as a serious issue and prevent flakiness from creeping back in. Find ways to ensure developers are notified directly (and ideally automatically) about test failures from code that they are directly responsible for.

Top tip: make sure the notification email content is relevant, so that developers don’t end up ignoring the messages.
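
As a rough sketch of routing failures to the people who own them (the module-to-owner mapping and the addresses below are entirely hypothetical), you could group failing tests by owner before sending anything:

```python
# Hypothetical mapping from test path prefix to the owning team's address.
OWNERS = {
    "tests/payments/": "payments-team@example.com",
    "tests/search/": "search-team@example.com",
}
DEFAULT_OWNER = "qa-triage@example.com"

def owner_for(test_path: str) -> str:
    """Return the address responsible for a failing test."""
    for prefix, owner in OWNERS.items():
        if test_path.startswith(prefix):
            return owner
    return DEFAULT_OWNER

def group_failures_by_owner(failing_tests: list[str]) -> dict[str, list[str]]:
    """Group failing test paths so each owner only sees their own failures."""
    grouped: dict[str, list[str]] = {}
    for test_path in failing_tests:
        grouped.setdefault(owner_for(test_path), []).append(test_path)
    return grouped
```

Each owner then receives only the failures they can act on, which is what keeps those notifications relevant.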

Once you get to this point, keeping your test suite green is relatively easy. Your team can breathe again… and so can your product!

Summary

Flaky tests are not a fact of life in software engineering. They can be controlled.

No software engineering team can be described as “first-class” if they don’t have their tests under control. If your competition is getting this right and you don’t, you are at risk of being left behind.

Originally published at https://undo.io.

Written by Undo Bytes

Undo is the time travel debugging company for Linux. We equip developers with the technology to understand complex code and fix bugs faster. https://undo.io
