Flaky Tests: A Field Guide to Root Causes and Real Fixes
A test that passes on retry is not fixed — it's lying. The five real causes of flakiness, how to diagnose each one, and why auto-retry is a last resort, not a strategy.
A flaky test is one that passes and fails on the same code. Most teams respond with retries, which converts a loud signal into a silent tax: CI gets slower, real regressions hide behind "just re-run it", and trust in the suite erodes. Nearly all flakiness traces back to five root causes — and each has a specific fix.
1. Shared mutable state between tests
The most common cause. A test mutates a database row, a module-level variable, or a singleton, and a later test depends on the pristine value. The giveaway: the test passes alone but fails in the full suite (or only fails when run in a particular order).
# Diagnose: run in random order. If failures move around, it's state.
pytest -p randomly
vitest --sequence.shuffle
Fix: each test creates and owns its data. Use transactions rolled back per test, unique identifiers (user-${testId}), or fresh fixtures — never a shared seed row that tests "promise" not to touch.
2. Time
Anything that reads the wall clock is a flake waiting for midnight, month-end, a leap year, or a slow CI runner. expect(token.expiresAt).toBe(Date.now() + 3600_000) fails whenever the two Date.now() calls straddle a millisecond.
// Freeze the clock — don't assert against the real one
vi.useFakeTimers();
vi.setSystemTime(new Date("2026-06-13T10:00:00Z"));
3. Unawaited asynchrony
Asserting before work finishes: a missing await, a fire-and-forget promise, an event handler that hasn't run yet. These pass on fast machines and fail under CI load — which is why "works on my machine" is the classic symptom.
// Bad: sleep and hope
await sleep(500);
expect(screen.getByText("Saved")).toBeVisible();
// Good: wait for the condition itself
await screen.findByText("Saved"); // polls until it appears or times out
Every sleep(n) in a test is a bug with a timer attached. Replace it with an explicit wait on the condition you actually care about.
4. Order-dependent collections
Databases without ORDER BY, Promise.all results consumed by index, Go map iteration — all return elements in an order that is usually stable and occasionally isn't. Assert on set membership or sort before comparing.
5. Real network and real infrastructure
Tests that hit live APIs inherit every outage, rate limit, and latency spike of that dependency. Stub at the boundary (msw, nock, a fake S3) for unit/integration tiers, and quarantine the genuinely end-to-end tests into a separate, non-blocking job.
What about retries?
Retries are acceptable as a containment measure with a paper trail: retry once, but record every retried test and treat the list as a bug queue. A retry policy without that feedback loop is just a flakiness amnesty.
The triage workflow
- Reproduce locally with the suite's random seed (
--seedfrom the failing CI run). - Bisect: run the failing test together with halves of the suite to find the polluting test.
- Fix the root cause; if you can't within a day, quarantine the test (skip + ticket) — a quarantined test is honest, a retried one is not.