Flaky Tests: A Field Guide to Root Causes and Real Fixes

A test that passes on retry is not fixed — it's lying. The five real causes of flakiness, how to diagnose each one, and why auto-retry is a last resort, not a strategy.

June 13, 20269 min readTestingCI

A flaky test is one that passes and fails on the same code. Most teams respond with retries, which converts a loud signal into a silent tax: CI gets slower, real regressions hide behind "just re-run it", and trust in the suite erodes. Nearly all flakiness traces back to five root causes — and each has a specific fix.

1. Shared mutable state between tests

The most common cause. A test mutates a database row, a module-level variable, or a singleton, and a later test depends on the pristine value. The giveaway: the test passes alone but fails in the full suite (or only fails when run in a particular order).

# Diagnose: run in random order. If failures move around, it's state.
pytest -p randomly
vitest --sequence.shuffle

Fix: each test creates and owns its data. Use transactions rolled back per test, unique identifiers (user-${testId}), or fresh fixtures — never a shared seed row that tests "promise" not to touch.

2. Time

Anything that reads the wall clock is a flake waiting for midnight, month-end, a leap year, or a slow CI runner. expect(token.expiresAt).toBe(Date.now() + 3600_000) fails whenever the two Date.now() calls straddle a millisecond.

// Freeze the clock — don't assert against the real one
vi.useFakeTimers();
vi.setSystemTime(new Date("2026-06-13T10:00:00Z"));

3. Unawaited asynchrony

Asserting before work finishes: a missing await, a fire-and-forget promise, an event handler that hasn't run yet. These pass on fast machines and fail under CI load — which is why "works on my machine" is the classic symptom.

// Bad: sleep and hope
await sleep(500);
expect(screen.getByText("Saved")).toBeVisible();

// Good: wait for the condition itself
await screen.findByText("Saved");   // polls until it appears or times out

Every sleep(n) in a test is a bug with a timer attached. Replace it with an explicit wait on the condition you actually care about.

4. Order-dependent collections

Databases without ORDER BY, Promise.all results consumed by index, Go map iteration — all return elements in an order that is usually stable and occasionally isn't. Assert on set membership or sort before comparing.

5. Real network and real infrastructure

Tests that hit live APIs inherit every outage, rate limit, and latency spike of that dependency. Stub at the boundary (msw, nock, a fake S3) for unit/integration tiers, and quarantine the genuinely end-to-end tests into a separate, non-blocking job.

What about retries?

Retries are acceptable as a containment measure with a paper trail: retry once, but record every retried test and treat the list as a bug queue. A retry policy without that feedback loop is just a flakiness amnesty.

The triage workflow

Reproduce locally with the suite's random seed (--seed from the failing CI run).
Bisect: run the failing test together with halves of the suite to find the polluting test.
Fix the root cause; if you can't within a day, quarantine the test (skip + ticket) — a quarantined test is honest, a retried one is not.

Flaky Tests: A Field Guide to Root Causes and Real Fixes

1. Shared mutable state between tests

2. Time

3. Unawaited asynchrony

4. Order-dependent collections

5. Real network and real infrastructure

What about retries?

The triage workflow

1 replies// weighed in

More from this topic

The Test Pyramid Is a Lie: Modern Layering

Testcontainers: Real Dependencies, Reproducible Tests