Your p99 Is Lying: Coordinated Omission and the Latency You Never Measured

A load test reported a 12ms p99 while real users were timing out. The numbers weren't faked. The benchmark just stopped sending requests whenever the system got slow, so it never measured the slowness. That's coordinated omission, and it quietly poisons most latency stats.

June 18, 2026 · 8 min read · Observability

I once stared at a load-test report showing a 12ms p99 and a 40ms max, then alt-tabbed to a dashboard where real users were seeing multi-second stalls and the odd timeout. Same service, same traffic level. Both sets of numbers were collected honestly. The benchmark wasn't lying on purpose; it was lying structurally, because of a flaw so common it has a name: coordinated omission. Once I understood it I stopped trusting almost every latency graph I'd ever drawn, including my own.

The setup that quietly cheats

Picture the classic closed-loop load generator: a worker sends a request, waits for the response, records the time, sends the next one. Repeat. It sounds rigorous. Here's the problem. Suppose the server hiccups and one request takes 1 second instead of 1ms. During that whole second the worker is blocked waiting, so it isn't sending new requests. You meant to fire 1000 requests in that second; you fired one, the slow one.

When the system finally responds, the worker resumes and everything's fast again. Your sample now contains one 1000ms data point and a flood of 1ms ones. The 999 requests that should have been sent during the stall, and would have queued up behind it and been slow too, were simply never sent. The benchmark coordinated with the system: it backed off exactly when things got bad, so it omitted the bad measurements. Hence the name.
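
To make the omission countable, here is a minimal sketch of a single closed-loop worker running in virtual time. The function name and parameters (closed_loop_starts, stall_at_ms, stall_ms) are my own illustration, not any tool's API, and the model assumes one worker with no concurrency.

```python
def closed_loop_starts(window_ms, service_ms, stall_at_ms=None, stall_ms=0):
    """Start times of the requests one closed-loop worker sends in a window.

    Virtual time in integer milliseconds. The request that begins at
    stall_at_ms takes stall_ms instead of service_ms (assumed stall model).
    """
    starts = []
    t = 0
    while t < window_ms:
        starts.append(t)
        took = stall_ms if t == stall_at_ms else service_ms
        t += took  # the worker is blocked until the response arrives
    return starts


healthy = closed_loop_starts(1000, 1)
stalled = closed_loop_starts(1000, 1, stall_at_ms=0, stall_ms=1000)
print(len(healthy), len(stalled))  # prints "1000 1"
```

The healthy second sends 1000 requests. The stalled second sends one, and the other 999 never exist as far as the report is concerned.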

Why this wrecks the high percentiles specifically

In the recorded sample the average barely moves, because one slow request is drowned out by thousands of fast ones, which is part of why this hides so well. But percentiles are about the tail, and the tail is exactly what got deleted. You measured one bad request instead of the thousand bad requests reality would have produced, so your p99 is computed over a sample where the slow events are wildly underrepresented. The math then reports a gorgeous p99 that describes a system that doesn't exist. The worst part is it fails in the most flattering direction, so nobody questions it.

A concrete way to feel it

Say your service handles a request in 1ms normally, but once every 10 seconds it freezes for a full second. A user issuing requests steadily through that freeze experiences a whole second of requests that are slow, from "stalled for 1000ms" down to "stalled for 1ms", averaging ~500ms across that batch. Now run a naive closed-loop benchmark: it records one ~1000ms sample for the freeze and then races through thousands of 1ms samples. Its p99? Still near 1ms. About 10% of steadily issued requests land in the freeze, so the real p99, accounting for everyone stuck in the queue behind it, is around 900ms. Same system. The gap is entirely coordinated omission.
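
The numbers in this example can be checked with a small virtual-time simulation. This is a sketch under assumptions I'm adding: one 10-second window, the freeze placed at the 5-second mark, one request due per millisecond, and a server that releases every held request the moment the freeze ends. The constants and function names are illustrative only.

```python
import math

WINDOW_MS = 10_000        # one freeze period
FREEZE_START_MS = 5_000   # assumed position of the freeze in the window
FREEZE_MS = 1_000
SERVICE_MS = 1


def completion_time(start_ms):
    """When a request starting at start_ms finishes (assumed server model)."""
    if FREEZE_START_MS <= start_ms < FREEZE_START_MS + FREEZE_MS:
        start_ms = FREEZE_START_MS + FREEZE_MS  # held until the freeze ends
    return start_ms + SERVICE_MS


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def closed_loop():
    """Send the next request only after the previous one returns."""
    samples, t = [], 0
    while t < WINDOW_MS:
        done = completion_time(t)
        samples.append(done - t)  # clock starts at the actual send
        t = done
    return samples


def on_schedule():
    """One request due every millisecond, timed from its due time."""
    return [completion_time(due) - due for due in range(WINDOW_MS)]


naive = closed_loop()
scheduled = on_schedule()
print(len(naive), percentile(naive, 99), max(naive))
print(len(scheduled), percentile(scheduled, 99), max(scheduled))
```

Under these assumptions the closed loop records 9000 samples with a p99 of 1ms, while the schedule records 10000 samples with a p99 of 901ms. Both see the same 1001ms max, which is why a bad max next to a clean p99 is worth a second look.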

What "correct" looks like

The fix is to measure latency against a schedule, not against when the previous request happened to finish. You decide up front: a request is due every 1ms. If request N was supposed to start at time T but the system was busy and it didn't actually start until T+800ms, then its true latency includes that 800ms of waiting-to-even-start. You measure from intended send time, not actual send time.

// coordinated omission (wrong): clock starts when we send
send_at  = now()
response = call()
latency  = now() - send_at        // misses queueing during stalls

// schedule-based (right): clock starts when the request was DUE
due_at   = start + n * interval   // fixed cadence, set in advance
wait_until(due_at)                // returns at once if we're already late
response = call()                 // may start late if system is busy
latency  = now() - due_at         // includes time spent waiting to send
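
For something runnable, here is a minimal Python sketch of the schedule-based loop. The function name run_on_schedule and its parameters are my own, and it is single-threaded, so a slow call delays the start of every later request instead of overlapping with them. The clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time


def run_on_schedule(call, interval_s, count,
                    clock=time.monotonic, sleep=time.sleep):
    """Issue `count` calls on a fixed cadence and time each from its due time.

    Returns one latency per request, in the clock's units.
    """
    start = clock()
    latencies = []
    for n in range(count):
        due_at = start + n * interval_s   # fixed in advance, never shifted
        delay = due_at - clock()
        if delay > 0:
            sleep(delay)                  # ahead of schedule: wait for the slot
        call()                            # behind schedule: start late
        latencies.append(clock() - due_at)  # includes time spent waiting to send
    return latencies
```

Calling run_on_schedule(send_request, 0.001, 10_000), with send_request standing in for whatever issues one request, would aim for one request per millisecond. A real load generator would also need concurrency so late requests can overlap, which this sketch leaves out.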

Measuring from the intended send time is the correction Gil Tene built into wrk2, and it is the reason wrk2 exists. His HdrHistogram library offers a related fix that backfills the samples a stalled recorder missed. wrk2 takes a target rate and holds the cadence regardless of how the server behaves, so a stall shows up as a pile of requests that all started late, which is what a real user queue does. Run the original wrk and then wrk2 against a service with a periodic GC pause and the reported p99 can jump by an order of magnitude or more. Nothing changed but the measurement.
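
If you're stuck with samples that a closed loop already recorded, the histogram-side correction works after the fact: for every sample longer than the expected interval, it adds the samples that would have been taken during the stall. The sketch below is plain Python written to illustrate that idea, not HdrHistogram's own code or API (the Java library exposes it as recordValueWithExpectedInterval). The name backfill is mine.

```python
def backfill(samples, expected_interval):
    """Add the samples a stalled closed-loop recorder would have missed.

    For each recorded value longer than expected_interval, also record
    value - interval, value - 2 * interval, and so on, stopping once the
    synthetic value would drop below one interval.
    """
    corrected = []
    for value in samples:
        corrected.append(value)
        missing = value - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


print(backfill([1, 5, 1], 1))  # prints "[1, 5, 4, 3, 2, 1, 1]"
```

This is an estimate built on the assumption that requests were meant to go out at a steady interval. Measuring against a schedule in the first place is the more direct fix.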

It's not just benchmarks

The same trap lives in production telemetry. If you only record latency for requests your server actually accepted, you've omitted every request that got queued at the load balancer, dropped by a full connection pool, or shed during an overload, which are precisely your worst-latency events from the user's point of view. The metric looks healthiest exactly when the system is suffering most, because the suffering requests never made it far enough to be timed. Measuring only the survivors is coordinated omission wearing a production badge.
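
Here is a rough sketch of how far the two views can drift apart. It rests on an assumption I'm making for illustration: a request that was dropped or shed is charged at the client's timeout, since that is roughly how long the user waited. The function names and the example numbers are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_survivors_only(served_ms):
    """What the dashboard shows if only accepted requests are timed."""
    return percentile(served_ms, 99)


def p99_all_outcomes(served_ms, failed_count, client_timeout_ms):
    """Same percentile, with each dropped or shed request counted at the
    client timeout (a modeling assumption, not a measurement)."""
    everyone = list(served_ms) + [client_timeout_ms] * failed_count
    return percentile(everyone, 99)


served = [5] * 950  # hypothetical: 950 requests served in 5ms each
print(p99_survivors_only(served))          # prints "5"
print(p99_all_outcomes(served, 50, 3000))  # prints "3000"
```

With 5% of requests shed, the survivors-only p99 stays at 5ms while the all-outcomes p99 is the full 3000ms timeout.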

Rules of thumb

Closed-loop load tests that send the next request only after the last one returns will under-report tail latency, because they stop sending during the stalls they should be measuring.
Coordinated omission hides stalls from every statistic, but it hits p99/p999 hardest, because the tail is what got deleted. A suspiciously clean high percentile under load is the tell.
Measure latency from when a request was scheduled to start, not from when it actually started, so queueing during slowdowns is counted.
Use rate-controlled load generators such as wrk2 that hold a fixed request cadence instead of backing off when the server slows, and record into a histogram that can correct for omitted samples, such as HdrHistogram.
In production, you're omitting too if you only time requests the server accepted. Count the queued, dropped, and shed ones, or your dashboard lies hardest during incidents.
When a benchmark and your users disagree about latency, trust the users and suspect the benchmark's loop first.

