Token Bucket vs Leaky Bucket: The Rate Limiter That Let the Burst Through
A rate limiter rejected traffic smoothly in every test and then let a client through at ten times its contracted rate during a real burst, taking down a downstream dependency that had no defenses of its own. The limiter was not broken. It was a token bucket, correctly implemented, doing exactly what a token bucket is designed to do, which turned out to be the wrong algorithm for a downstream that could not absorb bursts at all.
We rate-limited an internal API at 100 requests per second per client to protect a downstream billing service that could sustain about that much load comfortably. The limiter passed every test we threw at it: steady traffic at 100 rps sailed through, traffic above that got rejected with a 429, and the numbers on the dashboard matched the configured limit almost exactly. Then a client that had been idle for a few minutes woke up and sent a burst, and the billing service fell over under a spike of roughly 100 requests landing within about a hundred milliseconds. That is an instantaneous rate near 1,000 rps, ten times the configured limit, from a client the rate limiter was actively enforcing against the whole time.
The limiter was not lying
The rate limiter reported the truth: averaged over the long run, that client stayed within its 100 rps allowance. The billing service did not care about a long-run average. It cared about how many requests hit it in the same hundred milliseconds, and in that narrow window it saw far more than a tenth of a second's fair share, which at 100 rps is about 10 requests. The limiter and the thing it was protecting were measuring, and caring about, two different things.
We were running a token bucket: a bucket holds up to burst_size tokens, refills at rate tokens per second, and a request is allowed if it can take one token out, immediately, no queueing. We had set rate = 100 and, following a common default, burst_size = rate, giving the bucket a capacity of 100 tokens.
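class TokenBucket {
  constructor(rate, burst_size) {
    this.rate = rate;             // tokens added per second
    this.burst_size = burst_size; // bucket capacity
    this.tokens = burst_size;     // start full
    this.lastRefill = Date.now();
  }
  allow() {
    const now = Date.now();
    // Date.now() is in milliseconds; rate is per second, so convert.
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst_size, this.tokens + elapsedSec * this.rate);
    this.lastRefill = now;
    if (this.tokens >= 1) { this.tokens -= 1; return true; }
    return false; // no token available: reject immediately, no queueing
  }
}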
A bucket with 100 tokens of capacity that has been sitting idle for a few minutes is, by construction, full: 100 tokens saved up, because nothing drained it. The very next moment, all 100 of those tokens can be spent in a single burst, all 100 requests going out essentially instantaneously, not spread across the following second. The token bucket's contract is a long-run average rate plus a permitted burst up to its capacity, and we had sized that permitted burst at "the entire second's allowance, delivered at once," which is exactly what happened.
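To make the idle-then-full behavior concrete, here is a minimal simulation sketch. It assumes an injectable clock (the `nowMs` parameter and the `SimTokenBucket` name are illustrative, not part of the limiter we ran) so the run is deterministic instead of depending on wall-clock time.

```javascript
// Token bucket logic with an injectable clock, for deterministic simulation.
class SimTokenBucket {
  constructor(rate, burst_size, nowMs) {
    this.rate = rate;             // tokens added per second
    this.burst_size = burst_size; // bucket capacity
    this.nowMs = nowMs;           // function returning the current time in ms
    this.tokens = burst_size;     // start full
    this.lastRefill = nowMs();
  }
  allow() {
    const now = this.nowMs();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst_size, this.tokens + elapsedSec * this.rate);
    this.lastRefill = now;
    if (this.tokens >= 1) { this.tokens -= 1; return true; }
    return false;
  }
}

let clockMs = 0;
const bucket = new SimTokenBucket(100, 100, () => clockMs);

// Drain the bucket completely, then let the client sit idle for three minutes.
while (bucket.allow()) {}
clockMs += 3 * 60 * 1000;

// The client wakes up and fires 1,000 requests in the same instant.
let admittedInBurst = 0;
for (let i = 0; i < 1000; i++) {
  if (bucket.allow()) admittedInBurst++;
}
console.log(admittedInBurst); // 100
```

All 100 admitted requests share one timestamp. The limiter rejects the other 900, yet the downstream still receives a full second's allowance at once.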
Why the tests never caught it
Every load test we wrote generated traffic as a steady stream, because that is the natural way to write a load generator: loop, send a request, sleep for the interval, repeat. A steady stream never lets the bucket accumulate more than a token or two of slack, so it never exercises the accumulated-burst path at all. The failure mode only appears when a client goes idle long enough to fill the bucket and then sends a genuine burst. That pattern shows up constantly in production (retry-after-outage traffic, a batch job kicking off, a client reconnecting after a network blip) but almost never in a synthetic benchmark built around constant-rate traffic.
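The difference between the two traffic shapes can be sketched as a replay over arrival timestamps rather than a real load generator. The helper names (`admittedTimes`, `peakPer100ms`) and the exact patterns are assumptions chosen for illustration.

```javascript
// Replays arrival times (ms) through token bucket logic and returns the
// arrival times that were admitted.
function admittedTimes(arrivalsMs, rate, burst_size) {
  let tokens = burst_size;
  let last = 0;
  const admitted = [];
  for (const t of arrivalsMs) {
    tokens = Math.min(burst_size, tokens + ((t - last) / 1000) * rate);
    last = t;
    if (tokens >= 1) { tokens -= 1; admitted.push(t); }
  }
  return admitted;
}

// Largest number of admitted requests falling in any fixed 100 ms slot.
function peakPer100ms(timesMs) {
  const counts = new Map();
  let peak = 0;
  for (const t of timesMs) {
    const slot = Math.floor(t / 100);
    const n = (counts.get(slot) || 0) + 1;
    counts.set(slot, n);
    if (n > peak) peak = n;
  }
  return peak;
}

// Steady pattern: one request every 10 ms for 10 seconds (100 rps).
const steady = [];
for (let t = 0; t < 10000; t += 10) steady.push(t);

// Idle-then-burst pattern: nothing for 3 minutes, then 1,000 requests at once.
const idleThenBurst = new Array(1000).fill(180000);

const steadyPeak = peakPer100ms(admittedTimes(steady, 100, 100));
const burstPeak = peakPer100ms(admittedTimes(idleThenBurst, 100, 100));
console.log(steadyPeak, burstPeak); // 10 100
```

The steady pattern never shows the downstream more than 10 requests per 100 ms, so a test built on it passes. The idle-then-burst pattern is the one that exposes the spike.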
The knobs that looked like fixes
Lowering rate to something like 50 rps reduces the long-run average but does not touch the burst behavior: a bucket that has been idle still fills to its capacity and still empties in one shot. The shot only gets smaller if burst_size is derived from rate, as ours was; with an independent burst_size it is exactly the same size. Lowering burst_size alone helps, but purely as a magnitude knob. It does not change the fact that the algorithm intentionally allows saved-up capacity to be spent instantly rather than spread out. You can shrink the burst until it stops being dangerous, but you are choosing that number by trial and error against a downstream you may not fully understand yet.
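A rough way to reason about these knobs is the worst-case bound for a token bucket: in any window, it can admit at most a full bucket plus whatever refills during that window. The sketch below assumes that simple model; `worstCaseAdmitted` is an illustrative name, and the 0.1 second window is the one the billing service cared about.

```javascript
// Upper bound on requests a token bucket can admit in any window of
// windowSec seconds: a full bucket plus the tokens refilled during the window.
function worstCaseAdmitted(rate, burst_size, windowSec) {
  return Math.floor(burst_size + rate * windowSec);
}

console.log(worstCaseAdmitted(100, 100, 0.1)); // 110  (our original settings)
console.log(worstCaseAdmitted(50, 100, 0.1));  // 105  (halving rate barely helps)
console.log(worstCaseAdmitted(100, 10, 0.1));  // 20   (shrinking burst_size does)
console.log(worstCaseAdmitted(100, 1, 0.1));   // 11   (close to the fair share of 10)
```

The burst term dominates at short windows, which is why rate is the wrong knob here and why burst_size only reduces the spike rather than removing it.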
The real fix: match the algorithm to what downstream can absorb
The billing service could sustain 100 rps but had essentially no burst tolerance: no queue, no buffering. A request either got handled promptly or it timed out. That is precisely the case a leaky bucket is built for. Requests go into a fixed-size queue, and a background process drains the queue and forwards requests to downstream at a strictly constant rate, regardless of how bursty the arrivals were. Where a token bucket answers "is this request allowed right now," a leaky bucket answers "here is a smoothed, constant-rate output no matter how the input arrived," and a request that would overflow the queue gets rejected up front rather than forwarded downstream in a spike.
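class LeakyBucket {
  constructor(rate, max_queue_size, forward) {
    this.queue = []; // requests waiting to be forwarded
    this.max_queue_size = max_queue_size;
    // Background loop: forward at most one queued request every 1000/rate ms.
    // setInterval timing is approximate, so the drain rate is too.
    this.timer = setInterval(() => {
      if (this.queue.length > 0) forward(this.queue.shift());
    }, 1000 / rate);
  }
  allow(request) {
    if (this.queue.length >= this.max_queue_size) return false; // reject, bucket overflowing
    this.queue.push(request);
    return true; // accepted, will be forwarded at the fixed drain rate
  }
  stop() { clearInterval(this.timer); } // stop the drain loop on shutdown
}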
With the leaky bucket in front of billing, the same idle-then-burst client still gets all of its requests accepted up to the queue limit, but downstream never sees more than 100 rps leave the queue, because the drain rate is fixed independent of the arrival pattern. The burst gets absorbed and smoothed into a straight line instead of passed through as a spike. We kept the token bucket at the outer edge of the API, where bursts are fine and low latency for legitimate spiky clients matters, and put a leaky bucket specifically in front of the billing dependency that had zero burst tolerance.
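The smoothing can be checked with a small deterministic simulation. This sketch replaces the timer with an explicit `drainOne()` step and uses an assumed `max_queue_size` of 200; both are illustrative choices, not the values from our deployment.

```javascript
// Leaky bucket with an explicit drain step instead of a timer. drainOne()
// stands in for one firing of the background loop (every 1000/rate ms).
class SimLeakyBucket {
  constructor(max_queue_size, forward) {
    this.queue = [];
    this.max_queue_size = max_queue_size;
    this.forward = forward;
  }
  allow(request) {
    if (this.queue.length >= this.max_queue_size) return false;
    this.queue.push(request);
    return true;
  }
  drainOne() {
    if (this.queue.length > 0) this.forward(this.queue.shift());
  }
}

const rate = 100;                    // forwarded requests per second
const drainIntervalMs = 1000 / rate; // 10 ms between forwards
const forwardedAtMs = [];            // when each request reached downstream
let nowMs = 0;

// max_queue_size = 200 is an assumed value for this example.
const leaky = new SimLeakyBucket(200, () => forwardedAtMs.push(nowMs));

// Burst: 1,000 requests arrive at t = 0.
let accepted = 0;
for (let i = 0; i < 1000; i++) {
  if (leaky.allow({ id: i })) accepted++;
}

// Run the drain loop for 3 simulated seconds.
for (nowMs = drainIntervalMs; nowMs <= 3000; nowMs += drainIntervalMs) {
  leaky.drainOne();
}

// Count forwards per 100 ms slot and take the busiest one.
const perSlot = {};
for (const t of forwardedAtMs) {
  const slot = Math.floor(t / 100);
  perSlot[slot] = (perSlot[slot] || 0) + 1;
}
const peakSlot = Math.max(...Object.values(perSlot));
console.log(accepted, forwardedAtMs.length, peakSlot); // 200 200 10
```

The burst is accepted up to the queue limit, and the downstream never sees more than 10 requests in any 100 ms slot. The cost is latency: the last accepted request waits two seconds in the queue, which is the number to weigh when sizing max_queue_size.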
Why it hid
"Rate limiter" reads as a single concept, and token bucket is the default most people reach for because it is simple, memory-light, and forgiving of legitimate bursty clients. Its defining property, that saved-up capacity can be spent all at once, is also its documented behavior, not a bug, so nobody flags it in review. The mismatch only exists relative to a specific downstream's actual tolerance for burstiness, which is a property of the thing being protected, not of the rate limiter itself, so it never shows up by inspecting the limiter in isolation.
Rules of thumb
- A token bucket enforces a long-run average rate and explicitly allows saved-up capacity to be spent in a single burst up to its size. That is the design, not a defect.
- A leaky bucket enforces a constant output rate regardless of how bursty the input is, by queueing and draining at a fixed pace, at the cost of added latency for queued requests.
- Pick the algorithm based on what the protected resource can actually absorb: burst-tolerant downstreams fit a token bucket, while downstreams with no burst headroom need a leaky bucket or a hard concurrency cap.
- Load tests built from steady-rate traffic generators will never exercise a token bucket's saved-up-burst path. Test with idle-then-burst patterns specifically.
- Sizing burst_size down reduces the blast radius but does not change the underlying behavior. Know which failure mode you are choosing, not just which number.
- A rate limiter reporting correct enforcement at its own layer does not guarantee the downstream experiences a smooth rate. Measure the request pattern the downstream actually receives, not just the limiter's own counters.
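For the last rule, one way to measure what the downstream actually receives is to record arrival timestamps at the downstream and compute the peak count over a short sliding window. This is a minimal sketch; `peakInWindow` and the sample data are hypothetical, and the window should match whatever timescale the downstream is sensitive to.

```javascript
// Given sorted arrival timestamps (ms) recorded at the downstream, returns
// the largest number of requests whose timestamps span less than windowMs.
function peakInWindow(sortedTimesMs, windowMs) {
  let peak = 0;
  let start = 0;
  for (let end = 0; end < sortedTimesMs.length; end++) {
    // Slide the left edge forward until the span is shorter than the window.
    while (sortedTimesMs[end] - sortedTimesMs[start] >= windowMs) start++;
    peak = Math.max(peak, end - start + 1);
  }
  return peak;
}

// Two hypothetical clients, each sending 100 requests within one second.
const smooth = Array.from({ length: 100 }, (_, i) => i * 10); // every 10 ms
const bursty = Array.from({ length: 100 }, (_, i) => i);      // every 1 ms

console.log(peakInWindow(smooth, 1000), peakInWindow(bursty, 1000)); // 100 100
console.log(peakInWindow(smooth, 100), peakInWindow(bursty, 100));   // 10 100
```

Both clients look identical on a per-second counter. Only the 100 ms window separates them.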