The Lock That Held Two Owners: A GC Pause Versus a Redis TTL
We used a Redis lock with a TTL to make sure exactly one worker processed each payout. For months it worked, then a batch ran twice and double-paid eleven accounts. The lock was doing its job perfectly. The problem was that a stop-the-world GC pause outlived the lock's expiry, so a worker that believed it still held the lock kept working while a second worker had already taken it.
We process payouts in a background job, and the iron rule is that each payout batch runs exactly once. To enforce that across several worker instances we used the standard pattern: grab a lock in Redis with SET key worker-id NX PX 30000, do the work, then release it. NX makes the SET succeed only if the key does not already exist, so only one worker can hold it, and PX 30000 gives the key a 30-second TTL, so if a worker crashes mid-job the lock expires and someone else can pick it up instead of the batch hanging forever. Clean, well-understood, in production for months. Then a batch processed twice and we double-paid eleven accounts.
The lock did exactly what it promised
The first instinct was that the lock was broken, that two workers had somehow both gotten NX to succeed. They had not. The Redis logs were clear: worker A acquired the lock at 10:02:14, the key expired at 10:02:44, and worker B acquired it cleanly at 10:02:45. At no point did two workers hold the key at the same time. The lock service was behaving perfectly. The bug was in the gap between what the lock guarantees and what we assumed it guaranteed.
What we wanted was "only one worker is ever doing the work at a time". What a TTL lock actually gives you is "only one worker holds this key at a time". Those sound identical until a worker keeps doing the work after its key has quietly expired out from under it.
The pause that broke the assumption
Worker A grabbed the lock and started the batch. Partway through, the JVM hit a long stop-the-world garbage collection pause. We later found it in the GC logs: a full GC that froze every application thread for 37 seconds. During that freeze, worker A's code was not running at all, so it could not finish, could not extend the lock, could not do anything. Meanwhile Redis, which knew nothing about worker A's frozen state, did exactly what it was told and expired the key at the 30-second mark. Worker B, polling for work, saw a free lock, acquired it, and started the same batch.
Then worker A's GC pause ended. From worker A's point of view, no time had passed and it still held the lock, so it simply continued from where it froze and finished the batch. Now both workers had run the same payouts. The key was never held by two owners at once, but the work was done twice, because a process that had lost its lock had no way of knowing it had lost it.
10:02:14 A: SET lock A NX PX 30000 -> OK, starts batch
10:02:20 A: ... long GC pause begins (stop-the-world) ...
10:02:44 redis: lock EXPIRED (A still frozen)
10:02:45 B: SET lock B NX PX 30000 -> OK, starts SAME batch
10:02:57 A: ... GC pause ends, A resumes, finishes batch ...
=> both A and B processed the payouts
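The timeline can be replayed with a small in-memory model. This is a sketch, not our production code: FakeClock and TtlLock are hypothetical stand-ins for wall-clock time and for the behaviour of SET key value NX PX ttl, and times are seconds since 10:02:14.

```python
class FakeClock:
    """Hypothetical stand-in for wall-clock time, in seconds."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class TtlLock:
    """In-memory model of SET key value NX PX ttl (not a Redis client)."""
    def __init__(self, clock, ttl_seconds):
        self.clock = clock
        self.ttl = ttl_seconds
        self.owner = None
        self.expires_at = 0.0

    def holder(self):
        # An expired key behaves as if it does not exist.
        if self.owner is not None and self.clock.now >= self.expires_at:
            self.owner = None
        return self.owner

    def acquire(self, worker_id):
        # NX: succeed only if nobody currently holds the key.
        if self.holder() is not None:
            return False
        self.owner = worker_id
        self.expires_at = self.clock.now + self.ttl
        return True


clock = FakeClock()
lock = TtlLock(clock, ttl_seconds=30)
processed_by = []  # which workers ran the batch

assert lock.acquire("A")          # t=0  (10:02:14) A acquires
clock.advance(6)                  # t=6  (10:02:20) A's GC pause begins
a_believes_it_holds_lock = True   # A's last knowledge before freezing

clock.advance(25)                 # t=31 (10:02:45) key expired at t=30
assert lock.holder() is None
assert lock.acquire("B")          # B acquires cleanly
processed_by.append("B")          # B runs the batch

clock.advance(12)                 # t=43 (10:02:57) A's pause ends
if a_believes_it_holds_lock:      # A never re-checks, so it carries on
    processed_by.append("A")

print(processed_by)               # ['B', 'A']
print(lock.holder())              # B
```

At no point does the model report two holders, yet the batch is processed twice, which is the gap between "holds the key" and "is doing the work".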
Why a longer TTL is not the fix
The obvious reaction is to make the TTL longer than any plausible pause. This is a trap. Whatever number you pick, a pause, a network partition, or an overloaded box can exceed it, and you are just betting that the worst pause is smaller than your guess. Worse, a long TTL has a real cost: if a worker genuinely crashes, the lock is now held for that whole long duration before anyone else can take over, so you trade a rare correctness bug for routine long stalls. Tuning the timeout only moves the risk around. It cannot remove it, because no timeout can distinguish "this worker is dead" from "this worker is frozen and about to wake up".
Auto-renewing the lock with a background heartbeat thread has the same flaw. If the whole process is paused by GC, the heartbeat thread is paused too, so it cannot renew, and meanwhile the renewal logic gives you false confidence that the lock is safe. The heartbeat helps with slow work, not with a frozen process.
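A rough sketch of why the heartbeat does not help. We did not run renewal in the original setup, so the 10-second renewal interval and the 20-second job length below are assumed values, and lock_survives is a hypothetical helper that steps through simulated time one second at a time.

```python
def lock_survives(ttl, renew_every, pause_start, pause_length, job_length):
    """Return True if the key never expires before the job finishes.

    All values are whole seconds. The heartbeat tries to renew every
    `renew_every` seconds, but a stop-the-world pause freezes it along
    with the worker.
    """
    expires_at = ttl
    pause_end = pause_start + pause_length
    finish_at = job_length + pause_length  # the pause delays completion
    for now in range(1, finish_at + 1):
        paused = pause_start <= now < pause_end
        if not paused and now % renew_every == 0 and now < expires_at:
            expires_at = now + ttl  # renewal resets the TTL
        if now >= expires_at:
            return False  # key expired while the job was still running
    return True


# Incident numbers (30 s TTL, pause starting at 6 s and lasting 37 s)
# plus two assumed values: a 10 s heartbeat and a 20 s job.
print(lock_survives(30, 10, 6, 37, 20))  # False
print(lock_survives(30, 10, 6, 0, 20))   # True
```

With no pause the heartbeat keeps the key alive for the whole job. With the 37-second pause, every renewal that would have fired is skipped and the key still expires at the 30-second mark.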
The real fix: fencing tokens
The fix is to stop trusting the lock holder and instead make the protected resource reject stale writers. Every time the lock is acquired, the lock service hands out a monotonically increasing number, a fencing token. The worker includes that token with every write to the resource it is protecting, and the resource remembers the highest token it has seen and refuses anything lower. This makes the system correct regardless of pauses, because lateness becomes detectable at the point that actually matters.
// acquire returns a strictly increasing token
token = lock.acquire("payout-batch-42") // e.g. 17
// every write carries the token; the store enforces monotonicity
store.write(batchResult, fencingToken = token)
// resource side: reject anything older than the newest token seen;
// an equal token is the current holder writing again, so it passes
if (incoming.token < lastSeenToken) reject("stale writer");
else { lastSeenToken = incoming.token; apply(incoming); }
Replay the incident with tokens: worker A acquires with token 17. While it is frozen, worker B acquires with token 18 and writes its results, so the store's high-water mark is now 18. When worker A wakes and tries to commit with token 17, the store sees that 17 is lower than 18 and rejects the write. Worker A's stale work is thrown away at the door. Only one set of results lands, no matter how long the pause was. The lock can still be wrong about who holds it, but it can no longer cause double processing, because the resource itself is the final arbiter.
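The replay can be written out as a minimal runnable sketch. LockService, FencedStore, and StaleWriter are hypothetical names for illustration, not a real library, and the lock service here only hands out tokens (expiry and ownership are left out). The store rejects tokens lower than the highest it has seen and accepts an equal token, so the current holder can write more than once.

```python
import itertools


class LockService:
    """Hypothetical lock service: a strictly increasing token per acquisition."""
    def __init__(self, first_token=1):
        self._tokens = itertools.count(first_token)

    def acquire(self, name):
        # Expiry and ownership are omitted; only the token matters here.
        return next(self._tokens)


class StaleWriter(Exception):
    pass


class FencedStore:
    """Resource that remembers the highest token seen and refuses lower ones."""
    def __init__(self):
        self.last_seen_token = 0
        self.applied = []

    def write(self, result, fencing_token):
        if fencing_token < self.last_seen_token:
            raise StaleWriter(f"token {fencing_token} < {self.last_seen_token}")
        self.last_seen_token = fencing_token
        self.applied.append((fencing_token, result))


locks = LockService(first_token=17)
store = FencedStore()

token_a = locks.acquire("payout-batch-42")   # 17, then A freezes
token_b = locks.acquire("payout-batch-42")   # 18, after A's key expired
store.write("payouts from B", token_b)       # accepted, high-water mark 18

try:
    store.write("payouts from A", token_a)   # A wakes up and tries to commit
    a_committed = True
except StaleWriter:
    a_committed = False

print(a_committed)      # False
print(store.applied)    # [(18, 'payouts from B')]
```

In a real store the comparison and the write would have to happen atomically, for example inside a single transaction, or two writers could slip past the check together.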
When a true monotonic token from the lock service is awkward, the same idea shows up as conditional writes: a compare-and-set on a version column, or an idempotency key on the payout so a duplicate is a no-op. The common thread is that correctness lives at the resource, not in the lock. The lock becomes an optimization that prevents wasteful concurrent work, while the fencing check is what actually guarantees safety.
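Both alternatives can be sketched with Python's built-in sqlite3 module. The table layout, column names, and key format below are assumptions made for the example, not our actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE batches (id TEXT PRIMARY KEY, status TEXT, version INTEGER)")
db.execute("CREATE TABLE payouts (idempotency_key TEXT PRIMARY KEY, amount_cents INTEGER)")
db.execute("INSERT INTO batches VALUES ('batch-42', 'pending', 1)")


def complete_batch(batch_id, expected_version):
    """Compare-and-set: succeeds only if nobody bumped the version first."""
    cur = db.execute(
        "UPDATE batches SET status = 'done', version = version + 1 "
        "WHERE id = ? AND version = ?",
        (batch_id, expected_version),
    )
    return cur.rowcount == 1


def record_payout(idempotency_key, amount_cents):
    """Idempotency key: the primary key refuses a second insert."""
    try:
        db.execute("INSERT INTO payouts VALUES (?, ?)", (idempotency_key, amount_cents))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate, treated as a no-op


# Both workers read version 1 before doing the work.
print(complete_batch("batch-42", expected_version=1))  # True  (worker B)
print(complete_batch("batch-42", expected_version=1))  # False (worker A, stale)

print(record_payout("batch-42:acct-1001", 2500))  # True
print(record_payout("batch-42:acct-1001", 2500))  # False
print(db.execute("SELECT COUNT(*) FROM payouts").fetchone()[0])  # 1
```

In both cases the database decides which write wins, so a worker that wakes up late cannot apply its results a second time.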
Why it hid for months
A 37-second stop-the-world pause is rare. It needs a big heap, the wrong GC settings, and unlucky timing all at once, and it has to land in the middle of a job and outlast whatever time is left on the lock. For months none of those lined up, so the lock looked airtight and we trusted it completely. That is the dangerous part of this bug class: the lock works correctly almost always, which trains you to believe it is sufficient, right up until a pause longer than your timeout proves that "holds the key" was never the same promise as "is the only one working".
Rules of thumb
- A TTL lock guarantees only that one process holds the key, not that one process is doing the work. A pause longer than the TTL breaks that assumption.
- A frozen process does not know it is frozen. When it wakes it will keep working as if it still holds a lock that already expired and got reassigned.
- Tuning the TTL only changes which failure you get: shorter means false expiry under load, longer means long stalls on real crashes. No timeout separates "dead" from "paused".
- Heartbeat auto-renewal does not survive a stop-the-world pause, because the renewing thread is paused too.
- Put correctness at the resource with fencing tokens: a monotonic number per acquisition that the resource enforces, rejecting any write with a token it has already surpassed.
- If a real fencing token is impractical, get the same protection from conditional writes, version compare-and-set, or idempotency keys on the operation.
- Treat the lock as a performance optimization to avoid duplicate work, and treat the fencing check as the thing that actually keeps you correct.