The Lock That Held Two Owners: A GC Pause Versus a Redis TTL

We used a Redis lock with a TTL to make sure exactly one worker processed each payout. For months it worked, then a batch ran twice and double-paid a handful of accounts. The lock was doing its job perfectly. The problem was that a stop-the-world GC pause outlived the lock's expiry, so a worker that believed it still held the lock kept working while a second worker had already taken it.

June 28, 2026 · 9 min read · Concurrency, Distributed Systems

We process payouts in a background job, and the iron rule is that each payout batch runs exactly once. To enforce that across several worker instances we used the standard pattern: grab a lock in Redis with SET key worker-id NX PX 30000, do the work, then release it. NX means only one worker can hold it, and the 30-second TTL means that if a worker crashes mid-job the lock expires and someone else can pick it up instead of the batch hanging forever. Clean, well-understood, in production for months. Then a batch processed twice and we double-paid eleven accounts.

The lock did exactly what it promised

The first instinct was that the lock was broken, that two workers had somehow both gotten NX to succeed. They had not. The Redis logs were clear: worker A acquired the lock at 10:02:14, the key expired at 10:02:44, and worker B acquired it cleanly at 10:02:45. At no point did two workers hold the key at the same time. The lock service was behaving perfectly. The bug was in the gap between what the lock guarantees and what we assumed it guaranteed.

What we wanted was "only one worker is ever doing the work at a time". What a TTL lock actually gives you is "only one worker holds this key at a time". Those sound identical until a worker keeps doing the work after its key has quietly expired out from under it.

The pause that broke the assumption

Worker A grabbed the lock and started the batch. Partway through, the JVM hit a long stop-the-world garbage collection pause. We later found it in the GC logs: a full GC that froze every application thread for 37 seconds. During that freeze, worker A's code was not running at all, so it could not finish, could not extend the lock, could not do anything. Meanwhile Redis, which knew nothing about worker A's frozen state, did exactly what it was told and expired the key at the 30-second mark. Worker B, polling for work, saw a free lock, acquired it, and started the same batch.

Then worker A's GC pause ended. From worker A's point of view, no time had passed and it still held the lock, so it simply continued from where it froze and finished the batch. Now both workers had run the same payouts. The key was never held by two owners at once, but the batch was processed by two workers, because a process that lost its lock had no idea it had lost it.

10:02:14  A: SET lock A NX PX 30000  -> OK, starts batch
10:02:20  A: ... long GC pause begins (stop-the-world) ...
10:02:44  redis: lock EXPIRED (A still frozen)
10:02:45  B: SET lock B NX PX 30000  -> OK, starts SAME batch
10:02:57  A: ... GC pause ends, A resumes, finishes batch ...
          => both A and B processed the payouts
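The timeline can be replayed with a small simulation. This is a minimal sketch, not our production code: FakeRedis is a hypothetical in-memory stand-in that mimics only the SET NX PX semantics, and time is a plain integer (seconds after 10:02:00) so the pause is deterministic.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in that mimics only SET key value NX PX ttl."""

    def __init__(self):
        self.now = 0          # simulated clock, in seconds after 10:02:00
        self._locks = {}      # key -> (owner, expires_at)

    def set_nx_px(self, key, owner, ttl_seconds):
        """Acquire the key if it is absent or expired; return True on success."""
        entry = self._locks.get(key)
        if entry is not None and entry[1] > self.now:
            return False      # someone still holds an unexpired key
        self._locks[key] = (owner, self.now + ttl_seconds)
        return True

    def holder(self, key):
        """Return the current unexpired owner, or None."""
        entry = self._locks.get(key)
        if entry is None or entry[1] <= self.now:
            return None
        return entry[0]


redis = FakeRedis()
processed_by = []             # which workers paid out the batch

redis.now = 14                # 10:02:14, A acquires and starts the batch
assert redis.set_nx_px("lock", "A", 30)

redis.now = 20                # 10:02:20, A's GC pause begins; A runs no code

redis.now = 45                # 10:02:45, the key expired at :44, so B can acquire
a_lost_key = redis.holder("lock") is None
assert redis.set_nx_px("lock", "B", 30)
processed_by.append("B")      # B runs the batch

redis.now = 57                # 10:02:57, A wakes and finishes, unaware of the expiry
processed_by.append("A")

print(processed_by)           # both workers processed the same batch
print(redis.holder("lock"))   # the key itself only ever had one owner at a time
```

Nothing in the lock's own state is ever inconsistent in this replay. The damage is entirely in processed_by, which the lock never sees.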

Why a longer TTL is not the fix

The obvious reaction is to make the TTL longer than any plausible pause. This is a trap. Whatever number you pick, a pause, a network partition, or an overloaded box can exceed it, and you are just betting that the worst pause is smaller than your guess. Worse, a long TTL has a real cost: if a worker genuinely crashes, the lock is now held for that whole long duration before anyone else can take over, so you trade a rare correctness bug for routine long stalls. Tuning the timeout only moves the risk around. It cannot remove it, because no timeout can distinguish "this worker is dead" from "this worker is frozen and about to wake up".

Auto-renewing the lock with a background heartbeat thread has the same flaw. If the whole process is paused by GC, the heartbeat thread is paused too, so it cannot renew, and meanwhile the renewal logic gives you false confidence that the lock is safe. The heartbeat helps with slow work, not with a frozen process.
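A rough sketch with a simulated clock shows the difference between slow work and a frozen process. The function name first_expiry and the 10-second renewal interval are illustrative assumptions, not values from our system. The 30-second TTL and the 37-second pause starting 6 seconds into the job come from the incident.

```python
def first_expiry(ttl, renew_every, pause_start, pause_length, horizon):
    """Return the first second at which the lease lapses, or None if it never does.

    The lease starts at t=0. A heartbeat fires every `renew_every` seconds and
    resets the expiry to now + ttl, but only while the process is running.
    During [pause_start, pause_start + pause_length) nothing in the process
    runs, the heartbeat included.
    """
    expires_at = ttl
    for now in range(1, horizon + 1):
        if now >= expires_at:
            return now                        # the key has expired
        paused = pause_start <= now < pause_start + pause_length
        if now % renew_every == 0 and not paused:
            expires_at = now + ttl            # heartbeat extends the lease
    return None


# Frozen process: the renewals due at t=10 and t=20 never fire.
print(first_expiry(30, 10, pause_start=6, pause_length=37, horizon=120))  # 30

# Slow but running process: the heartbeat keeps the lease alive.
print(first_expiry(30, 10, pause_start=0, pause_length=0, horizon=120))   # None
```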

The real fix: fencing tokens

The fix is to stop trusting the lock holder and instead make the protected resource reject stale writers. Every time the lock is acquired, the lock service hands out a monotonically increasing number, a fencing token. The worker includes that token with every write to the resource it is protecting, and the resource remembers the highest token it has seen and refuses anything lower. This makes the system correct regardless of pauses, because lateness becomes detectable at the point that actually matters.

// acquire returns a strictly increasing token
token = lock.acquire("payout-batch-42")   // e.g. 17

// every write carries the token; the store enforces monotonicity
store.write(batchResult, fencingToken = token)

// resource side: reject anything older than the newest token seen;
// the check and the update must happen as one atomic step
if (incoming.token < lastSeenToken) reject("stale writer");
else { lastSeenToken = incoming.token; apply(incoming); }

Replay the incident with tokens: worker A acquires with token 17. While it is frozen, worker B acquires with token 18 and writes its results, so the store's high-water mark is now 18. When worker A wakes and tries to commit with token 17, the store sees that 17 is lower than 18 and rejects the write. Worker A's stale work is thrown away at the door. Only one set of results lands, no matter how long the pause was. The lock can still be wrong about who holds it, but it can no longer cause double processing, because the resource itself is the final arbiter.
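The same replay as a runnable sketch. FencedStore and StaleWriterError are hypothetical names, and the counter starting at 17 stands in for whatever the lock service would hand out. This version rejects only tokens lower than the highest one seen, so a single holder can make several writes with the same token. The check and the update sit under one mutex because they have to be atomic.

```python
import itertools
import threading


class StaleWriterError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """Hypothetical resource that enforces fencing tokens on every write."""

    def __init__(self):
        self._guard = threading.Lock()   # check and update must be atomic
        self._highest_token = 0
        self.results = []

    def write(self, result, fencing_token):
        with self._guard:
            if fencing_token < self._highest_token:
                raise StaleWriterError(
                    f"token {fencing_token} < {self._highest_token}"
                )
            self._highest_token = fencing_token
            self.results.append(result)


tokens = itertools.count(17)     # stand-in for the lock service's counter
store = FencedStore()

token_a = next(tokens)           # A acquires with token 17, then freezes
token_b = next(tokens)           # the lock expires; B acquires with token 18

store.write("payouts from B, part 1", token_b)
store.write("payouts from B, part 2", token_b)   # same token, still accepted

try:
    store.write("payouts from A", token_a)       # A wakes up and tries to commit
    a_rejected = False
except StaleWriterError:
    a_rejected = True

print(a_rejected)        # True
print(store.results)     # only B's writes landed
```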

When a true monotonic token from the lock service is awkward, the same idea shows up as conditional writes: a compare-and-set on a version column, or an idempotency key on the payout so a duplicate is a no-op. The common thread is that correctness lives at the resource, not in the lock. The lock becomes an optimization that prevents wasteful concurrent work, while the fencing check is what actually guarantees safety.
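As one illustration of the idempotency-key variant, here is a minimal sketch using SQLite from the Python standard library. The payouts table, the pay_once helper, and the key format "batch-42:acct-1001" are all hypothetical. The point is only that a uniqueness constraint at the resource turns the second attempt into a no-op.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payouts ("
    " idempotency_key TEXT PRIMARY KEY,"
    " account TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)


def pay_once(conn, idempotency_key, account, amount_cents):
    """Record the payout; return True if this call applied it, False for a duplicate."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO payouts VALUES (?, ?, ?)",
            (idempotency_key, account, amount_cents),
        )
    return cursor.rowcount == 1   # 0 rows changed means the key already existed


first = pay_once(db, "batch-42:acct-1001", "acct-1001", 5000)    # worker B
second = pay_once(db, "batch-42:acct-1001", "acct-1001", 5000)   # worker A, after its pause
total = db.execute("SELECT COUNT(*), SUM(amount_cents) FROM payouts").fetchone()

print(first, second)   # True False
print(total)           # (1, 5000)
```

The duplicate is absorbed by the primary key, so neither worker needs to know whether its lock is still valid.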

Why it hid for months

A 37-second stop-the-world pause is rare. It needs a big heap, the wrong GC settings, and unlucky timing all at once, and it has to land in the middle of a job that happens to be running near the TTL boundary. For months none of those lined up, so the lock looked airtight and we trusted it completely. That is the dangerous part of this bug class: the lock works correctly almost always, which trains you to believe it is sufficient, right up until a pause longer than your timeout proves that "holds the key" was never the same promise as "is the only one working".

Rules of thumb

  • A TTL lock guarantees only that one process holds the key, not that one process is doing the work. A pause longer than the TTL breaks that assumption.
  • A frozen process does not know it is frozen. When it wakes it will keep working as if it still holds a lock that already expired and got reassigned.
  • Tuning the TTL only changes which failure you get: shorter means false expiry under load, longer means long stalls on real crashes. No timeout separates "dead" from "paused".
  • Heartbeat auto-renewal does not survive a stop-the-world pause, because the renewing thread is paused too.
  • Put correctness at the resource with fencing tokens: a monotonic number per acquisition that the resource enforces, rejecting any write with a token it has already surpassed.
  • If a real fencing token is impractical, get the same protection from conditional writes, version compare-and-set, or idempotency keys on the operation.
  • Treat the lock as a performance optimization to avoid duplicate work, and treat the fencing check as the thing that actually keeps you correct.

Reader Discussion

7 replies

  1. Hiếu Nguyễn · Full Stack · Pushback

    tiny precision nit — volatile in Java provides visibility AND atomicity for single 32-bit reads/writes (long/double on legacy 32-bit JVMs is the exception). worth being precise because juniors read "visibility primitive" and reach for AtomicInteger when volatile is enough.

    Jul 05, 2026·1 week later·edited
  2. Maya Iyer · Platform · From experience

    Go's race detector is criminally under-used. Caught a bug in our scheduler we'd been running past for 6 months — turned out our "thread-safe" map was thread-safe in the way a chair is bulletproof. -race in CI, no exceptions.

    Jul 03, 2026·5 days later
  3. Tomáš Havel · Senior Engineer · Agrees

    go channels solve a problem you don't have until you have it, and then they're the only thing that solves it. people reaching for sync.Mutex everywhere are usually one refactor away from a clean channel topology.

    Jul 04, 2026·6 days later
  4. Irene Chen · Staff Engineer · Agrees

    "push the check into the write" is now the framing I use teaching juniors. once you see check-then-act anti-patterns you can't un-see them — they're hiding in literally every internal tool we have.

    Jun 30, 2026·2 days later
  5. Bảo Trần 🇻🇳 Cần Thơ · Software Engineer · Story

    We once had a classic two-row deadlock in our ledger. Ordering by account_id ASC before locking was a one-line commit that cut deadlock retries by 98% within a week. I will always remember it because that PR merged while I was back in my hometown for Tết.

    Jul 01, 2026·3 days later
  6. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Jun 30, 2026·2 days later
  7. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Jul 02, 2026·4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email