ML
Concurrency

The ABA Problem: When Compare-And-Swap Said Nothing Changed and It Lied

A lock-free stack built on compare-and-swap passed every load test and then corrupted itself in production, popping a node that had already been freed. The CAS never failed, and a failing CAS was the only case anyone had thought to worry about. The bug was a value swinging from A to B and back to A while a paused thread was not looking, so the compare saw the same bit pattern and happily proceeded on data that was no longer the data it started with.

July 02, 20269 min readConcurrencyLock-Free

We had a lock-free stack backing a per-core work queue, built the standard way on top of compare_and_swap: read the head pointer, compute the new head, CAS the old value in for the new one, retry on failure. It handled millions of pushes and pops a second under load tests with zero corruption. In production, under real scheduling jitter, it occasionally popped a node whose memory had already been reused by something else, and the thread that got it dereferenced garbage. The crash was in code that had not been touched in a year, on a data structure whose whole design point was correctness under concurrency. Every CAS in the failing path had reported success.

The operation that looked atomic and safe

A lock-free pop reads the current head, reads that node's next pointer, and then does one CAS: replace the head with next, but only if the head is still the node we just read. If another thread popped first, the CAS fails on the mismatch and we retry. That is the entire safety argument, and it depends on one assumption: if the CAS reports that the head still equals the value we read, then nothing relevant happened to the stack while we were reading next.

Node* pop() {
  Node* head = top.load();
  while (head != nullptr) {
    Node* next = head->next;              // read while unpaused
    if (top.compare_exchange_weak(head, next))  // only succeeds if top is unchanged
      return head;
    // else: top changed, head now holds the fresh value, retry
  }
  return nullptr;
}

That assumption is the bug. The CAS only checks that the pointer value stored in top is bit-for-bit equal to what we last read. It cannot see whether the stack underneath that pointer stayed the same. It can only see the address.

A, then B, then A again

Thread 1 reads head = A and next = A->next = B, and is then paused right there, before its CAS runs, by a scheduler quantum expiring at the worst possible instant. While it is paused, threads 2 and 3 run to completion: thread 2 pops A, thread 3 pops B, and then something pushes a brand new node that happens to get allocated at the exact same address A used to occupy, because our allocator recycles freed memory quickly. That new node is pushed back onto the stack. The stack's top pointer is now, again, the bit pattern A, but it identifies a different node with a different next pointer than the one thread 1 read.

Thread 1 wakes up and runs its CAS: compare top to the head it is holding, which is the address A. They match, because the address genuinely is the same, so the CAS reports success and installs next, the B pointer thread 1 read minutes-in-CPU-time ago, as the new top. But B was already popped and possibly already freed or reused. The stack now points at memory it does not own, and every push or pop after that walks a corrupted chain.

t1: head = A, next = B          (read, then preempted)
t2: pop A                       -- top is now B
t3: pop B                       -- top is now null
??: push new node at address A  -- allocator reused A's memory, top is now A (different node)
t1: wakes up, CAS(top: A -> B)  -- succeeds! top == A by address, but it's not the same A
                                 -- stack now points at B, which is freed/stale

The CAS did exactly what it was asked: verify the current value equals the expected value. Nobody asked it to verify that the value had not been recycled through a different identity in between. That gap, address equality standing in for identity, is the ABA problem: A changed to B and back to A, and a check that only compares current state to expected state cannot tell the difference between "nothing happened" and "everything happened and it circled back."

Why it survived load testing

Reproducing this needs a specific interleaving: a thread has to be preempted between the read and the CAS, and in that exact window another thread has to fully recycle the same memory address back into the same data structure. Under a synthetic load test with a handful of threads hammering the same operation in a tight loop, the timing rarely lines up, and even when a thread is preempted, the freed node often is not reused fast enough, or is reused somewhere else entirely, to reappear at the same address before the CAS runs. In production, with real scheduling noise, other allocations competing for the same memory pool, and the stack running for days instead of minutes, the exact window opened often enough to matter. The bug was not rare in an absolute sense. It was rare per operation and we were doing billions of operations.

The fixes that do not actually work

Retrying more aggressively, adding a memory fence around the read, or double-checking next after the CAS all miss the point, because the problem is not a memory ordering bug. The CAS is reading correctly ordered, up-to-date memory. It is comparing the right bytes and getting a true match. The bytes are simply not a reliable proxy for "this is the same logical value I read before," because the underlying node was freed and reallocated in between. No amount of fencing recovers information that was already destroyed when the node got reused.

The real fix: make identity bigger than an address

The standard fix is a tagged pointer, sometimes called a "stamped" or "counted" pointer: pack the address together with a monotonically incrementing counter, and CAS the pair as one atomic unit. Every push or pop increments the counter regardless of which address comes back. Now the sequence A, then B, then A again produces two different tagged values, (A, tag=5) and (A, tag=8), even though the address portion matches. The CAS in the paused thread compares the full tagged value, sees the tag no longer matches, and correctly fails and retries instead of committing corruption.

struct TaggedPtr { Node* addr; uint64_t tag; };  // CAS'd as one 128-bit unit

Node* pop() {
  TaggedPtr old_top = top.load();
  while (old_top.addr != nullptr) {
    TaggedPtr new_top{old_top.addr->next, old_top.tag + 1};
    if (top.compare_exchange_weak(old_top, new_top))
      return old_top.addr;
  }
  return nullptr;
}

This needs double-width CAS support, cmpxchg16b on x86-64 for a pointer plus a 64-bit counter, which most modern platforms have but which is a real portability constraint. Where that is not available, hazard pointers or epoch-based reclamation solve the same problem from a different angle: instead of tagging every pointer, a thread publishes which nodes it is currently touching, and the allocator refuses to actually recycle a node's memory until no thread has it hazard-marked. That removes the "same address, different content" scenario entirely, because the address genuinely cannot be reused while anyone might still be reading it.

Why it hid

Everyone who reviews a CAS loop checks the failure branch, because a failing CAS retrying is the obviously interesting case. A CAS that succeeds looks like the boring, correct path by construction, because success is supposed to mean "nothing changed." ABA breaks exactly that intuition: the dangerous case is a CAS that succeeds when it should not have, and there is no local signal that anything is wrong. The stack looks internally consistent right up until a pointer walk hits freed memory somewhere downstream, often in a completely different part of the code than the pop that caused it.

Rules of thumb

  • Compare-and-swap only proves that a value's current bit pattern matches an earlier reading. It cannot prove that nothing happened to the underlying data in between.
  • The ABA problem shows up wherever memory or values can be recycled: pointer-based lock-free structures, freelists, or any CAS loop on a value drawn from a reused pool.
  • A failing CAS is not the dangerous case, it just retries. A succeeding CAS that should have failed is the one that corrupts state, and it leaves no local trace.
  • Tagged or stamped pointers turn "same address" into "same address and same generation" by CASing a counter alongside the pointer, closing the gap that plain address comparison leaves open.
  • Hazard pointers and epoch-based reclamation solve the same problem by delaying reuse instead of detecting it, preventing the same address from ever meaning two different things while a thread might still be reading it.
  • This class of bug needs a specific preemption window plus fast memory reuse to reproduce, so it can pass load tests for a long time and only surface under real production scheduling and allocation pressure.
SharePostLinkedIn

Reader Discussion

6 replies// weighed in

TopNewestAuthor
Add to the thread
Disagree, agree harder, or share your own experience…
Email instead →markdown okbe kind
  1. Irene Chen· Staff EngineerAgrees

    "push the check into the write" is now the framing I use teaching juniors. once you see check-then-act anti-patterns you can't un-see them — they're hiding in literally every internal tool we have.

    Jul 04, 2026·2 days later
  2. Bảo Trần🇻🇳 Cần Thơ· Software EngineerStory

    Bọn em từng deadlock cổ điển 2-row trong ledger. Ordering by account_id ASC trước khi lock — 1 dòng commit, drop deadlock retries 98% trong tuần. Nhớ mãi vì PR đó merge lúc mình về quê ăn Tết.

    Jul 05, 2026·3 days later
  3. Sofia Marquez· Backend LeadAgrees

    immutability as default is the single most under-rated concurrency advice. every nightmare I've debugged in 8 years comes back to someone mutating shared state "just this once." make wrong things hard to express.

    Jul 06, 2026·4 days later
  4. Hiếu Nguyễn· Full StackPushback

    tiny precision nit — volatile in Java provides visibility AND atomicity for single 32-bit reads/writes (long/double on legacy 32-bit JVMs is the exception). worth being precise because juniors read "visibility primitive" and reach for AtomicInteger when volatile is enough.

    Jul 09, 2026·1 week later·edited
  5. Maya Iyer· PlatformFrom experience

    Go's race detector is criminally under-used. Caught a bug in our scheduler we'd been running past for 6 months — turned out our "thread-safe" map was thread-safe in the way a chair is bulletproof. -race in CI, no exceptions.

    Jul 07, 2026·5 days later
  6. Isabella Costa· Junior EngineerKind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Jul 04, 2026·2 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email