
Race Conditions: The Bugs That Only Show Up In Production

Why your unit tests passed and your bank balance didn't. A practical intro to races, with Java, Go, and Postgres examples.

May 02, 2025 · 8 min read · Concurrency, Correctness

A race condition is a bug whose outcome depends on the relative timing of two or more operations. Each step is correct on its own; the interleaving isn't. That's why it passes tests, passes staging, and blows up on Black Friday.

The canonical example: lost update

// Two concurrent requests (T1, T2) redeem a 100-credit coupon for the same user:
const user = await findUser(id);        // T1 reads credits=100
                                         // T2 reads credits=100
user.credits -= 100;                     // T1 computes 0
await save(user);                        // T1 writes 0
                                         // T2 computes 0 from its stale read
                                         // T2 writes 0  → redeemed twice, charged once!

Two read-modify-write flows interleaved. Final state: the balance is 0, but two redemptions went through and only one was paid for. No exception, no stack trace, just wrong data.
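The interleaving above can be reproduced on a laptop. What follows is a minimal sketch, not the production code: the `users` map, `ioDelay`, and the bodies of `findUser`, `save`, and `redeem` are hypothetical in-memory stand-ins, with a zero-delay timer playing the role of a database round trip.

```javascript
// In-memory stand-in for the users table (hypothetical; a real app would query a database).
const users = new Map([[1, { id: 1, credits: 100 }]]);

// Resolves on a later event-loop turn, standing in for a network round trip.
const ioDelay = () => new Promise((resolve) => setTimeout(resolve, 0));

async function findUser(id) {
  await ioDelay();
  return { ...users.get(id) }; // a copy, like an ORM returning a fresh row object
}

async function save(user) {
  await ioDelay();
  users.set(user.id, { ...user });
}

// The racy read-modify-write: nothing stops another call from running between read and write.
async function redeem(id, amount) {
  const user = await findUser(id);
  if (user.credits < amount) return false;
  user.credits -= amount;
  await save(user);
  return true;
}

// Fire two redemptions at once and report what happened.
async function demo() {
  const results = await Promise.all([redeem(1, 100), redeem(1, 100)]);
  return { results, credits: users.get(1).credits };
}
```

Under these assumptions, `demo()` resolves to `{ results: [true, true], credits: 0 }`: both calls read 100 before either one wrote, so both passed the balance check.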

Three places races love to live

  1. Check-then-act. if (!exists) create(). Two threads both see "not exists," both create.
  2. Read-modify-write. As above. The cure is one atomic operation, not two.
  3. Initialisation. Lazy-initialised singletons that aren't synchronised correctly.
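The initialisation case bites even in single-threaded JavaScript as soon as the initialiser is asynchronous. Here is a rough sketch, assuming a hypothetical `connect()` that is expensive and should run once. The racy getter checks, awaits, then assigns; the safer one stores the in-flight promise synchronously, so there is no gap between the check and the act.

```javascript
let connectCalls = 0;

// Hypothetical expensive initialiser; the counter records how often it runs.
async function connect() {
  connectCalls += 1;
  await new Promise((resolve) => setTimeout(resolve, 0));
  return { connectionNumber: connectCalls };
}

// Racy: the null check and the assignment are separated by an await.
let racyClient = null;
async function getClientRacy() {
  if (racyClient === null) {
    // Every caller that arrives before this resolves also sees null and connects.
    racyClient = await connect();
  }
  return racyClient;
}

// Safer: the promise is stored before any await, so later callers reuse it.
let clientPromise = null;
function getClient() {
  if (clientPromise === null) {
    clientPromise = connect();
  }
  return clientPromise;
}
```

Three concurrent calls to `getClientRacy()` run `connect()` three times; three concurrent calls to `getClient()` run it once. One caveat of this sketch: a rejected promise stays cached, so real code would likely clear `clientPromise` on failure.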

The fixes, from cheapest to strongest

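  • Atomic DB operation. UPDATE users SET credits = credits - 100 WHERE id = ? AND credits >= 100. One statement, the database handles the concurrency. Check rows_affected: 0 means the balance was too low or another request got there first.
  • Optimistic locking. Version column, retry on conflict. Works well when conflicts are rare; under heavy contention the retries pile up.
  • Pessimistic lock. SELECT … FOR UPDATE. Slower, and multi-row transactions must take their locks in a consistent order to avoid deadlocks, but writers to the same row are serialised.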
  • Atomic primitive. In-memory: AtomicLong, sync.Mutex, channel-based ownership.
  • Idempotent design. Client sends an idempotency key; server deduplicates. Converts the race into a non-issue.
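To make the optimistic-locking option concrete, here is a minimal sketch against an in-memory table. Everything named here is an assumption for illustration: the `rows` map, the `version` field, and the helpers `readRow`, `updateIfVersion`, and `redeemOptimistic`. In a real system `updateIfVersion` would be a single UPDATE with the version in its WHERE clause.

```javascript
// In-memory stand-in for a table with a version column (hypothetical schema).
const rows = new Map([[1, { credits: 300, version: 0 }]]);

// Zero-delay timer standing in for a database round trip.
const roundTrip = () => new Promise((resolve) => setTimeout(resolve, 0));

async function readRow(id) {
  await roundTrip();
  return { ...rows.get(id) };
}

// Mimics: UPDATE t SET credits = ?, version = version + 1
//         WHERE id = ? AND version = ?      (returns rows affected)
async function updateIfVersion(id, credits, expectedVersion) {
  await roundTrip();
  const row = rows.get(id);
  if (row.version !== expectedVersion) return 0; // someone else wrote first
  rows.set(id, { credits, version: expectedVersion + 1 });
  return 1;
}

async function redeemOptimistic(id, amount, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const row = await readRow(id);
    if (row.credits < amount) return { ok: false, attempts: attempt };
    const affected = await updateIfVersion(id, row.credits - amount, row.version);
    if (affected === 1) return { ok: true, attempts: attempt };
    // The version moved under us: loop to re-read and recompute from fresh data.
  }
  throw new Error("too much contention, giving up");
}
```

With two concurrent `redeemOptimistic(1, 100)` calls, one succeeds on its first attempt, the other loses the version check, retries, and succeeds on its second. The balance ends at 100 instead of the 200 a lost update would leave behind.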

How to find races you don't know about

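Load test. A single-threaded benchmark cannot find them. Tools like jcstress (Java), the Go race detector (go test -race), and Jepsen-style fault injection make races far more likely to surface, but none of them is exhaustive: the Go race detector, for example, only reports races on code paths that actually run during the test.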
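The load-test idea can be sketched in a few lines: run the same operation many times concurrently, then check an invariant afterwards. The names `hammer`, `makeAccount`, `redeemRacy`, and `checkInvariant` are hypothetical, and the account is an in-memory toy, so treat this as an illustration of the shape of such a test and not as a replacement for the tools above.

```javascript
// Runs `operation` n times concurrently and returns how many calls reported success.
async function hammer(operation, n) {
  const results = await Promise.all(Array.from({ length: n }, () => operation()));
  return results.filter(Boolean).length;
}

// Hypothetical system under test: an in-memory balance with a racy redeem.
function makeAccount(initial) {
  let credits = initial;
  const tick = () => new Promise((resolve) => setTimeout(resolve, 0));
  return {
    balance: () => credits,
    async redeemRacy(amount) {
      const seen = credits;        // read
      await tick();                // the gap where other calls run
      if (seen < amount) return false;
      credits = seen - amount;     // write based on the stale read
      return true;
    },
  };
}

// Invariant: every successful redemption must be paid for.
async function checkInvariant(concurrency) {
  const account = makeAccount(100);
  const successes = await hammer(() => account.redeemRacy(100), concurrency);
  const spent = 100 - account.balance();
  return { successes, spent, holds: successes * 100 === spent };
}
```

With `checkInvariant(1)` the invariant holds, which is exactly why a one-at-a-time test stays green. With `checkInvariant(10)` all ten calls succeed while only 100 credits are spent, and `holds` comes back `false`.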

Mental model

Any time you read a value, compute something from it, then write it back — assume something else ran in the gap. Either push the check into the write itself, or hold a lock across both.
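Pushing the check into the write might look like the sketch below. The `balances` map and the helpers `debitIfEnough` and `redeemAtomic` are hypothetical; `debitIfEnough` stands in for one conditional UPDATE, and the absence of any `await` between its check and its write is what models the database executing that statement atomically.

```javascript
// In-memory stand-in for the credits column (hypothetical).
const balances = new Map([[1, 100]]);

// Zero-delay timer standing in for the trip to the database.
const dbTrip = () => new Promise((resolve) => setTimeout(resolve, 0));

// Mimics: UPDATE users SET credits = credits - ? WHERE id = ? AND credits >= ?
// Returns the number of rows affected (0 or 1).
async function debitIfEnough(id, amount) {
  await dbTrip();
  const credits = balances.get(id);
  if (credits === undefined || credits < amount) return 0;
  balances.set(id, credits - amount); // check and write in one uninterrupted step
  return 1;
}

async function redeemAtomic(id, amount) {
  const rowsAffected = await debitIfEnough(id, amount);
  return rowsAffected === 1; // 0 rows means the redemption was refused
}
```

Two concurrent `redeemAtomic(1, 100)` calls now produce one `true` and one `false`, because the second call's check runs against the balance the first call already wrote.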


Reader Discussion

7 replies

  1. Bảo Trần · 🇻🇳 Cần Thơ · Software Engineer · Story

    We once hit the classic two-row deadlock in our ledger. Ordering by account_id ASC before locking was a one-line commit, and it cut deadlock retries by 98% within the week. I'll always remember it because that PR merged while I was back in my hometown for Tết.

    May 05, 2025·3 days later
  2. Sofia Marquez · Backend Lead · Agrees

    immutability as default is the single most under-rated concurrency advice. every nightmare I've debugged in 8 years comes back to someone mutating shared state "just this once." make wrong things hard to express.

    May 06, 2025·4 days later
  3. Hiếu Nguyễn · Full Stack · Pushback

    tiny precision nit: volatile in Java provides visibility AND atomicity for single reads/writes, long and double included (it's the non-volatile long/double that may tear on some JVMs). worth being precise because juniors read "visibility primitive" and reach for AtomicInteger when volatile is enough for a simple flag. read-modify-write like count++ still needs the atomic.

    May 09, 2025·1 week later·edited
  4. Maya Iyer · Platform · From experience

    Go's race detector is criminally under-used. Caught a bug in our scheduler we'd been running past for 6 months — turned out our "thread-safe" map was thread-safe in the way a chair is bulletproof. -race in CI, no exceptions.

    May 07, 2025·5 days later
  5. Tomáš Havel · Senior Engineer · Agrees

    go channels solve a problem you don't have until you have it, and then they're the only thing that solves it. people reaching for sync.Mutex everywhere are usually one refactor away from a clean channel topology.

    May 08, 2025·6 days later
  6. Rachel Gold · Staff SRE · Agrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    May 05, 2025·3 days later
  7. Omar Khalil · Senior SWE · Kind words

    this is the third article from this blog I've sent to my team this month. you're cooking. don't switch to crypto.

    May 07, 2025·5 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email