ML
Database

The Nine-Hour Migration: Postgres, Locks, and Why I Now Test Drills in Prod-Shaped Replicas

A simple ALTER TABLE that should have been instant. A user_id column. A live application. Nine hours of downtime. Here's the post-mortem and the playbook.

December 22, 202511 min readDatabasePostgresPostmortem

The migration was three lines. I'm including all three because it's important to remember how short the bullet was that put us on the floor.

BEGIN;
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);
COMMIT;

It was 22:00 ICT. We were adding a column for a feature shipping the next morning. The orders table had 320 million rows. I hit run. Postgres said "acquiring lock." And then it sat there. And then it sat there. And then I started reading the Postgres docs about lock priority queues, which is never something you want to be reading at 22:13.

What I expected vs what happened

Adding a nullable column to a Postgres table without a default is — by the docs — a metadata-only operation. It rewrites the table catalog, sets a flag, done. Should take milliseconds.

What happens in practice on a busy production table is something different. The ALTER TABLE needs an ACCESS EXCLUSIVE lock. To get one, it has to wait for every existing transaction touching the table to commit or roll back. Including, importantly, every transaction queued behind a long-running transaction.

We had a long-running transaction. Buried in our analytics service was a query that calculated rolling 30-day order trends. It ran every 5 minutes, took roughly 90 seconds, and held a SHARE lock on orders for the duration.

The ALTER TABLE walked into the lock queue at 22:00:03. Sat behind the analytics SHARE lock. The next thing that walked into the queue was a perfectly normal customer order. That order — and every order after it — got blocked behind the ALTER TABLE. Postgres' lock queue is FIFO with respect to incompatible lock requests. We had a queue of customers behind a queue of one DDL behind one bad analytics query.

By 22:01, the connection pool was full. By 22:02, the API was returning 500s. The site was effectively down.

The fastest fix that wasn't (22:02 → 22:09)

I tried to abort the migration. Ctrl-C. pg_terminate_backend. Postgres reported "backend terminated." The lock queue did not unwind cleanly — for whatever reason, it took close to 7 minutes for the queue behind the migration to drain. We were getting paged from three different alerting systems by then.

This is the single most important thing to understand: cancelling the DDL is not instant. The lock queue has to unwind. If your DDL has a queue behind it, you've already incurred the outage.

The decision tree (22:09 → 22:31)

By 22:09 the queue had drained. The site was sort of back. Customer orders were flowing again. The migration had not run. We had a feature shipping in 9 hours that depended on this column. I had a ticking clock and three real choices.

Choice 1: SET lock_timeout = '2s'; and retry. If the ALTER can't grab the lock in 2s, fail. Try again. Eventually a window opens.

Choice 2: Use pg_repack or a logical-rewrite tool. Heavy hammer. Three days to plan. Not happening tonight.

Choice 3: Schedule for a maintenance window. Easy answer. Means the feature ships next week. Means I have to send the email.

I picked Choice 1.

The retry that worked (22:31 → 04:18)

SET lock_timeout = '2s';
DO $$
DECLARE
  done boolean := false;
BEGIN
  WHILE NOT done LOOP
    BEGIN
      ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);
      done := true;
    EXCEPTION
      WHEN lock_not_available THEN
        PERFORM pg_sleep(30);
    END;
  END LOOP;
END $$;

This script tries every 30 seconds, gives up after 2 seconds of waiting, and won't ever queue. The trade-off is you pay extra round-trips and you might wait a long time for a quiet enough window. We waited five hours and forty-seven minutes. The script finally caught a 1-second window between analytics runs at 04:18 and the migration completed in 12 milliseconds.

The feature shipped on time. I went home and slept until 11:00 the next day. My pull request that morning was titled "DDL safety wrapper — retry-with-timeout — never-again edition."

The five rules I codified after

  1. Always set lock_timeout for any DDL on a busy table. Default "wait forever" is the worst possible behaviour in a high-traffic system.
  2. Never run DDL inside an explicit BEGIN/COMMIT. The transaction holds the lock plus all the other locks it touches, longer than necessary.
  3. Identify long-running readers before you touch DDL. A 90-second analytics SHARE lock is plenty to wedge a deploy. We now use pg_stat_activity with a wait-event filter to verify the table is quiet before we run.
  4. Test DDL on a prod-shaped replica. Not a stage with 10k rows. A replica with a representative load. We use a logical replica we can detach for migration drills.
  5. Have a back-out plan that doesn't depend on cancellation being clean. If your roll-forward strategy is "hit ctrl-c if it gets stuck," you don't have a strategy. You have a wish.

The bigger lesson

The most dangerous Postgres operation isn't the one you'd expect. It's not VACUUM FULL, it's not UPDATE without a where clause, it's not even DROP TABLE. It's a perfectly innocent ALTER TABLE that the docs describe as instant, run on a table where someone, somewhere, has a long-held SHARE lock for analytics.

Every time I sit down to run a DDL on a busy Postgres table now, I run a checklist that takes three minutes. I do not run the DDL until the checklist clears. The checklist is exactly the five rules above. It saves me five hours of pacing every quarter.

If I had run that checklist at 21:55 ICT, this would not be a blog post. Which is, I suppose, the whole point.

SharePostLinkedIn

Reader Discussion

6 replies// weighed in

TopNewestAuthor
Add to the thread
Disagree, agree harder, or share your own experience…
Email instead →markdown okbe kind
  1. Highlighted by author
    Victor Petrov· Senior BackendAgrees

    "removing indexes is the third-year move" — saving this. first year you add, second year you tune, third year you realise half of them have been dead weight slowing every write since that one feature got cut.

    Dec 24, 2025·2 days later
  2. Greta Schäfer· Senior EngineerAgrees

    isolation levels are the part juniors skip and seniors keep underestimating. read committed feels safe; it's not — phantom reads still exist. the moment money is involved, default to repeatable read and benchmark.

    Dec 29, 2025·1 week later
  3. Aya Fujimoto· Database EngineerPushback

    small note on Postgres covering indexes — INCLUDE columns don't get the same treatment as key columns for HOT updates and dedup. people assume index-only scan == free, but the heap visibility map can still send you back to disk. measure first.

    Dec 26, 2025·4 days later
  4. Dan O'Connor· Eng ManagerStory

    N+1 cost us a P1 incident last Black Friday. ONE endpoint was firing 3,400 queries per cart load. Looked fine in dev (3 carts), looked fine in staging (50 carts), production hit 9k QPS to the DB and cardiac arrest. Add load tests.

    Dec 25, 2025·3 days later
  5. Linh Phạm· Java DeveloperAgrees

    @Version optimistic locking is the 80/20 of CRUD apps. genuinely think JPA should make it default-on with a @NoVersion opt-out. the number of last-write-wins races I've debugged in spring boot apps that didn't set it is too damn high

    Dec 27, 2025·5 days later
  6. Rachel Gold· Staff SREAgrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    Dec 25, 2025·3 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email