The Nine-Hour Migration: Postgres, Locks, and Why I Now Test Drills in Prod-Shaped Replicas
A simple ALTER TABLE that should have been instant. A user_id column. A live application. Nine hours of downtime. Here's the post-mortem and the playbook.
The migration was three lines. I'm including all three because it's important to remember how short the bullet was that put us on the floor.
BEGIN;
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);
COMMIT;
It was 22:00 ICT. We were adding a column for a feature shipping the next morning. The orders table had 320 million rows. I hit run. Postgres said "acquiring lock." And then it sat there. And then it sat there. And then I started reading the Postgres docs about lock priority queues, which is never something you want to be reading at 22:13.
What I expected vs what happened
Adding a nullable column to a Postgres table without a default is — by the docs — a metadata-only operation. It rewrites the table catalog, sets a flag, done. Should take milliseconds.
What happens in practice on a busy production table is something different. The ALTER TABLE needs an ACCESS EXCLUSIVE lock. To get one, it has to wait for every existing transaction touching the table to commit or roll back. Including, importantly, every transaction queued behind a long-running transaction.
We had a long-running transaction. Buried in our analytics service was a query that calculated rolling 30-day order trends. It ran every 5 minutes, took roughly 90 seconds, and held a SHARE lock on orders for the duration.
The ALTER TABLE walked into the lock queue at 22:00:03. Sat behind the analytics SHARE lock. The next thing that walked into the queue was a perfectly normal customer order. That order — and every order after it — got blocked behind the ALTER TABLE. Postgres' lock queue is FIFO with respect to incompatible lock requests. We had a queue of customers behind a queue of one DDL behind one bad analytics query.
By 22:01, the connection pool was full. By 22:02, the API was returning 500s. The site was effectively down.
The fastest fix that wasn't (22:02 → 22:09)
I tried to abort the migration. Ctrl-C. pg_terminate_backend. Postgres reported "backend terminated." The lock queue did not unwind cleanly — for whatever reason, it took close to 7 minutes for the queue behind the migration to drain. We were getting paged from three different alerting systems by then.
This is the single most important thing to understand: cancelling the DDL is not instant. The lock queue has to unwind. If your DDL has a queue behind it, you've already incurred the outage.
The decision tree (22:09 → 22:31)
By 22:09 the queue had drained. The site was sort of back. Customer orders were flowing again. The migration had not run. We had a feature shipping in 9 hours that depended on this column. I had a ticking clock and three real choices.
Choice 1: SET lock_timeout = '2s'; and retry. If the ALTER can't grab the lock in 2s, fail. Try again. Eventually a window opens.
Choice 2: Use pg_repack or a logical-rewrite tool. Heavy hammer. Three days to plan. Not happening tonight.
Choice 3: Schedule for a maintenance window. Easy answer. Means the feature ships next week. Means I have to send the email.
I picked Choice 1.
The retry that worked (22:31 → 04:18)
SET lock_timeout = '2s';
DO $$
DECLARE
done boolean := false;
BEGIN
WHILE NOT done LOOP
BEGIN
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);
done := true;
EXCEPTION
WHEN lock_not_available THEN
PERFORM pg_sleep(30);
END;
END LOOP;
END $$;
This script tries every 30 seconds, gives up after 2 seconds of waiting, and won't ever queue. The trade-off is you pay extra round-trips and you might wait a long time for a quiet enough window. We waited five hours and forty-seven minutes. The script finally caught a 1-second window between analytics runs at 04:18 and the migration completed in 12 milliseconds.
The feature shipped on time. I went home and slept until 11:00 the next day. My pull request that morning was titled "DDL safety wrapper — retry-with-timeout — never-again edition."
The five rules I codified after
- Always set
lock_timeoutfor any DDL on a busy table. Default "wait forever" is the worst possible behaviour in a high-traffic system. - Never run DDL inside an explicit BEGIN/COMMIT. The transaction holds the lock plus all the other locks it touches, longer than necessary.
- Identify long-running readers before you touch DDL. A 90-second analytics SHARE lock is plenty to wedge a deploy. We now use
pg_stat_activitywith a wait-event filter to verify the table is quiet before we run. - Test DDL on a prod-shaped replica. Not a stage with 10k rows. A replica with a representative load. We use a logical replica we can detach for migration drills.
- Have a back-out plan that doesn't depend on cancellation being clean. If your roll-forward strategy is "hit ctrl-c if it gets stuck," you don't have a strategy. You have a wish.
The bigger lesson
The most dangerous Postgres operation isn't the one you'd expect. It's not VACUUM FULL, it's not UPDATE without a where clause, it's not even DROP TABLE. It's a perfectly innocent ALTER TABLE that the docs describe as instant, run on a table where someone, somewhere, has a long-held SHARE lock for analytics.
Every time I sit down to run a DDL on a busy Postgres table now, I run a checklist that takes three minutes. I do not run the DDL until the checklist clears. The checklist is exactly the five rules above. It saves me five hours of pacing every quarter.
If I had run that checklist at 21:55 ICT, this would not be a blog post. Which is, I suppose, the whole point.