MVCC, Vacuum & Why Your Table Got 10× Bigger
Postgres never updates a row in place — it writes a new version and tombstones the old one. Understanding that is how you stop fearing VACUUM.
Most Postgres outages I have been paged for trace back to one of three letters: MVCC. Multi-Version Concurrency Control is what lets a hundred sessions read a table while one of them rewrites it without anybody taking a lock. It is also the reason a 2 GiB table can mysteriously balloon to 20 GiB overnight.
1. The promise: readers don't block writers
Every row in Postgres carries two hidden columns: xmin (the transaction that created it) and xmax (the transaction that deleted or superseded it). A read query at transaction X sees only rows where xmin ≤ X < xmax. That snapshot rule is the entire concurrency model.
SELECT xmin, xmax, id, balance FROM accounts WHERE id = 42;
-- xmin=10421 xmax=0 id=42 balance=100
2. The cost: dead tuples
When you UPDATE, Postgres does not edit the row. It writes a new row version with the new xmin and sets xmax on the old version. The old version sticks around until no live transaction could still see it.
That cemetery of old versions is what we call dead tuples. They cost disk, they cost cache, and they cost index lookups because every index still points at them.
3. VACUUM, autovacuum, and the bloat spiral
VACUUM reclaims space inside the table file. Crucially: it does not shrink the file back to the OS — it just marks the pages reusable. Only VACUUM FULL rewrites the table, and it takes an ACCESS EXCLUSIVE lock.
Autovacuum runs in the background. Its thresholds default to autovacuum_vacuum_scale_factor=0.2 — Postgres waits until 20% of the table is dead before triggering. On a 100M-row table that is 20M dead tuples accumulating before anything happens.
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_cost_limit = 2000
);
4. Long-running transactions are the silent killer
Autovacuum cannot remove a row version while any transaction could still need it. One forgotten BEGIN in a psql session — or one stuck connection from an analytics tool — and bloat accumulates until you kill the session.
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
ORDER BY xact_start;
Put a job on a five-minute cadence that flags any transaction older than your usual longest legitimate one. It pays for itself the first time it catches a hung migration.
5. Index bloat is a different problem
Indexes accumulate their own dead entries. VACUUM cleans them, but a heavily-churned index can still drift wider than the data warrants. REINDEX CONCURRENTLY rebuilds without taking the table offline (Postgres 12+).
Rules of thumb
- Tune autovacuum per hot table, not globally.
- Monitor
n_dead_tupfrompg_stat_user_tablesand alert when it crosses a percentage of live rows. - Treat any transaction older than 60 seconds as a paging event.
- Reach for
VACUUM FULLrarely, with a maintenance window.pg_repackis the online alternative.