
Write-Ahead Logging: How Postgres Survives a Crash

Every durable change in Postgres is written to the WAL before the data files. That one rule is what makes crash recovery, fsync tuning, and streaming replication all the same mechanism.

June 11, 2026 · 11 min read · Postgres, Durability

If you understand one internal mechanism in Postgres, make it the Write-Ahead Log (WAL). It is the reason a hard power-cut doesn't corrupt your database, the thing your synchronous_commit setting actually controls, and — surprisingly — the exact same stream that feeds replicas and point-in-time recovery. One log, three superpowers.

1. The core rule: log before data

When you UPDATE a row, Postgres does not immediately write the changed page back to the data file on disk. Writing random 8 KB pages to their scattered locations on every commit would be brutally slow, and a crash mid-write would leave the file half-updated. Instead Postgres modifies the page in its in-memory buffer (now a "dirty" page) and records a small, sequential description of the change in the WAL.

The invariant that makes this safe is the write-ahead rule: the WAL record describing a change must reach durable storage before the data page it describes is allowed to be written back. The WAL is append-only and sequential, so flushing it is fast. The data files get updated lazily, in the background, in bulk.

UPDATE ...; COMMIT
  1. change page in shared_buffers        (memory, fast)
  2. append WAL records                   (in-memory WAL buffer)
  3. flush the WAL up to the commit record (durable!)  <-- COMMIT returns here
  4. ...much later, a checkpoint (or the background writer) flushes the dirty data page

The crucial consequence: at the moment COMMIT returns, your change is durable in the WAL — even though the data file still holds the old row. Durability comes from the log, not the table.
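The ordering above can be sketched as a toy model. This is an illustration under simplifying assumptions, not how Postgres is implemented: ToyDatabase and all of its fields and methods are hypothetical names, LSNs are plain counters, and a too-early page write is refused outright (real Postgres would flush the WAL first and then write the page).

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatabase:
    """Toy model of the write-ahead rule; every name here is hypothetical."""
    wal_buffer: list = field(default_factory=list)   # WAL records still in memory
    wal_disk: list = field(default_factory=list)     # WAL records on durable storage
    buffers: dict = field(default_factory=dict)      # page_id -> (value, page_lsn)
    data_file: dict = field(default_factory=dict)    # page_id -> value on disk
    next_lsn: int = 1

    def update(self, page_id, value):
        """Steps 1-2: change the page in memory and append a WAL record."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.wal_buffer.append((lsn, page_id, value))
        self.buffers[page_id] = (value, lsn)
        return lsn

    def flushed_lsn(self):
        """Highest LSN that has reached durable storage (0 if none)."""
        return self.wal_disk[-1][0] if self.wal_disk else 0

    def commit(self):
        """Step 3: force all buffered WAL to durable storage, then return."""
        self.wal_disk.extend(self.wal_buffer)
        self.wal_buffer.clear()
        return self.flushed_lsn()

    def write_back(self, page_id):
        """Step 4: a data page may go to disk only once its WAL is durable."""
        value, page_lsn = self.buffers[page_id]
        if page_lsn > self.flushed_lsn():
            raise RuntimeError("write-ahead rule violated: flush WAL first")
        self.data_file[page_id] = value


db = ToyDatabase()
db.update("page-7", "new row")
try:
    db.write_back("page-7")        # WAL not flushed yet, so this is refused
    refused = False
except RuntimeError:
    refused = True
db.commit()                        # WAL is now durable; data file still old
durable_before_write_back = ("page-7" not in db.data_file) and db.flushed_lsn() == 1
db.write_back("page-7")            # the lazy write-back is now allowed
```

The point to notice is the state right after commit(): the change exists only in wal_disk, and the data file has not been touched.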

2. Crash recovery: replaying the log

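Now the power dies. The data files are stale (the dirty page never made it out), but the WAL has the committed change. On restart Postgres runs recovery: it finds the last checkpoint and replays every WAL record after it, re-applying changes to the data pages. Committed transactions are reconstructed. Changes made by transactions that were in flight at the crash are replayed too, but no commit record exists for them, so they are never marked committed: their row versions stay invisible to every query and are eventually cleaned up by VACUUM. From the application's point of view, those transactions never happened.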

This is the same idea as the redo log in other databases, and it resembles the redo pass of ARIES-style recovery (Postgres needs no undo pass, because changes from uncommitted transactions are never visible in the first place). The recovery is idempotent: every page carries an LSN (Log Sequence Number, a byte position in the WAL marking the last record that changed the page), so replay can tell whether a given page already reflects a given WAL record and skips it if so. You can crash during recovery and restart recovery safely.
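A minimal sketch of why the page LSN makes replay idempotent. The replay function and the tuple layouts are assumptions made for illustration; LSNs are small integers here rather than real WAL positions.

```python
def replay(wal, pages):
    """Apply WAL records in order, skipping ones a page already reflects.

    wal:   list of (lsn, page_id, value) in increasing LSN order
    pages: dict page_id -> (value, page_lsn), modified in place
    Returns the number of records actually applied.
    """
    applied = 0
    for lsn, page_id, value in wal:
        _, page_lsn = pages.get(page_id, (None, 0))
        if page_lsn >= lsn:
            continue                # page already contains this change
        pages[page_id] = (value, lsn)
        applied += 1
    return applied


wal = [(1, "A", "a1"), (2, "B", "b1"), (3, "A", "a2")]
# Page A reached disk after record 1; page B never did.
pages = {"A": ("a1", 1)}

first_pass = replay(wal, pages)    # applies records 2 and 3
second_pass = replay(wal, pages)   # a repeated recovery applies nothing
```

Running replay a second time changes nothing, which is the property that makes a crash during recovery harmless.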

3. Checkpoints: bounding how much to replay

If recovery had to replay all WAL since the beginning of time, startup would take hours and WAL would grow forever. A checkpoint is the bound: it flushes all currently-dirty pages to the data files and writes a checkpoint record saying "everything before this LSN is safely on disk." Recovery only ever needs to start from the last checkpoint.

# postgresql.conf: checkpoints are driven mainly by these
max_wal_size = 1GB                  # soft limit: checkpoint when about this much WAL accumulates
checkpoint_timeout = 5min           # ...or after this long, whichever comes first
checkpoint_completion_target = 0.9  # spread the flush over 90% of the interval

There's a tension to tune. Frequent checkpoints mean fast recovery but more write amplification (the same hot page gets flushed again and again). Rare checkpoints mean less I/O but longer crash recovery and more WAL on disk. If you see periodic latency spikes that line up with checkpoints, raising max_wal_size and keeping checkpoint_completion_target near 0.9 (so the flush is spread out rather than dumped at once) is the usual fix.
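To get a feel for the trade-off, here is a rough back-of-the-envelope estimate. It assumes a steady WAL rate and that a checkpoint starts exactly when max_wal_size worth of WAL has accumulated; in practice Postgres starts checkpoints earlier than that to stay under the limit, so treat the numbers as an upper bound on the interval. Both function names are hypothetical.

```python
def checkpoint_interval_s(wal_mb_per_s, max_wal_size_mb=1024, checkpoint_timeout_s=300):
    """Estimated seconds between checkpoints under a steady WAL rate.

    Simplification: a checkpoint starts when max_wal_size worth of WAL has
    accumulated, or at checkpoint_timeout, whichever comes first.
    """
    if wal_mb_per_s <= 0:
        return float(checkpoint_timeout_s)
    return min(float(checkpoint_timeout_s), max_wal_size_mb / wal_mb_per_s)


def flush_window_s(interval_s, completion_target=0.9):
    """Time over which one checkpoint's page writes are spread."""
    return interval_s * completion_target


quiet = checkpoint_interval_s(1)                 # 1 MB/s: the timeout wins
busy = checkpoint_interval_s(16)                 # 16 MB/s: WAL volume wins
busy_bigger = checkpoint_interval_s(16, max_wal_size_mb=4096)
```

With these assumed rates, the busy server checkpoints roughly once a minute at the default 1GB, and raising max_wal_size to 4GB stretches that to a little over four minutes, which is the effect the tuning advice above is after.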

4. The durability knob: synchronous_commit and fsync

"Durable in the WAL" means the WAL was fsync'd — physically forced to stable storage, not just handed to the OS page cache. That fsync at commit time is the single biggest cost of durability, and Postgres exposes exactly how much of it you want:

  • synchronous_commit = on (default) — commit waits until its WAL record is flushed to disk. A returned COMMIT survives a crash. Safe.
  • synchronous_commit = off: commit returns as soon as the record is in the WAL buffer; the flush happens a moment later (at most about three times wal_writer_delay, which defaults to 200 ms). A crash can lose the last few hundred milliseconds of committed transactions. It cannot corrupt the database, though, because the write-ahead ordering is still respected: what survives is a consistent state as of a slightly earlier moment. You trade a tiny window of durability for a big latency win on commit-heavy workloads.
  • fsync = off — never do this in production. It tells Postgres to skip the flushes entirely; a crash can leave the data files genuinely corrupt and unrecoverable. It exists only for throwaway bulk-load scenarios you can recreate from scratch.

The distinction worth internalizing: synchronous_commit = off risks losing recent commits; fsync = off risks losing the whole database. They are not the same gamble.
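The loss window for synchronous_commit = off can be illustrated with a toy timeline. This is a sketch under stated assumptions: flush_every_ms is a stand-in for the WAL writer's schedule rather than a Postgres setting, the flush is treated as instantaneous, and lost_commits is a hypothetical helper.

```python
def lost_commits(commit_times_ms, crash_ms, flush_every_ms):
    """Commits acknowledged before the crash whose WAL was never flushed.

    Toy model: a background flush runs at t = flush_every_ms,
    2 * flush_every_ms, ... and makes every earlier commit durable.
    """
    last_flush_ms = (crash_ms // flush_every_ms) * flush_every_ms
    return [t for t in commit_times_ms if last_flush_ms < t <= crash_ms]


commits = [50, 180, 210, 390, 410, 590]
# Crash at t=595 ms; the last background flush ran at t=400 ms.
lost = lost_commits(commits, crash_ms=595, flush_every_ms=200)
survived = [t for t in commits if t not in lost]
```

The surviving commits are always a prefix of the commit order, which is why the database stays consistent: it looks as if the crash had happened slightly earlier. With synchronous_commit = on the lost list would be empty by construction, since COMMIT does not return before the flush.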

5. The same log drives replication

Here's the elegant part. A replica doesn't need a separate mechanism to stay in sync — it just receives the primary's WAL stream and runs the same recovery replay continuously, never finishing. This is streaming replication, and it's why it's so robust: the replica is applying the identical, byte-level redo records the primary used for its own durability.

primary:  generate WAL --> walsender --> (network) -->
replica:  walreceiver --> replay WAL --> serve read queries

synchronous_commit extends here too. With a standby listed in synchronous_standby_names, on makes the primary's commit wait until that standby has flushed the WAL to its disk, and remote_apply waits until the standby has also replayed it. A failover to that standby then loses no acknowledged commits, at the cost of commit latency bound to the network. The same WAL files, when archived (archive_mode = on plus a working archive_command), also give you point-in-time recovery: restore a base backup, then replay archived WAL up to any chosen moment. Crash recovery, replication, and PITR are three readings of one log.
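A small sketch of the "one log, several readers" idea. It assumes an empty starting state in place of a base backup and integer LSNs; replay_until is a hypothetical helper, not a Postgres function.

```python
def replay_until(wal, target_lsn=None):
    """Rebuild page contents from WAL, optionally stopping at target_lsn.

    wal: list of (lsn, page_id, value) in increasing LSN order.
    """
    pages = {}
    for lsn, page_id, value in wal:
        if target_lsn is not None and lsn > target_lsn:
            break                   # PITR: stop at the chosen recovery point
        pages[page_id] = value
    return pages


primary_wal = [
    (1, "accounts", "balance=100"),
    (2, "orders", "order#1"),
    (3, "accounts", "balance=0"),   # say this was the mistaken update
]

standby = replay_until(primary_wal)              # replica: replay everything
before_mistake = replay_until(primary_wal, 2)    # PITR: stop before LSN 3
```

The standby and the point-in-time restore run the same loop over the same records; the only difference is where replay stops.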

6. Torn pages and full_page_writes

One subtlety that surprises people. Postgres pages are 8 KB but the OS/disk writes in smaller sectors (often 4 KB). A crash mid-write can leave a page half-old, half-new — a torn page — which a normal incremental WAL record can't repair, because it assumes the rest of the page was intact. Postgres defends against this with full_page_writes (on by default): the first time a page is modified after each checkpoint, the entire page image is written into the WAL, not just the delta. Recovery can then restore the whole page regardless of tearing.

This is also why WAL volume spikes right after a checkpoint and why a flood of checkpoints inflates WAL: every newly-touched page pays the full-page-image tax once per checkpoint cycle. It's a real cost, but turning full_page_writes off is only safe on storage that guarantees atomic 8 KB writes — most don't, so leave it on.
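The full-page-image tax is easy to estimate with a toy counter. The sketch below assumes a fixed 100-byte delta record and counts a full-page image as exactly 8192 bytes; real records carry headers and Postgres can shrink page images, so only the ratio is meaningful. wal_bytes and DELTA_SIZE are hypothetical names.

```python
PAGE_SIZE = 8192     # bytes in a full-page image (default Postgres page size)
DELTA_SIZE = 100     # assumed size of an ordinary change record, for illustration


def wal_bytes(events):
    """Total WAL bytes for a stream of page touches and checkpoints.

    events: page ids interleaved with the string "CHECKPOINT".
    The first touch of a page after a checkpoint logs a full-page image;
    later touches in the same cycle log only a small delta.
    """
    touched_since_checkpoint = set()
    total = 0
    for event in events:
        if event == "CHECKPOINT":
            touched_since_checkpoint.clear()
        elif event in touched_since_checkpoint:
            total += DELTA_SIZE
        else:
            touched_since_checkpoint.add(event)
            total += PAGE_SIZE
    return total


# Six updates to the same hot page within one checkpoint cycle.
one_cycle = wal_bytes([1] * 6)
# The same six updates, but with a checkpoint after every second one.
three_cycles = wal_bytes([1, 1, "CHECKPOINT", 1, 1, "CHECKPOINT", 1, 1])
```

Under these assumptions the same six updates produce nearly three times as much WAL once checkpoints land between them, which is the inflation described above.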

7. Operational gotchas

  • WAL not being recycled. If archive_command fails, or a replication slot belongs to a replica that's gone, Postgres keeps WAL it can't yet release. pg_wal fills the disk and the server stops. Monitor pg_wal size and drop orphaned replication slots.
  • A replica falling behind. If a standby can't keep up, the primary may remove WAL the replica still needs (unless a slot pins it). wal_keep_size and replication slots control that retention — slots are safer but can pin WAL indefinitely if the replica dies.
  • Bulk loads. Each commit forces a flush, so loading a million rows in a million transactions is dominated by fsync. Batch into larger transactions, or use COPY, and the per-row WAL flush cost largely disappears.
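For the first two gotchas, the number worth watching is how much WAL the oldest replication slot is pinning. An LSN is printed as two hexadecimal numbers separated by a slash (the high and low 32 bits of a byte position), so the arithmetic is simple. The sketch below uses made-up values of the kind you would read from pg_current_wal_lsn() and the restart_lsn column of pg_replication_slots; the helper names and slot names are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, slot_restart_lsns):
    """WAL bytes the oldest slot forces the server to keep (0 if no slots)."""
    lsns = list(slot_restart_lsns)
    if not lsns:
        return 0
    oldest = min(lsn_to_int(lsn) for lsn in lsns)
    return max(0, lsn_to_int(current_lsn) - oldest)


# Hypothetical values for illustration only.
current = "2/10000000"
slots = {"standby_a": "2/0F000000", "abandoned_slot": "1/00000000"}

pinned = retained_wal_bytes(current, slots.values())
worst_slot = min(slots, key=lambda name: lsn_to_int(slots[name]))
```

Here the abandoned slot pins a little over 4 GiB of WAL while the healthy standby is only 16 MiB behind, which is the pattern to alert on before pg_wal fills the disk.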

Rules of thumb

  • Durability lives in the WAL, not the table. A returned COMMIT is safe because its WAL record was flushed, even though the data page is written later.
  • Checkpoints bound recovery time and WAL size. Tune max_wal_size / checkpoint_completion_target if checkpoint-aligned latency spikes appear; don't checkpoint so rarely that recovery crawls.
  • synchronous_commit = off trades a few hundred ms of recent commits for big commit-latency wins — and never corrupts. fsync = off risks the entire database; don't.
  • Replication and PITR are WAL replay. The same redo stream that survives crashes feeds standbys and point-in-time recovery — learn it once, get all three.
  • Leave full_page_writes on unless your storage truly guarantees atomic 8 KB writes; it's what protects you from torn pages.
  • Watch pg_wal disk usage. Failed archiving or orphaned replication slots are the classic way a healthy server runs itself out of disk.

Reader Discussion

  1. Léa Dubois · SRE · Asks

    any chance you'd publish these as a PDF collection? would love to print and read offline on flights. screen-fatigue is real.

    Jun 17, 2026 · 6 days later
  2. Ahmed Rahman · Full Stack · Kind words

    concise + opinionated = my favourite kind of engineering post. so many blogs hedge every claim into mush. give me the spicy take with the receipts. more please.

    Jun 12, 2026 · 1 day later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.
