The Outbox Pattern: Atomic DB and Queue Writes
You can't atomically commit to Postgres and Kafka. The outbox pattern is how senior backends finesse that without losing events.
Sooner or later every backend hits this shape: a request updates the database and publishes a message. UPDATE orders SET status='paid', then kafka.send(...). Two different systems, one logical operation. Half-failures are a matter of when.
1. Why distributed transactions are not the answer
XA / two-phase commit between Postgres and Kafka is theoretically possible, practically nightmarish. Kafka does not really speak XA; the coordinator becomes a single point of failure; the latency is awful. Nobody does this in production.
2. The pattern
Instead of writing to two systems, write to one — the database — and have a separate process forward the event.
- In the same transaction as the business write, insert a row into an
outboxtable describing the event. - A relay reads new outbox rows and publishes them to Kafka.
- The relay marks rows as published (or deletes them).
BEGIN;
UPDATE orders SET status='paid' WHERE id=42;
INSERT INTO outbox (id, aggregate_id, event_type, payload, created_at)
VALUES (gen_random_uuid(), 42, 'OrderPaid', $1::jsonb, now());
COMMIT;
The business write and the event are now atomic — they either both commit or neither does. The "publish to Kafka" step is no longer in the critical path.
3. The relay
Two options, both real:
3.1 Polling relay
A worker periodically selects unpublished rows and sends them. Simple, robust, slightly laggy.
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;
FOR UPDATE SKIP LOCKED is the magic — multiple relay workers can run safely without stepping on each other.
3.2 Change Data Capture relay
Use Debezium (or pg_replication) to tail the Postgres write-ahead log and publish outbox inserts to Kafka. No polling latency, no application code, but you now own a CDC pipeline.
4. Exactly-once is still not free
The relay might publish a message, crash before marking it published, and re-publish on restart. That is at-least-once. Consumers must be idempotent — usually by deduping on the outbox row ID.
Combine outbox with Kafka transactional producers and you can promise consumers exactly-once delivery, but they still need idempotent processing if anything downstream of Kafka is non-transactional.
5. The shape of the row
Schema-wise, two flavours are common:
- Envelope-only — outbox stores event metadata + ID, the actual payload is re-read from the aggregate at publish time. Smaller, but the aggregate must be unchanged or your event reflects a later state.
- Full payload — store the serialised event in the outbox row. Safer (point-in-time snapshot) and what I recommend by default.
6. Cleanup
The outbox table will grow forever if you let it. Either delete on publish (acceptable — you lose the audit trail in the DB but Kafka has the event) or move to an archive table on a TTL.
What this gives you
- Atomic business write + event emission.
- No lost events, even on crash.
- Independence from Kafka availability — the database absorbs the burst, the relay drains it later.
- A natural audit trail.
What you still owe consumers
Idempotency. Always. The outbox solves "did the event happen?" — it does not solve "did the consumer process it exactly once?" That is a separate piece of homework, usually a processed_events table on the consumer side.