SLOs and Error Budgets: Alerting on What Users Actually Feel
Alerting on CPU and disk pages people for things users never notice. SLOs flip it around: define what 'working' means to a user, then spend an error budget against it.
The fastest way to make on-call miserable is to alert on causes — CPU, memory, disk, queue depth. Most of those fire when nothing user-visible is wrong, and they miss the outages that don't move a host metric. SLOs reframe alerting around symptoms: did the user get a correct response, fast enough?
1. SLI, SLO, SLA
- SLI: a Service Level Indicator, a measured ratio of good events to total events. For example, "requests served in <300ms with a 2xx/3xx" ÷ "all valid requests".
- SLO: the Service Level Objective, the target for that SLI, e.g. 99.9% over 28 days.
- SLA: the Service Level Agreement, the contractual version with penalties. Your SLO should be stricter than your SLA, so you find out you're in trouble before you owe anyone money.
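To make the three terms concrete, here is a minimal Python sketch of the example SLI. The `Request` record, the helper names, and the choice to exclude 4xx responses from "valid requests" are assumptions for illustration; your own definition of "valid" and "good" may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request, in milliseconds


def is_valid(req: Request) -> bool:
    # Assumption: 4xx responses are client mistakes, so they are left out
    # of the denominator and count neither for nor against the service.
    return not (400 <= req.status < 500)


def is_good(req: Request, threshold_ms: float = 300.0) -> bool:
    # "Good" as in the example SLI: a 2xx/3xx served in under 300 ms.
    return 200 <= req.status < 400 and req.latency_ms < threshold_ms


def sli(requests: list[Request]) -> float:
    """Ratio of good requests to valid requests."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0  # no valid traffic: nothing failed
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid)


def meets_slo(sli_value: float, target: float = 0.999) -> bool:
    """The SLO is just a target the SLI is compared against."""
    return sli_value >= target
```

A slow 200 and a fast 500 both count as bad here, which is the point: the indicator tracks what the user experienced, not which component misbehaved.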
2. Good SLIs are user-journey ratios
Pick indicators a user would recognise as "it worked": availability (success ratio), latency (fast-enough ratio), and sometimes correctness or freshness. Measure them as close to the user as you can — at the load balancer or client, not deep inside one service where you'll miss failures in the layers above.
3. The error budget
A 99.9% SLO over 28 days is an explicit budget to fail 0.1% of the time: about 40 minutes of full outage per 28-day window. That budget is the most useful number in the whole framework. It turns "is reliability good enough?" into arithmetic, and it aligns engineering and product. Budget left over? Ship faster, take risks. Budget burned? Freeze features and fix reliability.
budget = (1 - 0.999) × 28 d × 24 h × 60 min ≈ 40 min per 28-day window
burned = minutes in the window the SLI spent below target
remaining = budget - burned
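The same arithmetic as a small Python sketch. It shows the time-based budget, plus an event-based variant for ratio SLIs where counting requests is more natural than counting minutes. The function names and the event-based formulation are assumptions added for illustration.

```python
def budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent, counted in events.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo) * total  # failures the SLO permits
    actual_bad = total - good          # failures that actually happened
    return 1.0 - actual_bad / allowed_bad
```

For example, `budget_minutes(0.999)` comes to roughly 40.3 minutes, and 500 failures out of 1,000,000 requests against a 99.9% SLO leaves about half the budget.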
4. Alert on burn rate, not on every dip
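Don't page when the SLI blips. Page when you're burning the budget fast enough to run out. Burn rate is the observed error rate divided by the error rate the SLO allows, so 1× spends exactly the whole budget over the window. Multi-window, multi-burn-rate alerts catch both the sudden outage and the slow bleed without drowning you in noise.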
Fast burn: 14.4× rate over 1h → page now (budget gone in ~2 days)
Slow burn: 3× rate over 6h → ticket, look today (budget gone in ~9 days)
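A minimal Python sketch of how those two rules could be evaluated. The thresholds and long windows are the ones listed above. The short confirmation windows (5m and 30m, one twelfth of the long window), the `Rule` type, and the `evaluate` function are assumptions added for illustration. Requiring the short window to agree is one common way to stop an alert firing long after the problem has ended.

```python
from dataclasses import dataclass
from typing import Optional

SLO = 0.999
WINDOW_DAYS = 28


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    return float("inf") if rate <= 0 else window_days / rate


@dataclass(frozen=True)
class Rule:
    name: str
    threshold: float   # burn-rate multiple that triggers the rule
    long_window: str   # establishes that the burn is significant
    short_window: str  # assumed confirmation window: is it still happening?
    action: str


# Short windows are an illustrative assumption, not taken from the text.
RULES = [
    Rule("fast burn", 14.4, "1h", "5m", "page"),
    Rule("slow burn", 3.0, "6h", "30m", "ticket"),
]


def evaluate(error_ratios: dict[str, float], slo: float = SLO) -> Optional[str]:
    """Return the action of the first rule whose two windows both exceed it.

    error_ratios maps a window label such as "1h" to the fraction of bad
    events observed over that window.
    """
    for rule in RULES:
        long_rate = burn_rate(error_ratios[rule.long_window], slo)
        short_rate = burn_rate(error_ratios[rule.short_window], slo)
        if long_rate >= rule.threshold and short_rate >= rule.threshold:
            return rule.action
    return None
```

With a 28-day window, a sustained 14.4× burn exhausts the budget in 28 / 14.4 ≈ 1.9 days, which is where the "~2 days" figure comes from. In practice the per-window error ratios would come from your metrics system rather than a hand-built dict.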
5. Don't chase 100%
As a rule of thumb, every extra nine costs roughly 10× more, and past a certain point it buys reliability users can't perceive: their phone, wifi, and your upstreams are often already less reliable than 99.99%. Set the SLO at the point where users stop caring, not at the limit of what's technically possible.
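A rough Python sketch of why the extra nine gets lost. It treats the user's path and your service as components in series with independent failures, so availabilities multiply. The 99.9% figure for the client path (device, wifi, ISP) is a hypothetical number chosen for illustration, not a measurement.

```python
from math import prod


def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    return prod(availabilities)


# Hypothetical availability of everything between the user and your edge.
CLIENT_PATH = 0.999

for service in (0.999, 0.9999, 0.99999):
    perceived = end_to_end(CLIENT_PATH, service)
    print(f"service {service:.3%} -> user sees {perceived:.3%}")
```

Under these assumptions, moving the service from four nines to five changes what the user sees by less than 0.01 percentage points, because the client path dominates.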
Rules of thumb
- Alert on symptoms (success ratio, latency), not causes (CPU, disk).
- Measure SLIs as close to the user as possible.
- Page on burn rate with multiple windows; everything else is a ticket.
- An untouched error budget means you're shipping too slowly, not that you're winning.