Kibana Alerting: Rules, Connectors, and the Noise Problem
Alerts are only useful if they wake people up for real problems. Here's how to build that with Kibana's rules engine.
Kibana Alerting (the rules engine built into Kibana, separate from Elasticsearch Watcher) is two pieces: rules that evaluate on a schedule and connectors that deliver actions. Every post-incident "why did we miss that?" meeting comes down to one of the two.
Rule types you will actually use
- Elasticsearch query — threshold over a KQL/Lucene search. 90% of alerts.
- Index threshold — an aggregation-shaped version: "avg(latency) > 500 over 5m."
- Metric threshold / Inventory — for Metrics/Infrastructure apps.
- Log threshold — pattern matching on log fields with group-by.
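As a rough illustration of the index threshold example above ("avg(latency) > 500 over 5m"), here is a sketch of what a rule definition might look like as a request body. The field names follow my understanding of Kibana's create rule API and should be checked against the docs for your version; the rule name, index pattern, and the helper `index_threshold_rule` are hypothetical.

```python
import json


def index_threshold_rule(name, index, agg_field, threshold,
                         window_minutes=5, check_every="1m"):
    """Build a request body for an index threshold rule.

    Field names are assumed from the create rule API; verify them
    against the documentation for your Kibana version.
    """
    return {
        "name": name,
        "rule_type_id": ".index-threshold",
        "consumer": "alerts",
        "schedule": {"interval": check_every},  # how often the rule runs
        "params": {
            "index": [index],
            "timeField": "@timestamp",
            "aggType": "avg",
            "aggField": agg_field,
            "groupBy": "all",                   # no grouping in this sketch
            "timeWindowSize": window_minutes,   # look-back window
            "timeWindowUnit": "m",
            "thresholdComparator": ">",
            "threshold": [threshold],
        },
        "actions": [],                          # connectors get attached here
    }


# Hypothetical rule name, index pattern, and field.
body = index_threshold_rule("API latency high", "metrics-api-*", "latency", 500)
print(json.dumps(body, indent=2))
```

The point is less the exact payload than the shape: a schedule, a window, an aggregation, and a comparator are all there is to most rules.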
The three knobs that kill noise
- Look-back vs check-every. Check every minute over a 5-minute window. Don't check every 5 minutes — you'll miss 1-minute spikes.
- Group by high-cardinality dimensions. An alert "API latency > 500ms" fires every time one endpoint regresses. Group by service.name and endpoint so you get one page per root cause.
- Throttle and recovery. Set an action frequency (notify on status change) and a recovery action. Otherwise a flapping metric spams chat.
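To make the grouping and notify-on-status-change knobs concrete, here is a small self-contained simulation. It does not call Kibana; it only mimics the behaviour described above. The sample data and the function names `breaching_groups` and `status_change_messages` are made up for illustration.

```python
from collections import defaultdict

THRESHOLD_MS = 500  # same threshold as the "API latency > 500ms" example


def breaching_groups(samples, group_keys):
    """Return the groups whose average latency exceeds the threshold.

    An empty group_keys tuple puts every sample into one global group.
    """
    buckets = defaultdict(list)
    for sample in samples:
        key = tuple(sample[k] for k in group_keys)
        buckets[key].append(sample["latency_ms"])
    return {key for key, vals in buckets.items()
            if sum(vals) / len(vals) > THRESHOLD_MS}


def status_change_messages(breach_history):
    """Emit one message per status change instead of one per check."""
    messages = []
    active = False
    for tick, breached in enumerate(breach_history):
        if breached and not active:
            messages.append((tick, "alert"))
        elif not breached and active:
            messages.append((tick, "recovered"))
        active = breached
    return messages


# Hypothetical samples from one look-back window.
samples = [
    {"service.name": "checkout", "endpoint": "/pay", "latency_ms": 2000},
    {"service.name": "checkout", "endpoint": "/cart", "latency_ms": 100},
    {"service.name": "search", "endpoint": "/q", "latency_ms": 100},
]

ungrouped = breaching_groups(samples, ())
grouped = breaching_groups(samples, ("service.name", "endpoint"))

# One check per minute: breached for five checks in a row, then recovered.
history = [False, True, True, True, True, True, False]
messages = status_change_messages(history)
```

Ungrouped, the rule breaches but cannot say where; grouped, the only breaching key is checkout and /pay. With notify on status change, five consecutive breached checks produce two messages (one alert, one recovery) rather than five.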
Connector discipline
Keep two connectors, one per urgency tier: on-call (PagerDuty, Opsgenie) and fyi (Slack channel). Route severity-1 rules to on-call and everything else to fyi. A single connector for all alerts is how alert fatigue gets entrenched.
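If rules are generated from code, the routing policy above can live in one small function so nobody picks a connector by hand. This is a minimal sketch; the connector IDs are hypothetical placeholders, and real ones come from your own Kibana connector list.

```python
# Hypothetical connector IDs, for illustration only.
CONNECTORS = {
    "on-call": "pagerduty-oncall",
    "fyi": "slack-alerts-fyi",
}


def connector_for(severity: int) -> str:
    """Severity 1 pages on-call; everything else goes to the fyi channel."""
    return CONNECTORS["on-call"] if severity == 1 else CONNECTORS["fyi"]


print(connector_for(1))
print(connector_for(3))
```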
Rule as code
The UI is fine for the first dozen rules. Past that, export and commit:
GET /api/alerting/rules/_find?per_page=1000
# save to repo, diff on PR, re-import with the import API
Rules are saved objects. Treat them like code.
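One way to script the export step is sketched below, using only the Python standard library. It assumes the find endpoint is /api/alerting/rules/_find, that the response carries the rules under a "data" key, and that an API key is used for auth. The list of volatile fields is a guess at what changes on every run and should be adjusted for your version; `fetch_rules`, `slug`, and `write_rules` are names invented for this example.

```python
import json
import re
import urllib.request
from pathlib import Path

# Assumed runtime-only fields; dropping them keeps PR diffs quiet.
VOLATILE_FIELDS = ("execution_status", "updated_at", "last_run",
                   "next_run", "monitoring")


def fetch_rules(kibana_url, api_key, per_page=1000):
    """Fetch one page of rules from the find endpoint (path assumed)."""
    req = urllib.request.Request(
        f"{kibana_url}/api/alerting/rules/_find?per_page={per_page}",
        headers={"Authorization": f"ApiKey {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]


def slug(name):
    """Turn a rule name into a filesystem-friendly slug."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def write_rules(rules, out_dir):
    """Write one sorted, stable JSON file per rule and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for rule in rules:
        stable = {k: v for k, v in rule.items() if k not in VOLATILE_FIELDS}
        path = out / f"{slug(rule['name'])}-{rule['id']}.json"
        path.write_text(json.dumps(stable, indent=2, sort_keys=True) + "\n")
        paths.append(path)
    return paths
```

A typical use would be `write_rules(fetch_rules(url, key), "rules/")` in a scheduled job, followed by a commit. One file per rule with sorted keys means a PR shows exactly which rule changed and how.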
Checklist before you ship an alert
- Is there a runbook link in the alert body?
- Does it include the query that triggered it?
- Does a test document create one alert, not 50?
- Is there a recovery notification?
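Parts of this checklist can be checked mechanically once rules live in a repo. The sketch below covers two items, the runbook link and the recovery notification. It assumes each rule is a dict whose actions carry a "group" and a "params.message", and that the recovery action group is named "recovered"; both are assumptions to verify against your exported rules, and `checklist_problems` is a hypothetical helper.

```python
def checklist_problems(rule: dict) -> list:
    """Return a list of checklist violations for one rule definition."""
    problems = []
    actions = rule.get("actions", [])
    messages = [str(a.get("params", {}).get("message", "")) for a in actions]

    # Crude heuristic: some message mentions a runbook and contains a URL.
    if not any("runbook" in m.lower() and "http" in m.lower() for m in messages):
        problems.append("no runbook link in any action message")

    # Assumed: recovery actions use the action group "recovered".
    if not any(a.get("group") == "recovered" for a in actions):
        problems.append("no recovery action")

    return problems


# Hypothetical rule with an alert action but no recovery action.
rule = {
    "name": "API latency high",
    "actions": [
        {"group": "threshold met",
         "params": {"message": "Latency high. Runbook: https://example.com/runbooks/latency"}},
    ],
}
print(checklist_problems(rule))
```

Run over every exported rule in CI, a check like this turns the checklist from a habit into a gate.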