Aggregations Deep Dive: Bucket, Metric, and the Pitfalls
Aggregations are the analytics engine behind Kibana. Know how terms, date_histogram, and cardinality actually execute.
Aggregations turn an Elasticsearch cluster into a sharded analytics engine. They're also where most "why is Kibana slow?" investigations end.
The two families
- Bucket aggregations group documents: terms, date_histogram, range, composite.
- Metric aggregations compute over a bucket: avg, percentiles, cardinality.
You nest them: bucket by hour, then compute p95 latency per bucket.
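The nesting described above can be sketched as a request body built in Python. This is a minimal sketch: the field names `@timestamp` and `latency_ms` and the aggregation names `per_hour` and `latency_p95` are assumptions for illustration, not taken from a real mapping.

```python
import json


def hourly_p95_body(time_field="@timestamp", latency_field="latency_ms"):
    """Build a search body: bucket by hour, then p95 latency per bucket."""
    return {
        "size": 0,  # aggregation only, skip fetching hits
        "aggs": {
            "per_hour": {
                # Bucket aggregation: one bucket per hour.
                "date_histogram": {"field": time_field, "fixed_interval": "1h"},
                "aggs": {
                    # Metric aggregation, computed inside each hourly bucket.
                    "latency_p95": {
                        "percentiles": {"field": latency_field, "percents": [95]}
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(hourly_p95_body(), indent=2))
```

The inner `aggs` key is what makes the metric run once per bucket rather than once over the whole result set.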
Terms agg is an approximation
Every shard returns only its top shard_size terms (default size · 1.5 + 10), and the coordinator merges those partial lists. A term that misses the cut on some shards is undercounted or dropped entirely, so terms can return inaccurate counts for long-tail buckets. The response includes doc_count_error_upper_bound for exactly this reason.
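To see how the per-shard cutoff distorts results, here is a toy model in plain Python. It is a sketch of the merge behaviour described above, not Elasticsearch's implementation; the shard contents and function names are made up for illustration.

```python
from collections import Counter


def default_shard_size(size):
    """Default shard_size as stated in the text: size * 1.5 + 10."""
    return int(size * 1.5 + 10)


def simulated_terms(shards, size, shard_size=None):
    """Toy terms agg: each shard reports only its top shard_size terms,
    and the coordinator sums whatever it received."""
    if shard_size is None:
        shard_size = default_shard_size(size)
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# "b" is the true global winner (8 docs) but is never first on any shard.
shards = [
    ["a"] * 5 + ["b"] * 4,
    ["c"] * 6 + ["b"] * 4,
]

if __name__ == "__main__":
    print(simulated_terms(shards, size=1, shard_size=1))  # misses "b"
    print(simulated_terms(shards, size=1, shard_size=2))  # finds "b"
```

With `shard_size=1` neither shard ever reports "b", so the coordinator cannot know it exists. Raising `shard_size` trades more per-shard work for better accuracy.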
For exhaustive pagination over high-cardinality fields, use composite:
{
  "size": 0,
  "aggs": {
    "pages": {
      "composite": {
        "size": 1000,
        "sources": [{ "host": { "terms": { "field": "host.keyword" } } }]
      }
    }
  }
}
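A composite response carries an `after_key`; passing it back as `after` fetches the next page. The loop below is a minimal sketch of that pagination. The `search` parameter is a hypothetical callable that takes a request body and returns the response as a dict (for example a thin wrapper around whichever client you use); it is not a real library function.

```python
import copy


def iter_composite_buckets(search, body, agg_name="pages"):
    """Yield every bucket of a composite agg, following after_key page by page."""
    body = copy.deepcopy(body)  # leave the caller's request untouched
    while True:
        result = search(body)["aggregations"][agg_name]
        buckets = result["buckets"]
        yield from buckets
        after_key = result.get("after_key")
        # An empty page or a missing after_key means there is nothing left.
        if not buckets or after_key is None:
            return
        body["aggs"][agg_name]["composite"]["after"] = after_key
```

Because each page resumes from the last key rather than from an offset, memory use stays bounded by the page size no matter how many distinct values the field has.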
Cardinality uses HyperLogLog
cardinality is approximate: it is built on HyperLogLog++, with accuracy tunable via precision_threshold. Below the threshold, counts are close to exact; above it, error grows to roughly 1–2%. If you genuinely need exact distinct counts, you're looking at a different tool (a SQL warehouse).
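A request that sets the threshold explicitly might look like the sketch below. The aggregation name `distinct_values` and the example field are assumptions; 3000 and 40000 are the default and maximum that Elasticsearch documents for precision_threshold, and values above the maximum have no additional effect, so the helper clamps them.

```python
MAX_PRECISION_THRESHOLD = 40000  # documented upper bound for precision_threshold


def distinct_count_body(field, precision_threshold=3000):
    """Build an aggregation-only request estimating distinct values of `field`."""
    return {
        "size": 0,
        "aggs": {
            "distinct_values": {
                "cardinality": {
                    "field": field,
                    # Higher threshold: more memory per bucket, better accuracy.
                    "precision_threshold": min(
                        precision_threshold, MAX_PRECISION_THRESHOLD
                    ),
                }
            }
        },
    }
```

For example, `distinct_count_body("user.id", 10000)` trades extra memory for near-exact counts up to about ten thousand distinct users (`user.id` is a hypothetical field).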
date_histogram alignment
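fixed_interval: 1h gives deterministic buckets of constant length, aligned to epoch. calendar_interval: 1d is timezone-aware and aligned to calendar boundaries, so a bucket can be longer or shorter than 24 hours across a daylight-saving change. Mixing the two across dashboards is a classic source of "my totals don't match."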
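The mismatch is easy to reproduce outside Elasticsearch. The sketch below is a simplified model of the two alignment rules, not Elasticsearch's rounding code; the UTC-5 offset and the sample timestamp are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fixed_bucket_start(ts, interval):
    """Fixed-length buckets counted from the epoch."""
    return EPOCH + ((ts - EPOCH) // interval) * interval


def calendar_day_start(ts, tz):
    """Calendar-day buckets that begin at local midnight in `tz`."""
    local = ts.astimezone(tz)
    return local.replace(hour=0, minute=0, second=0, microsecond=0)


event = datetime(2024, 3, 10, 2, 30, tzinfo=timezone.utc)
utc_minus_5 = timezone(timedelta(hours=-5))

if __name__ == "__main__":
    # Same event, two different "days".
    print(fixed_bucket_start(event, timedelta(days=1)).isoformat())
    print(calendar_day_start(event, utc_minus_5).isoformat())
```

The fixed bucket files the event under 10 March (UTC), while the calendar bucket files it under 9 March local time. Two dashboards that disagree on this rule will disagree on daily totals.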
Common performance mistakes
- Running a terms agg on a text field. Always use the .keyword sub-field; otherwise the request either fails (fielddata is disabled on text fields by default) or aggregates on analysed tokens.
- Requesting too many buckets. A 1-second histogram over 30 days returns about 2.6M buckets (30 × 86,400 = 2,592,000), enough to exhaust the coordinator's memory before it can respond.
- Forgetting "size": 0. If you only want the aggregation, fetching documents is wasted work.
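The bucket arithmetic is worth checking before a query is sent. The helpers below are a rough sketch: they estimate the bucket count for a time range and pick the finest interval that fits a budget. The `max_buckets` budget is a number you choose; it is an assumption of this sketch, not a value read from the cluster.

```python
import math
from datetime import timedelta


def estimated_buckets(span, interval):
    """Upper bound on date_histogram buckets for a time span."""
    return math.ceil(span / interval)


def finest_safe_interval(span, candidates, max_buckets):
    """Return the smallest candidate interval that stays within max_buckets."""
    for interval in sorted(candidates):
        if estimated_buckets(span, interval) <= max_buckets:
            return interval
    raise ValueError("no candidate interval fits the bucket budget")


CANDIDATES = [
    timedelta(seconds=1),
    timedelta(minutes=1),
    timedelta(hours=1),
    timedelta(days=1),
]
```

For a 30-day range, `estimated_buckets(timedelta(days=30), timedelta(seconds=1))` gives 2,592,000, and with a budget of 10,000 buckets `finest_safe_interval` settles on one hour (720 buckets).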
Transforms: pre-aggregated indices
For dashboards that rerun the same aggregation every 10 seconds, create a transform job. It materialises a compact summary index on a schedule; the dashboard queries the summary, not the raw logs. Cost drops by orders of magnitude.
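A transform is defined by a source, a destination, and a pivot. The sketch below builds such a definition as a Python dict, suitable as the body of a `PUT _transform/<id>` request. The index names, field names (`@timestamp`, `host.keyword`, `latency_ms`), and the one-minute frequency are all assumptions for illustration.

```python
def transform_config(source_index, dest_index, time_field="@timestamp"):
    """Build a continuous pivot transform: hourly p95 latency per host."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "frequency": "1m",  # how often to check the source for new data
        # Continuous mode: pick up documents as they arrive, with a delay
        # to allow for late indexing.
        "sync": {"time": {"field": time_field, "delay": "60s"}},
        "pivot": {
            "group_by": {
                "hour": {
                    "date_histogram": {"field": time_field, "fixed_interval": "1h"}
                },
                "host": {"terms": {"field": "host.keyword"}},
            },
            "aggregations": {
                "latency_p95": {
                    "percentiles": {"field": "latency_ms", "percents": [95]}
                }
            },
        },
    }
```

The dashboard then queries the destination index, which holds one small document per host per hour instead of every raw log line.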