Aggregations Deep Dive: Bucket, Metric, and the Pitfalls

Aggregations are the analytics engine behind Kibana. Know how terms, date_histogram, and cardinality actually execute.

October 18, 2025 · 8 min read · Elasticsearch · Analytics

Aggregations turn an Elasticsearch cluster into a sharded analytics engine. They're also where most "why is Kibana slow?" investigations end.

The two families

  • Bucket aggregations group documents: terms, date_histogram, range, composite.
  • Metric aggregations compute over a bucket: avg, percentiles, cardinality.

You nest them: bucket by hour, then compute p95 latency per bucket.
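As a rough sketch of that nesting, the request body below buckets by hour and computes a p95 inside each bucket. The field names `@timestamp` and `latency_ms` and the aggregation names `per_hour` and `latency_p95` are assumptions for illustration, not taken from any particular index.

```python
import json


def hourly_p95_body(timestamp_field: str, latency_field: str) -> dict:
    """Build a search body: bucket by hour, then p95 of latency per bucket."""
    return {
        "size": 0,  # aggregation only, no hits
        "aggs": {
            "per_hour": {  # bucket aggregation
                "date_histogram": {
                    "field": timestamp_field,
                    "fixed_interval": "1h",
                },
                "aggs": {
                    "latency_p95": {  # metric aggregation, computed per bucket
                        "percentiles": {
                            "field": latency_field,
                            "percents": [95],
                        }
                    }
                },
            }
        },
    }


# Hypothetical field names; substitute the ones in your mapping.
body = hourly_p95_body("@timestamp", "latency_ms")
print(json.dumps(body, indent=2))
```

The inner `aggs` key is what makes the metric run once per bucket rather than once over the whole result set.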

Terms agg is an approximation

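Every shard returns only its top shard_size terms (default size · 1.5 + 10), and the coordinator merges those partial lists. A term that misses the cut on one shard loses that shard's count, so terms can under-count long-tail buckets or leave them out entirely. The result includes doc_count_error_upper_bound for exactly this reason: it is an upper bound on the count error this truncation can introduce.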
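A toy simulation can make the merge error concrete. This is not how Elasticsearch is implemented internally, only a sketch of the same idea: each shard keeps its own top list and the coordinator sums what it receives. The shard contents are made up, and shard_size is set deliberately small so the effect shows with three terms.

```python
from collections import Counter


def default_shard_size(size: int) -> int:
    """Default shard_size as described above: size * 1.5 + 10."""
    return int(size * 1.5 + 10)


def merged_terms(shards, size, shard_size):
    """Sum each shard's top `shard_size` terms, then keep the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Made-up per-shard term counts.
shards = [
    Counter({"a": 10, "b": 9, "c": 8}),
    Counter({"a": 10, "c": 7, "b": 5}),
]

approx = merged_terms(shards, size=2, shard_size=2)  # truncated per shard
exact = merged_terms(shards, size=2, shard_size=3)   # every term from every shard

print(default_shard_size(10))  # 25
print(approx)                  # [('a', 20), ('b', 9)]
print(exact)                   # [('a', 20), ('c', 15)]
```

With the truncated lists, "b" is reported with a count of 9 even though its true count is 14, and "c" (true count 15) is pushed out of the result. Raising shard_size trades coordinator work for accuracy.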

For exhaustive pagination over high-cardinality fields, use composite:

{
  "size": 0,
  "aggs": {
    "pages": {
      "composite": {
        "size": 1000,
        "sources": [{ "host": { "terms": { "field": "host.keyword" } } }]
      }
    }
  }
}
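Paging through a composite aggregation means feeding each response's after_key back in as `after` on the next request. The loop below is a minimal sketch of that: `search` is a hypothetical callable that takes a request body and returns the parsed response, and `fake_search` is an in-memory stand-in so the example runs without a cluster.

```python
from typing import Callable, Iterator


def iter_composite_buckets(
    search: Callable[[dict], dict], field: str, page_size: int = 1000
) -> Iterator[dict]:
    """Yield every bucket of a composite agg, one page at a time."""
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"host": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            composite["after"] = after_key  # resume after the last bucket seen
        body = {"size": 0, "aggs": {"pages": {"composite": composite}}}
        result = search(body)["aggregations"]["pages"]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if after_key is None or not result["buckets"]:
            break  # no more pages


def fake_search(body: dict) -> dict:
    """Hypothetical stand-in for a real search call, backed by a fixed host list."""
    hosts = ["web-1", "web-2", "web-3"]
    comp = body["aggs"]["pages"]["composite"]
    after = comp.get("after", {}).get("host")
    remaining = [h for h in hosts if after is None or h > after]
    page = remaining[: comp["size"]]
    resp = {"buckets": [{"key": {"host": h}, "doc_count": 1} for h in page]}
    if page:
        resp["after_key"] = {"host": page[-1]}
    return {"aggregations": {"pages": resp}}


hosts_seen = [
    b["key"]["host"]
    for b in iter_composite_buckets(fake_search, "host.keyword", page_size=2)
]
print(hosts_seen)  # ['web-1', 'web-2', 'web-3']
```

In real use, `search` would wrap whatever client you already have; the loop itself only depends on the `buckets` and `after_key` fields of the response.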

Cardinality uses HyperLogLog

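cardinality is approximate: it is built on HyperLogLog++ and tunable via precision_threshold (default 3000, maximum 40000). Below the threshold, counts are expected to be close to exact; above it, error grows to roughly 1–2%. A higher threshold costs memory, about precision_threshold × 8 bytes per aggregation. If you genuinely need exact distinct counts, you're looking at a different tool (a SQL warehouse).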
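A request that sets the threshold explicitly might look like the sketch below. The field name `user.id` and the aggregation name `unique_users` are assumptions for illustration; the helper caps the value at 40000 because higher thresholds have no additional effect.

```python
import json

MAX_PRECISION_THRESHOLD = 40_000  # values above this behave the same as 40000


def cardinality_body(field: str, precision_threshold: int = 3000) -> dict:
    """Build a distinct-count request with an explicit precision_threshold."""
    threshold = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return {
        "size": 0,
        "aggs": {
            "unique_users": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": threshold,
                }
            }
        },
    }


# Hypothetical field name; asking for 100000 is capped to 40000.
body = cardinality_body("user.id", precision_threshold=100_000)
print(json.dumps(body, indent=2))
```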

date_histogram alignment

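fixed_interval: 1h gives fixed-length buckets (always exactly 3,600 seconds) aligned to the epoch. calendar_interval: 1d follows calendar boundaries in the request's time_zone, so a "day" can be 23 or 25 hours long across a daylight-saving change. Mixing the two across dashboards is a classic source of "my totals don't match."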
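The arithmetic behind the mismatch can be sketched in a few lines. This illustrates the two bucketing rules, not Elasticsearch's own code, and it assumes a fixed UTC+7 offset as the dashboard time zone and no time_zone on the fixed-interval histogram.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fixed_bucket_start(ts: datetime, interval: timedelta) -> datetime:
    """Start of the epoch-aligned, fixed-length bucket containing ts."""
    n = (ts - EPOCH) // interval  # whole intervals since the epoch
    return EPOCH + n * interval


def calendar_day_start(ts: datetime, tz: timezone) -> datetime:
    """Start of the calendar day containing ts, as seen in time zone tz."""
    local = ts.astimezone(tz)
    return local.replace(hour=0, minute=0, second=0, microsecond=0)


tz = timezone(timedelta(hours=7))  # assumed dashboard time zone
ts = datetime(2025, 10, 17, 20, 30, tzinfo=timezone.utc)

hour_fixed = fixed_bucket_start(ts, timedelta(hours=1))
day_fixed = fixed_bucket_start(ts, timedelta(hours=24))
day_calendar = calendar_day_start(ts, tz)

print(hour_fixed.isoformat())    # 2025-10-17T20:00:00+00:00
print(day_fixed.isoformat())     # 2025-10-17T00:00:00+00:00
print(day_calendar.isoformat())  # 2025-10-18T00:00:00+07:00
```

The same event lands in the "October 17" bucket under a 24-hour fixed interval and in the "October 18" bucket under a calendar day in UTC+7, which is how two panels over the same data end up with different daily totals.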

Common performance mistakes

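  1. Running a terms agg on a text field. By default Elasticsearch rejects the request; enabling fielddata makes it run, but you aggregate on analysed tokens and pay for it in heap. Use a keyword field instead (the .keyword sub-field under default dynamic mappings).
  2. Requesting too many buckets. A 1-second histogram over 30 days asks for about 2.6M buckets (30 × 86,400 = 2,592,000). Recent versions reject the request once it exceeds search.max_buckets (65,536 by default); if that limit has been raised, the coordinator can run out of memory before it responds.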
  3. Forgetting "size": 0. If you only want the aggregation, fetching documents is wasted work.
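The bucket-count arithmetic in item 2 is easy to check before a query is sent. The helper below is a small sketch; the 65,536 limit is an assumption matching the default search.max_buckets in recent versions, so adjust it to your cluster's setting.

```python
from datetime import timedelta

MAX_BUCKETS = 65_536  # assumed limit; check search.max_buckets on your cluster


def estimated_buckets(time_range: timedelta, interval: timedelta) -> int:
    """Upper estimate of date_histogram buckets (ceiling division)."""
    return -(-time_range // interval)


def is_safe(time_range: timedelta, interval: timedelta) -> bool:
    """True if the histogram stays within the assumed bucket limit."""
    return estimated_buckets(time_range, interval) <= MAX_BUCKETS


month = timedelta(days=30)
print(estimated_buckets(month, timedelta(seconds=1)))  # 2592000
print(estimated_buckets(month, timedelta(hours=1)))    # 720
print(is_safe(month, timedelta(seconds=1)))            # False
print(is_safe(month, timedelta(minutes=1)))            # True
```

A check like this in the dashboard or API layer lets you widen the interval automatically instead of letting the request fail.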

Transforms: pre-aggregated indices

For dashboards that rerun the same aggregation every 10 seconds, create a transform job. It materialises a compact summary index on a schedule; the dashboard queries the summary, not the raw logs. Cost drops by orders of magnitude.
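A pivot transform definition for that pattern could look roughly like the body below, which would be sent to the transform API and then started. Every name here is an assumption for illustration: the `logs-*` source pattern, the `logs-summary-15m` destination, the `@timestamp`, `host.keyword`, and `latency_ms` fields, and the 15-minute bucket and 1-minute frequency.

```python
import json

# Hypothetical transform definition: summarise raw logs into 15-minute rows per host.
transform_body = {
    "source": {"index": "logs-*"},
    "dest": {"index": "logs-summary-15m"},
    "frequency": "1m",  # how often the transform checks for new data
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},  # continuous mode
    "pivot": {
        "group_by": {
            "bucket_start": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "15m"}
            },
            "host": {"terms": {"field": "host.keyword"}},
        },
        "aggregations": {
            "avg_latency": {"avg": {"field": "latency_ms"}},
            "requests": {"value_count": {"field": "latency_ms"}},
        },
    },
}

print(json.dumps(transform_body, indent=2))
```

The dashboard then queries the destination index, where each document is already one (bucket, host) row, so the expensive aggregation over raw logs no longer runs on every refresh.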

Reader Discussion

7 replies

  1. Highlighted by author
    Raghav Sharma · Search Engineer · Flipkart · From experience

    rescore vs raw function_score is the single biggest relevance win I've shipped this year. +0.08 NDCG@10 on product search, zero recall regression, two weeks of work. If you have a serious search funnel and you're not rescoring, you're leaving money on the table.

    Oct 20, 2025 · 2 days later
  2. Michiel de Vries · Observability Lead · Agrees

    ILM + rollover turned our cluster from a 4-times-a-week pager generator into something on-call literally forgets exists for months at a time. If a post wants one takeaway it should be this.

    Oct 21, 2025 · 3 days later
  3. Thành Võ · Backend · Pushback

    tiny correction: BM25 k1 is term frequency saturation, not a boost weight as many ES blog posts mistakenly claim. b is the one that controls length normalization. spelling this out helps readers tune with fewer mistakes.

    Oct 26, 2025 · 1 week later · edited
  4. Kenji Itō · Staff Engineer · From experience

    Transform jobs are the single cheapest dashboard win in Elastic. We had a Kibana panel taking 4.2s to load on cold-cache; materialised the same 15-min summary into a transform index, dropped to 140ms. Three lines of YAML.

    Oct 23, 2025 · 5 days later
  5. Nora Eriksen · Search Platform · From experience

    shard sizing rule of thumb people forget: 30-50GB per primary shard. We had 800 tiny shards on a 12-node cluster — heap was constantly under pressure and nobody knew why. consolidating to ~80 shards fixed it overnight.

    Oct 22, 2025 · 4 days later
  6. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Oct 20, 2025 · 2 days later
  7. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Oct 22, 2025 · 4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.