
When Elastic Refused to Search: The Mapping Bomb Diary

A vendor pushed 47 million docs of untyped JSON into our index. Field count exploded past 12,000. Here's how we crawled out, day by day.

December 08, 2025 · 13 min read · Elasticsearch, Postmortem, Indexing

This one didn't happen overnight. It happened over nine days, like a slow-motion car crash you watch through binoculars while sipping coffee. By the end I had a postmortem doc, an internal training, and a permanent allergy to the words "schemaless" and "just ingest the JSON."

Day 0: "Just point it at the new vendor feed"

The product team had inked a deal with a third-party data vendor. They'd send us product enrichment data — reviews, sentiment, related-product graphs — as JSON over an SQS queue. We'd stuff it into Elastic, surface it in search, and call it a feature.

I asked twice for a schema. I got back: "It's just JSON, you can index it as-is." Famous last words.

Day 1: First 200k docs in

Looked fine. Search worked. Cluster green. Two people from the analytics team did a happy dance.

What I didn't notice: the field count on the new index had hit 870 on the first day. The index.mapping.total_fields.limit on our cluster was the default, 1,000. We had budgeted for 200, max.

Day 2: "Why is everyone's search slow?"

Latency on the main product index (a different index entirely) was up. Not catastrophic, but clearly worse. CPU had spiked across all data nodes. The new ingest had pushed our hot tier into memory pressure.

Fields are not free in Elastic. Each one adds mapping metadata plus its own term dictionary and doc-values structures, and all of that costs memory. With 870 fields, our heap usage on the hot nodes had jumped 4 GB. Garbage collection was working overtime.

Day 3: The mapping limit hits

The vendor had been adding fields organically. Day 3, an ingest batch added field number 1,001. The cluster started rejecting docs:

"type": "illegal_argument_exception",
"reason": "Limit of total fields [1000] in index
  [vendor_enrichment] has been exceeded"

Two paths: bump the limit, or fix the ingest. I bumped the limit to 5,000 "as a temporary measure." Reader, it was not temporary.
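For reference, bumping the limit is a one-line dynamic settings update. The sketch below only builds the request (method, path, JSON body) for the index settings API and does not send it; the helper name is made up for illustration, and the index name is the one from the error above.

```python
import json


def total_fields_limit_request(index, limit):
    """Build (method, path, body) for raising an index's total-fields limit.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    body = {"index.mapping.total_fields.limit": limit}
    return "PUT", f"/{index}/_settings", json.dumps(body)


method, path, body = total_fields_limit_request("vendor_enrichment", 5000)
print(method, path)
print(body)
```

The setting is per-index, which is why the path names the index. Raising it treats the symptom only: the mapping keeps growing underneath.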

Day 4-5: The weekend of horror

By Sunday afternoon the field count was 7,200. The fields kept multiplying because some review documents had nested JSON like:

{
  "review_id": "abc",
  "metadata": {
    "scrape_2025_11_18_14_22_run_4": "ok",
    "internal_tag_xj3p9": "true",
    "vendor_session_a8f0e21": "expired"
  }
}

The vendor was, charmingly, embedding their internal scrape session IDs as JSON keys. Each new scrape run created roughly 30 new fields. Multiply by their batch size and we were looking at thousands of fields per day, forever.
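To see why keys-as-data is so corrosive, here is a minimal sketch that collects the distinct dotted leaf paths across a batch of documents, which is roughly what dynamic mapping turns into fields. The second document and its key are invented for illustration, and the sketch ignores arrays and the object fields themselves (which Elastic also counts).

```python
def leaf_paths(doc, prefix=""):
    """Yield the dotted path of every leaf value in a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from leaf_paths(value, prefix=f"{path}.")
        else:
            yield path


docs = [
    {"review_id": "abc", "metadata": {"scrape_2025_11_18_14_22_run_4": "ok"}},
    # Hypothetical second document from a later scrape run.
    {"review_id": "def", "metadata": {"scrape_2025_11_18_15_07_run_5": "ok"}},
]

seen = set()
for doc in docs:
    seen.update(leaf_paths(doc))

print(len(seen))  # 3: review_id plus one brand-new field per scrape run
```

Two documents with the same logical shape produce three fields, and every further run adds more. The count grows with the number of distinct keys, not the number of documents.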

This is the textbook mapping explosion. The fix: do not let the source dictate your schema.

Day 6: The decision

I had two real options.

Option A: Reindex with a strict schema. Take only the fields we cared about (review text, sentiment score, product ID, timestamp). Discard the rest, or stuff them into a flattened field type that doesn't blow up the mapping. Reindex 47M docs. ETA: 30 hours.

Option B: Use dynamic templates to coerce. Cleverer. Set up a dynamic template that mapped any field starting with scrape_ or vendor_session_ as type: keyword, doc_values: false, index: false. Costs disk, saves heap. ETA: 2 hours.

I went with B for the immediate bleeding, and lined up A as the proper fix.

"dynamic_templates": [
  {
    "scrape_metadata": {
      "path_match": "metadata.scrape_*",
      "mapping": { "type": "keyword", "index": false, "doc_values": false }
    }
  },
  {
    "no_index_internal": {
      "path_match": "metadata.internal_*",
      "mapping": { "type": "keyword", "index": false, "doc_values": false }
    }
  }
]

Heap usage dropped within an hour. Field count growth slowed but didn't stop — there were other classes of garbage I hadn't templated for.
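One way to find the garbage you haven't templated for is to take the field paths seen in a sample of documents and check them against your patterns. This is a rough sketch, not what we ran: it uses Python's fnmatch as a stand-in for Elasticsearch's simple wildcard matching on full dotted paths (the way path_match works), and the known-field set is an assumption.

```python
from fnmatch import fnmatchcase

# Patterns from the dynamic templates, treated as full dotted paths.
PATTERNS = ["metadata.scrape_*", "metadata.internal_*"]

# Fields we map explicitly (illustrative; only review_id appears in the post).
KNOWN_FIELDS = {"review_id"}


def untemplated(paths, patterns=PATTERNS, known=KNOWN_FIELDS):
    """Return paths that are neither explicitly mapped nor covered by a pattern."""
    return sorted(
        p for p in paths
        if p not in known and not any(fnmatchcase(p, pat) for pat in patterns)
    )


sample = [
    "review_id",
    "metadata.scrape_2025_11_18_14_22_run_4",
    "metadata.internal_tag_xj3p9",
    "metadata.vendor_session_a8f0e21",
]
print(untemplated(sample))  # ['metadata.vendor_session_a8f0e21']
```

fnmatch also gives meaning to `?` and `[...]`, which Elasticsearch's simple patterns do not, so this is only an approximation for keys containing those characters.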

Day 7-8: The reindex

I built a new index with a strict mapping: 73 hand-picked fields, with all the noise routed into a single flattened field called raw_metadata. flattened is a beautiful escape hatch. It indexes a whole JSON object as a single field, and you can still query individual sub-keys, but Elastic doesn't track them in the mapping. The trade-off is that every sub-key value is treated as a keyword, so you get no full-text analysis on them. It's the right tool when the source schema is enemy territory.
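A cut-down sketch of that shape, under assumptions: the real mapping had 73 fields, and the field names below (apart from review_id and raw_metadata) are guesses based on the list in Option A. The transform is the edge-side half: keep what the mapping knows about, push everything else under raw_metadata.

```python
# Illustrative subset of a strict mapping; field names are assumptions.
STRICT_MAPPING = {
    "mappings": {
        "dynamic": "strict",  # reject documents that carry unmapped fields
        "properties": {
            "review_id": {"type": "keyword"},
            "review_text": {"type": "text"},
            "sentiment_score": {"type": "float"},
            "product_id": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "raw_metadata": {"type": "flattened"},
        },
    }
}

KEPT_FIELDS = set(STRICT_MAPPING["mappings"]["properties"]) - {"raw_metadata"}


def to_strict_doc(source):
    """Keep mapped top-level fields; move everything else under raw_metadata."""
    doc = {k: v for k, v in source.items() if k in KEPT_FIELDS}
    leftovers = {k: v for k, v in source.items() if k not in KEPT_FIELDS}
    if leftovers:
        doc["raw_metadata"] = leftovers
    return doc


vendor_doc = {
    "review_id": "abc",
    "metadata": {"scrape_2025_11_18_14_22_run_4": "ok"},
}
print(to_strict_doc(vendor_doc))
```

With "dynamic": "strict", anything the transform misses is rejected at index time instead of silently widening the mapping, which is the fail-fast behaviour you want here.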

Reindex took 33 hours. Cutover at 04:30 ICT on day 8. Aliases swung. Cluster CPU dropped to baseline. We were back.
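The alias swing is worth spelling out, because doing the remove and the add in a single request to the aliases API makes the switch atomic for readers. A minimal sketch that builds the request body; the alias and new index names are hypothetical.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves an alias in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: searches go through the alias, never the raw index name.
body = alias_swap_body(
    alias="vendor_enrichment_read",
    old_index="vendor_enrichment",
    new_index="vendor_enrichment_v2",
)
print(json.dumps(body, indent=2))
```

Because both actions land in the same request, there is no window in which the alias points at nothing or at both indices.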

Day 9: The rules I added to the team handbook

  1. Never ingest untrusted JSON without an explicit schema. If the vendor refuses, transform at the edge. The cost of the transformation is always less than the cost of a mapping explosion.
  2. Set a strict total_fields.limit. The default of 1,000 is generous. We run 500 in production now. Hitting the limit fails fast, which is what we want.
  3. Use flattened for variable-shape blobs. If you genuinely don't know what's in the data, flattened is the type you want. You give up some search granularity. You keep your cluster.
  4. Watch field count as a first-class metric. Right alongside heap, CPU, and indexing latency. We page when an index crosses 80% of its mapping limit.
  5. The vendor relationship is a technical contract. The data they send is the contract. We now require a schema doc before we'll ingest. Sales hates this. Engineering loves it.
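Rule 4 is easy to sketch. Assuming you already have the mappings object for an index (the "mappings" part of a get-mapping response), a recursive walk gives an approximate field count to compare against the limit. The function names and the sample mapping are illustrative, and the count is an approximation of what Elastic enforces: it includes object fields and multi-fields but ignores things like runtime fields.

```python
def count_mapped_fields(properties):
    """Approximate field count: each field, nested properties, and multi-fields."""
    total = 0
    for field in properties.values():
        total += 1  # the field (or object) itself
        total += count_mapped_fields(field.get("properties", {}))
        total += len(field.get("fields", {}))  # multi-fields such as title.raw
    return total


def should_page(mapping, limit, ratio=0.8):
    """Return (page?, fields_used) for a mappings object and its field limit."""
    used = count_mapped_fields(mapping.get("properties", {}))
    return used >= limit * ratio, used


example_mapping = {
    "properties": {
        "review_id": {"type": "keyword"},
        "title": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
        "metadata": {
            "properties": {
                "a": {"type": "keyword"},
                "b": {"type": "keyword"},
            }
        },
    }
}

print(should_page(example_mapping, limit=7))   # (True, 6)
print(should_page(example_mapping, limit=10))  # (False, 6)
```

Emit the count as a metric per index and alert on the ratio, so the page fires well before ingest starts rejecting documents.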

The thing I keep coming back to

Elastic is permissive by design. That's the feature. You can throw any document at it and it'll figure out the mapping. That permissiveness is also what makes it dangerous — it lets a poorly-controlled upstream wreak havoc on a perfectly-controlled cluster. The discipline isn't on Elastic; it's on you.

The single sentence I now write at the top of every Elastic onboarding doc: your schema is whatever your noisiest producer wants it to be, unless you take it back. Take it back.


Reader Discussion

  1. Highlighted by author
    Raghav Sharma · Search Engineer · Flipkart · From experience

    rescore vs raw function_score is the single biggest relevance win I've shipped this year. +0.08 NDCG@10 on product search, zero recall regression, two weeks of work. If you have a serious search funnel and you're not rescoring, you're leaving money on the table.

    Dec 10, 2025 · 2 days later
  2. Pierre Lambert · Senior Engineer · Agrees

    people obsess over the query and ignore the analyzer. half the relevance bugs I've debugged were because the index analyzer and the search analyzer disagreed. boring fix, huge wins.

    Dec 14, 2025 · 6 days later
  3. Rashida Hassan · ML Eng · Asks

    would love a follow-up on hybrid bm25 + dense vector search in 8.x. we A/B'd it last quarter and the BM25 head still wins on long-tail queries by a surprising margin.

    Dec 15, 2025 · 1 week later
  4. Olivia Bennett · Data Engineer · Story

    Mapping explosion off untrusted JSON cost us a 2-billion-doc reindex last year. Six engineers, two weekends, one director apology email. "Never ingest untrusted JSON" should be on the office wall in 72pt.

    Dec 12, 2025 · 4 days later
  5. Michiel de Vries · Observability Lead · Agrees

    ILM + rollover turned our cluster from a 4-times-a-week pager generator into something on-call literally forgets exists for months at a time. If a post wants one takeaway it should be this.

    Dec 11, 2025 · 3 days later
  6. Isabella Costa · Junior Engineer · Kind words

    saved this. sharing at standup tomorrow — we've had exactly this problem for 2 sprints and nobody on the team had framed it this way 🙏

    Dec 10, 2025 · 2 days later
  7. Kenta Yamada · Tech Lead · Asks

    would love a war-story follow-up. principles are clear; the actual debugging session is where the interesting stuff lives. there's a real shortage of "here's the dashboard, here's the thread we pulled, here's where we got stuck for 90 mins" content.

    Dec 12, 2025 · 4 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email