
Inverted Index 101: How Elasticsearch Actually Finds Things

From document to token to posting list — the data structure behind sub-millisecond full-text search.

July 30, 2025 · 7 min read · Elasticsearch · Fundamentals

Everyone says Elasticsearch is fast. The reason is not magic — it's a data structure called the inverted index, built incrementally by Lucene under every index you create.

From text to tokens

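When you index a document, the analyzer breaks each text field into tokens. With the built-in english analyzer, "The quick brown foxes!" becomes [quick, brown, fox]: lowercased, punctuation stripped, stop words removed, stemmed. (The default standard analyzer skips the last two steps and yields [the, quick, brown, foxes].)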
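To make the pipeline concrete, here is a minimal Python sketch of the same steps. It is a toy, not Lucene's implementation: the stop-word list is a tiny illustrative sample and `crude_stem` is a hypothetical stand-in for a real stemmer, so it only approximates what the english analyzer does.

```python
import re

# Tiny illustrative stop-word list; the real english analyzer ships a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def crude_stem(token: str) -> str:
    """Very rough plural stripping; a stand-in for a real stemmer."""
    if len(token) > 4 and token.endswith("es"):
        return token[:-2]
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def toy_analyze(text: str) -> list[str]:
    # Lowercase, then keep only runs of letters and digits (drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words, then stem what is left.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(toy_analyze("The quick brown foxes!"))  # ['quick', 'brown', 'fox']
```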

The posting list

Lucene then flips the map around. Instead of doc → words, it keeps word → docs:

fox    -> [doc1, doc7, doc42]
brown  -> [doc1, doc4, doc42, doc99]
quick  -> [doc1, doc7, doc12, doc42]
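A rough sketch of how such a map could be built, assuming documents have already been analyzed into token lists. The function name and sample documents are made up for illustration; Lucene's on-disk format is far more compact than a Python dict.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each term to the sorted list of doc ids that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    # Sorted posting lists are what make fast intersection possible later.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical, already-analyzed documents.
docs = {
    1: ["quick", "brown", "fox"],
    4: ["brown", "bear"],
    7: ["quick", "fox"],
}
index = build_inverted_index(docs)
print(index["fox"])    # [1, 7]
print(index["brown"])  # [1, 4]
```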

A query that requires all of quick, brown, and fox intersects three sorted lists. The cost scales with the length of those lists, not with the total number of documents in the corpus.
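The intersection itself can be sketched as a two-pointer walk over sorted lists. This is a simplified illustration using the posting lists from the example above (doc ids as plain integers); Lucene additionally uses skip data to jump ahead, which this sketch leaves out.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer intersection of two sorted lists of doc ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(lists: list[list[int]]) -> list[int]:
    """Intersect several posting lists, shortest first to shrink the work early."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect(result, other)
        if not result:
            break
    return result


postings = {
    "fox": [1, 7, 42],
    "brown": [1, 4, 42, 99],
    "quick": [1, 7, 12, 42],
}
print(intersect_all([postings[t] for t in ("quick", "brown", "fox")]))  # [1, 42]
```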

Segments are immutable

Every refresh writes the docs indexed since the previous refresh into a new immutable segment and makes it searchable. A shard is a bag of segments. Lucene never edits a segment in place: deletes are tombstoned, updates are delete + re-insert. Periodic merges rewrite smaller segments into larger ones, dropping tombstoned docs along the way.
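The lifecycle above can be modelled in a few lines. This is a toy model under simplifying assumptions, not how Lucene stores anything: `ToyShard` and `Segment` are hypothetical names, segments are plain tuples, and a merge collapses everything into one segment rather than picking candidates by size.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    docs: tuple  # ((doc_id, source), ...), fixed once the segment is written


class ToyShard:
    def __init__(self):
        self.buffer = {}    # doc_id -> source; None marks a pending delete
        self.segments = []  # list of (Segment, set of tombstoned doc ids)

    def index(self, doc_id, source):
        self.buffer[doc_id] = source

    def delete(self, doc_id):
        self.buffer[doc_id] = None

    def refresh(self):
        if not self.buffer:
            return
        # Tombstone older copies instead of editing existing segments.
        for segment, tombstones in self.segments:
            ids = {doc_id for doc_id, _ in segment.docs}
            tombstones.update(ids.intersection(self.buffer))
        fresh = tuple((i, s) for i, s in self.buffer.items() if s is not None)
        if fresh:
            self.segments.append((Segment(fresh), set()))
        self.buffer = {}

    def live_docs(self):
        """Everything a search would see: segment docs minus tombstones."""
        live = {}
        for segment, tombstones in self.segments:
            for doc_id, source in segment.docs:
                if doc_id not in tombstones:
                    live[doc_id] = source
        return live

    def merge(self):
        # Rewrite all segments into one, physically dropping tombstoned docs.
        live = self.live_docs()
        self.segments = [(Segment(tuple(live.items())), set())] if live else []


shard = ToyShard()
shard.index(1, "v1")
shard.index(2, "x")
shard.refresh()
shard.index(1, "v2")  # update = tombstone old copy + write new copy
shard.delete(2)
shard.refresh()
print(shard.live_docs())     # {1: 'v2'}
print(len(shard.segments))   # 2
shard.merge()
print(len(shard.segments))   # 1
```

Note that nothing is searchable until `refresh()` runs, and the first segment still physically holds both old docs until `merge()` rewrites it.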

Consequences for you

  • Refresh is not free. Each refresh opens a new searchable segment. Bulk loading? Set refresh_interval=-1, load, then set it back.
  • Updates are expensive. Prefer immutable event-stream indexing patterns over frequent POST _update.
  • Force-merge with care. Running _forcemerge with max_num_segments=1 only makes sense for indices that no longer receive writes. On live indices, background merges keep the segment count in check on their own.
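The bulk-loading advice can be wrapped in a small helper so the setting is always restored, even if the load fails. This is a sketch under assumptions: `send(method, path, body)` is a hypothetical transport callable you would back with your own HTTP client, and the restore value of "1s" assumes the index was using the default interval. The `PUT /<index>/_settings` path and the `index.refresh_interval` setting are the real REST API.

```python
from contextlib import contextmanager


@contextmanager
def paused_refresh(send, index, restore="1s"):
    """Disable refresh for the duration of a bulk load, then restore it.

    `send` is a hypothetical callable: send(method, path, body).
    """
    path = f"/{index}/_settings"
    send("PUT", path, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Runs even if the load raises, so the index is never left unrefreshed.
        send("PUT", path, {"index": {"refresh_interval": restore}})


# Usage with a fake transport that only records the calls it receives.
calls = []


def fake_send(method, path, body):
    calls.append((method, path, body))


with paused_refresh(fake_send, "logs"):
    pass  # run the bulk load here

for call in calls:
    print(call)
```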

Mapping matters more than you think

A text field is analyzed (tokenized). A keyword field is not: it is indexed verbatim as a single term and used for exact match, aggregations, and sorting. You almost always want both via a multi-field:

{
  "status": {
    "type": "text",
    "fields": { "raw": { "type": "keyword" } }
  }
}
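As an illustration of how the two sub-fields get used, here are request bodies written as Python dicts. The values ("shipped", "Shipped") and the aggregation name `by_status` are made up for the example; the `match`, `term`, and `terms` aggregation shapes are standard query DSL.

```python
import json

# The mapping fragment above, wrapped the way an index-creation body expects it.
mapping = {
    "mappings": {
        "properties": {
            "status": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Full-text search goes to the analyzed field.
full_text_query = {"query": {"match": {"status": "shipped"}}}

# Exact match goes to the keyword sub-field (case-sensitive, no analysis).
exact_query = {"query": {"term": {"status.raw": "Shipped"}}}

# Aggregations and sorting also use the keyword sub-field.
count_by_status = {
    "size": 0,
    "aggs": {"by_status": {"terms": {"field": "status.raw"}}},
}

print(json.dumps(exact_query))
```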

Get the mapping wrong on day one and you'll be reindexing on day 90.


Reader Discussion

7 replies

  1. Highlighted by author
    Raghav Sharma · Search Engineer · Flipkart · From experience

    rescore vs raw function_score is the single biggest relevance win I've shipped this year. +0.08 NDCG@10 on product search, zero recall regression, two weeks of work. If you have a serious search funnel and you're not rescoring, you're leaving money on the table.

    Aug 01, 2025·2 days later
  2. Nora Eriksen · Search Platform · From experience

    shard sizing rule of thumb people forget: 30-50GB per primary shard. We had 800 tiny shards on a 12-node cluster — heap was constantly under pressure and nobody knew why. consolidating to ~80 shards fixed it overnight.

    Aug 03, 2025·4 days later
  3. Pierre Lambert · Senior Engineer · Agrees

    people obsess over the query and ignore the analyzer. half the relevance bugs I've debugged were because the index analyzer and the search analyzer disagreed. boring fix, huge wins.

    Aug 05, 2025·6 days later
  4. Rashida Hassan · ML Eng · Asks

    would love a follow-up on hybrid bm25 + dense vector search in 8.x. we A/B'd it last quarter and the BM25 head still wins on long-tail queries by a surprising margin.

    Aug 06, 2025·1 week later
  5. Olivia Bennett · Data Engineer · Story

    Mapping explosion off untrusted JSON cost us a 2-billion-doc reindex last year. Six engineers, two weekends, one director apology email. "Never ingest untrusted JSON" should be on the office wall in 72pt.

    Aug 03, 2025·4 days later
  6. Léa Dubois · SRE · Asks

    any chance you'd publish these as a PDF collection? would love to print and read offline on flights. screen-fatigue is real.

    Aug 05, 2025·6 days later
  7. Ahmed Rahman · Full Stack · Kind words

    concise + opinionated = my favourite kind of engineering post. so many blogs hedge every claim into mush. give me the spicy take with the receipts. more please.

    Jul 31, 2025·1 day later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.
