Inverted Index 101: How Elasticsearch Actually Finds Things
From document to token to posting list — the data structure behind sub-millisecond full-text search.
Everyone says Elasticsearch is fast. The reason is not magic — it's a data structure called the inverted index, built incrementally by Lucene under every index you create.
From text to tokens
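When you index a document, the analyzer breaks each text field into tokens. With the built-in english analyzer, "The quick brown foxes!" becomes [quick, brown, fox]: lowercased, punctuation stripped, the stop word "the" removed, and "foxes" stemmed to "fox". The default standard analyzer only lowercases and splits on word boundaries, so it would keep both "the" and "foxes".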
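To make the analysis chain concrete, here is a minimal Python sketch of the four steps. It is a toy, not Lucene's implementation: the stop-word list and the `toy_stem` suffix rule are assumptions made up for this example and stand in for the real stop-word filter and stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real analyzers ship longer ones).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def toy_stem(token: str) -> str:
    """Naive plural stripping; a stand-in for a real stemmer."""
    # "foxes" -> "fox", "boxes" -> "box"
    if token.endswith("es") and len(token) > 4 and token[-3] in "sxz":
        return token[:-2]
    # "dogs" -> "dog", but leave "glass" alone
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text: str) -> list[str]:
    # 1. Tokenize on runs of letters and digits, which also drops punctuation.
    raw_tokens = re.findall(r"[A-Za-z0-9]+", text)
    # 2. Lowercase every token.
    lowered = [t.lower() for t in raw_tokens]
    # 3. Remove stop words.
    kept = [t for t in lowered if t not in STOP_WORDS]
    # 4. Stem what is left.
    return [toy_stem(t) for t in kept]


print(analyze("The quick brown foxes!"))  # ['quick', 'brown', 'fox']
```

The same function has to run on the query string too. If the index holds "fox" and the query asks for "foxes" unanalyzed, the lookup misses.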
The posting list
Lucene then flips the map around. Instead of doc → words, it keeps word → docs:
fox -> [doc1, doc7, doc42]
brown -> [doc1, doc4, doc42, doc99]
quick -> [doc1, doc7, doc12, doc42]
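A query that requires all three terms (quick AND brown AND fox) intersects three sorted lists. The cost depends on the length of those lists, not on the total number of documents in the corpus.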
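A small Python sketch of the word → docs map and the intersection step, using the same doc IDs as the lists above. The function names (`build_index`, `intersect`, `search_all`) are made up for this example, and real Lucene posting lists are compressed and use skip data rather than plain Python lists.

```python
from collections import defaultdict


def build_index(docs: dict[int, list[str]]) -> dict[str, list[int]]:
    """Flip doc -> tokens into token -> sorted list of doc IDs."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer intersection of two sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_all(index: dict[str, list[int]], terms: list[str]) -> list[int]:
    """Return the doc IDs that contain every term."""
    if not terms:
        return []
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]  # start with the shortest list to keep work small
    for other in lists[1:]:
        result = intersect(result, other)
    return result


docs = {
    1: ["quick", "brown", "fox"],
    4: ["brown"],
    7: ["quick", "fox"],
    12: ["quick"],
    42: ["quick", "brown", "fox"],
    99: ["brown"],
}
index = build_index(docs)
print(index["fox"])                                  # [1, 7, 42]
print(search_all(index, ["quick", "brown", "fox"]))  # [1, 42]
```

Each pass of `intersect` walks both lists once, so the work is bounded by the combined length of the lists involved.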
Segments are immutable
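Every refresh writes the docs indexed since the previous refresh into a new immutable segment, which is what makes them searchable. A shard is a collection of segments. Lucene never edits a segment in place: a delete only marks the doc as deleted (a tombstone), and an update is a delete plus a re-insert. Periodic merges rewrite smaller segments into larger ones and drop the tombstoned docs along the way.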
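The lifecycle is easier to see in a toy model. The Python sketch below is an illustration only: `ToyShard` and its methods are hypothetical names, segments are plain tuples, and it ignores details such as the translog and the fact that real deletes also become visible only after a refresh.

```python
class ToyShard:
    def __init__(self):
        self.buffer = {}      # doc_id -> source; indexed but not yet searchable
        self.segments = []    # each segment is an immutable tuple of (doc_id, source)
        self.tombstones = []  # one set of deleted doc IDs per segment

    def delete(self, doc_id):
        self.buffer.pop(doc_id, None)
        for segment, dead in zip(self.segments, self.tombstones):
            if any(d == doc_id for d, _ in segment):
                dead.add(doc_id)  # mark only; the segment itself is untouched

    def index(self, doc_id, source):
        # An update is a delete of any older copy plus a fresh insert.
        self.delete(doc_id)
        self.buffer[doc_id] = source

    def refresh(self):
        # Turn the buffered docs into a new searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer.items()))
            self.tombstones.append(set())
            self.buffer = {}

    def live_docs(self):
        return {
            d: s
            for segment, dead in zip(self.segments, self.tombstones)
            for d, s in segment
            if d not in dead
        }

    def stored_count(self):
        # Counts tombstoned docs too: they still occupy space until a merge.
        return sum(len(segment) for segment in self.segments)

    def merge(self):
        # Rewrite all segments into one, dropping tombstoned docs.
        live = tuple(self.live_docs().items())
        self.segments = [live] if live else []
        self.tombstones = [set()] if live else []


shard = ToyShard()
shard.index(1, "v1")
shard.index(2, "v1")
shard.refresh()
shard.index(1, "v2")  # update: tombstones doc 1 in segment 0
shard.refresh()
print(len(shard.segments), shard.stored_count())  # 2 3
shard.merge()
print(len(shard.segments), shard.stored_count())  # 1 2
```

Before the merge the shard stores three copies for two live docs. That overhead is what frequent updates cost you.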
Consequences for you
- Refresh is not free. Each refresh opens a new searchable segment. Bulk loading? Set index.refresh_interval to -1, load, then set it back.
- Updates are expensive. Prefer immutable event-stream indexing patterns over frequent POST _update calls.
- Force-merge with care. Running _forcemerge down to 1 segment is only appropriate for read-only indices. Live indices are compacted by background merges on their own.
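For the bulk-loading case, the sequence of REST calls looks roughly like the Python sketch below. It only builds the requests and does not send them. `bulk_load_plan` is a hypothetical helper, and the `"1s"` restore value is an assumption: use whatever interval your index had before.

```python
import json


def bulk_load_plan(index: str, docs: list[dict], restore_interval: str = "1s"):
    """Return the (method, path, body) requests for a bulk load, in order."""
    # The bulk body is NDJSON: one action line, then one source line, per doc.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    ndjson = "\n".join(lines) + "\n"  # the bulk body must end with a newline

    def settings(value):
        return json.dumps({"index": {"refresh_interval": value}})

    return [
        ("PUT", f"/{index}/_settings", settings("-1")),              # pause refreshes
        ("POST", "/_bulk", ndjson),                                  # load the data
        ("PUT", f"/{index}/_settings", settings(restore_interval)),  # set it back
        ("POST", f"/{index}/_refresh", ""),                          # make docs searchable now
    ]


for method, path, body in bulk_load_plan("logs", [{"msg": "a"}, {"msg": "b"}]):
    print(method, path)
```

A real load would send many bulk requests between the two settings calls, not one.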
Mapping matters more than you think
A text field is analyzed (tokenized). A keyword field is not: the whole value is indexed verbatim as a single term and used for exact match, aggregations, and sorting. You almost always want both via a multi-field:
{
  "status": {
    "type": "text",
    "fields": { "raw": { "type": "keyword" } }
  }
}
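Here is how the two views of the field get used, sketched as Python dicts that mirror the JSON request bodies. The mapping fragment is wrapped in the `mappings.properties` envelope that a create-index request expects. The value "shipped" and the aggregation name `by_status` are placeholders invented for this example.

```python
# Body for creating an index with the multi-field mapping.
create_index_body = {
    "mappings": {
        "properties": {
            "status": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Full-text search goes to the analyzed field, so case and word forms
# are normalized by the analyzer before matching.
full_text_query = {"query": {"match": {"status": "shipped"}}}

# Exact match goes to the keyword sub-field; the value must match verbatim,
# including case.
exact_query = {"query": {"term": {"status.raw": "Shipped"}}}

# Aggregations also use the keyword sub-field.
count_by_status = {
    "size": 0,  # skip the hits, return only the buckets
    "aggs": {"by_status": {"terms": {"field": "status.raw"}}},
}
```

The sub-field is addressed as `status.raw`. The parent name plus the key under `fields` forms the path.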
Get the mapping wrong on day one and you'll be reindexing on day 90.