Scoring & Relevance: BM25, Boosting, and Rescoring That Works
Default BM25 is a strong baseline. Here's how to tune it, combine it with business signals, and not destroy recall.
Elasticsearch switched its default similarity from TF-IDF to BM25 in 5.0. BM25 is a better default because term frequency saturates: a document that mentions a word 50 times isn't 50× more relevant than one that mentions it once.
The BM25 formula in one line
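score = IDF · tf · (k1 + 1) / (tf + k1 · (1 − b + b · dl / avgdl))
This is the contribution of a single query term: tf is the term's frequency in the field, dl is the field's length in terms, and avgdl is the average length of that field. Per-term scores are summed over the query.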
Two knobs you can actually tune:
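k1 (default 1.2): how quickly TF saturates. Higher → saturation is slower, so repeated occurrences of a term keep adding score for longer.
b (default 0.75): length normalisation, from 0 (off) to 1 (full). Lower (e.g. 0.3) is usually better for short fields like titles, where length carries little signal.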
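To get a feel for the two knobs, here is a minimal Python sketch of the one-line formula above. The function name `bm25_term_score` and the example lengths are made up for illustration, and IDF is fixed at 1.0 so only the TF and length effects show. It is a toy calculator, not Lucene's implementation.

```python
def bm25_term_score(tf, dl, avgdl, idf=1.0, k1=1.2, b=0.75):
    """Contribution of one query term to one document's score."""
    length_norm = 1 - b + b * dl / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: in an average-length document, 50 occurrences score
# only a little more than twice as much as one occurrence.
once = bm25_term_score(tf=1, dl=100, avgdl=100)
fifty = bm25_term_score(tf=50, dl=100, avgdl=100)

# Raising k1 slows saturation, so the same 50 occurrences count for more.
fifty_high_k1 = bm25_term_score(tf=50, dl=100, avgdl=100, k1=3.0)

# Lowering b softens the penalty for a document twice the average length.
long_default_b = bm25_term_score(tf=1, dl=200, avgdl=100, b=0.75)
long_low_b = bm25_term_score(tf=1, dl=200, avgdl=100, b=0.3)

print(f"tf=1: {once:.3f}  tf=50: {fifty:.3f}  tf=50, k1=3: {fifty_high_k1:.3f}")
print(f"long doc, b=0.75: {long_default_b:.3f}  b=0.3: {long_low_b:.3f}")
```

With the defaults the score can never exceed IDF · (k1 + 1), which is the saturation ceiling the formula builds in.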
Per-field similarity
{
  "settings": {
    "index": {
      "similarity": {
        "title_bm25": { "type": "BM25", "k1": 0.9, "b": 0.3 }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "similarity": "title_bm25" }
    }
  }
}
Combining signals without breaking relevance
Business wants "boost recent," "boost popular," "boost our paying tenants." The worst way is to sum everything into a single function_score with hand-tuned multipliers — relevance collapses as the weights drift.
The right pattern is rescore: let BM25 find a candidate set, then re-rank the top N with expensive signals.
{
  "query": { "match": { "body": "observability stack" } },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "functions": [
            { "gauss": { "published_at": { "origin": "now", "scale": "30d" } } },
            { "field_value_factor": { "field": "clicks", "modifier": "log1p", "factor": 0.2 } }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 0.3
    }
  }
}
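The arithmetic behind that request can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not Elasticsearch's code. The hit dictionaries and their `bm25`, `age_days` and `clicks` keys are invented for the example, the gauss decay uses the default decay of 0.5 at one scale from the origin, `log1p` is taken as a base-10 log of (1 + factor · value), and documents outside the window are simply left in first-pass order.

```python
import math

def gauss_decay(distance, scale, decay=0.5):
    """Gaussian decay: 1.0 at the origin, `decay` at `scale` away from it."""
    sigma_sq = -(scale ** 2) / (2 * math.log(decay))
    return math.exp(-(distance ** 2) / (2 * sigma_sq))

def rescore(hits, window_size=100, query_weight=0.7, rescore_query_weight=0.3):
    """Re-rank the top `window_size` hits; the tail keeps its first-pass order."""
    first_pass = sorted(hits, key=lambda h: h["bm25"], reverse=True)
    window, tail = first_pass[:window_size], first_pass[window_size:]
    rescored = []
    for hit in window:
        recency = gauss_decay(hit["age_days"], scale=30)
        popularity = math.log10(1 + 0.2 * hit["clicks"])  # log1p modifier, factor 0.2
        # A function_score with no inner query matches everything with score 1.0;
        # boost_mode "sum" adds the summed function values to it.
        secondary = 1.0 + recency + popularity
        final = query_weight * hit["bm25"] + rescore_query_weight * secondary
        rescored.append({**hit, "final": final})
    rescored.sort(key=lambda h: h["final"], reverse=True)
    return rescored + tail

# Hypothetical first-pass results.
hits = [
    {"id": "a", "bm25": 10.0, "age_days": 365, "clicks": 0},
    {"id": "b", "bm25": 9.5, "age_days": 0, "clicks": 495},
    {"id": "c", "bm25": 2.0, "age_days": 0, "clicks": 495},
]
ranked_ids = [h["id"] for h in rescore(hits)]
print(ranked_ids)  # ['b', 'a', 'c']
```

The fresh, popular document overtakes a stale one that BM25 scored almost the same, but the weak match stays at the bottom. That is the behaviour you want from business signals: they break near-ties instead of overriding text relevance.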
Measuring, not guessing
Use the Ranking Evaluation API with a set of judged queries. Track NDCG@10 over deploys. Anything you ship without a scoreboard will drift.
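If NDCG is new to you, the metric itself is small enough to compute by hand. The sketch below is one common variant, with exponential gain and a log2 rank discount. The function names and the 0 to 3 relevance grades are assumptions for illustration, and the ideal ordering is taken from the returned list only, which is a simplification when judged documents are missing from the results.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )

def ndcg_at_k(grades, k=10):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Judged grades for one query, in the order the engine returned the documents.
perfect = ndcg_at_k([3, 2, 1, 0])
slightly_off = ndcg_at_k([3, 2, 0, 1])
nothing_relevant = ndcg_at_k([0, 0, 0])

print(f"{perfect:.4f} {slightly_off:.4f} {nothing_relevant:.4f}")
```

Averaging this over the judged query set gives the single number to track from deploy to deploy.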
When BM25 isn't enough
For semantic recall (synonyms, paraphrase, multilingual), add a dense vector field and use hybrid search: BM25 + kNN, combined with Reciprocal Rank Fusion. That's a separate article — but the point is: don't try to encode meaning with keyword boosts alone.
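For orientation only, Reciprocal Rank Fusion is simple enough to sketch: each document earns 1 / (k + rank) from every result list it appears in, and the sums decide the final order. The function name, the document ids and the constant k = 60 (a commonly used value) are assumptions for this example, not a description of Elasticsearch internals.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids using reciprocal ranks only."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["a", "b", "c"]  # hypothetical keyword results
knn_ranking = ["c", "a", "d"]   # hypothetical vector results
fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking])
print(fused)  # ['a', 'c', 'b', 'd']
```

Because only ranks are used, the BM25 and kNN scores never need to be put on the same scale, which is a large part of why this fusion method is popular for hybrid search.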