RAG Without the Hallucinations: Retrieval Patterns That Hold Up

Retrieval-augmented generation is only as good as what you retrieve. The model isn't the hard part — chunking, hybrid search, and reranking are.

May 22, 2026 · 11 min read · AI · System Design

Most "the LLM hallucinated" complaints in a RAG system aren't model problems; they're retrieval problems. If the right chunk never makes it into the context window, no amount of prompt engineering saves you. Most of the interesting engineering in RAG lives in the retrieval pipeline.

1. Chunking decides your ceiling

Embed chunks that are too large and a single vector blurs many ideas together; make them too small and you lose the context that makes a passage meaningful. Start with semantic chunking on structural boundaries (headings, paragraphs) rather than a blind 512-token split, and keep a small overlap so a fact that straddles a boundary survives.

chunk = { id, text, embedding, metadata: { source, heading, doc_id } }
// overlap ~10-15% so boundary-spanning facts aren't cut in half
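
To make the idea concrete, here is a minimal sketch of boundary-aware chunking. It assumes paragraphs are separated by blank lines and uses whitespace-separated words as a rough stand-in for tokens; the function name and its defaults are illustrative, not taken from any library. It only produces the chunk text, so attaching ids, embeddings, and metadata is left out.

```python
import re


def chunk_paragraphs(text: str, max_words: int = 200, overlap_words: int = 25) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_words words.

    Assumption: words approximate tokens. A single paragraph longer than
    max_words is kept whole rather than split mid-paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for para in paragraphs:
        words = para.split()
        # Close the chunk at a paragraph boundary once adding more would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail forward so boundary-spanning facts appear in both chunks.
            current = current[-overlap_words:] if overlap_words > 0 else []
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the defaults, the overlap is 25 of 200 words (12.5%), inside the 10-15% range above. A production chunker would likely count real tokens and also split on headings.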

2. Pure vector search isn't enough

Dense embeddings are great at meaning, bad at exact tokens — product codes, error names, rare acronyms. Keyword search (BM25) is the opposite. Hybrid search runs both and fuses the rankings, usually with Reciprocal Rank Fusion.

score(doc) = Σ_i  1 / (k + rank_i(doc))     // RRF: sum over each ranked list i (vector, BM25)
// rank is 1-based; k≈60 is a common default; ranks are fused, so no score normalisation is needed
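
The fusion step is short enough to write out. The sketch below assumes each retriever returns an ordered list of document ids, best first; the function name is hypothetical. A document missing from one list simply contributes nothing from that list.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion.

    Each list adds 1 / (k + rank) to a document's score, with rank starting at 1.
    Returns (doc_id, score) pairs sorted by fused score, highest first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["a", "b", "c"]  # illustrative ids from dense search
bm25_hits = ["b", "d", "a"]    # illustrative ids from keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("a" and "b") outrank those found by only one retriever, which is the behaviour you want from hybrid search.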

3. Retrieve wide, then rerank

Vector + BM25 give you recall; a cross-encoder reranker gives you precision. Pull the top 50 candidates, then have a reranker score each (query, chunk) pair directly and keep the top 5. This two-stage funnel is often the biggest quality lever in a RAG system, because the reranker reads the query and the chunk together instead of comparing two independently computed vectors.

candidates = hybrid_search(query, k=50)
reranked   = cross_encoder.rank(query, candidates)
context    = reranked[:5]
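
Here is one way the second stage could look in Python. To stay self-contained, this sketch takes the pair scorer as a plain function; in a real system `score_pair` would wrap a cross-encoder model call. The `token_overlap` scorer is a toy stand-in used only so the example runs, and all names here are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[dict],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[dict]:
    """Score every (query, chunk text) pair and keep the top_n chunks."""
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    # Sort on the score only; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query tokens found in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


candidates = [
    {"id": 1, "text": "how to reset a password"},
    {"id": 2, "text": "billing cycle dates"},
    {"id": 3, "text": "reset password via email link"},
]
context = rerank("reset password email", candidates, token_overlap, top_n=2)
print([c["id"] for c in context])  # [3, 1]
```

Keeping the scorer pluggable also makes it easy to compare rerankers on the same candidate set.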

4. Give the model an exit

Tell the model it's allowed to say "I don't know," and make it cite. A prompt that demands an answer from weak context manufactures one. Citations also give you a cheap eval signal: if the cited chunk doesn't contain the claim, you caught a hallucination.

System: Answer ONLY from the context. Cite chunk ids like [3].
If the context doesn't contain the answer, say you don't know.
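
Citations are only useful if you check them. The sketch below covers the cheap structural half of that check: it labels context chunks with their ids, extracts `[n]` citations from an answer, and flags ids that were never in the context as well as answers with no citation at all. Verifying that a cited chunk actually supports the claim needs a separate step (for example a human or model-based judge) and is not shown. The function names and the chunk shape (`id`, `text`) are assumptions for illustration.

```python
import re


def build_context(chunks: list[dict]) -> str:
    """Render chunks as '[id] text' lines so the model can cite them by id."""
    return "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)


def cited_ids(answer: str) -> set[int]:
    """Extract numeric citations such as [3] from an answer."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}


def check_citations(answer: str, chunks: list[dict]) -> dict:
    """Compare cited ids against the ids that were actually in the context."""
    known = {c["id"] for c in chunks}
    cited = cited_ids(answer)
    return {
        "cited": sorted(cited & known),    # citations that point at real chunks
        "unknown": sorted(cited - known),  # ids that were never in the context
        "uncited": not cited,              # True if the answer cites nothing
    }


chunks = [
    {"id": 1, "text": "Password resets are sent by email."},
    {"id": 3, "text": "Reset links expire after 24 hours."},
]
answer = "Reset links expire after 24 hours [3]. Admins can extend them [7]."
print(check_citations(answer, chunks))
# {'cited': [3], 'unknown': [7], 'uncited': False}
```

An answer that cites an unknown id, or cites nothing while still asserting facts, is a good candidate for manual review.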

5. Evaluate retrieval separately from generation

Split your metrics. Measure retrieval with hit-rate (did the labelled chunk appear in the top k?) and MRR (mean reciprocal rank: how high did it appear?) on a labelled question→chunk set. Measure generation with faithfulness (does the answer follow from the context) and answer-relevance. When quality drops, you'll know which half to fix instead of randomly swapping models.
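
Both retrieval metrics fit in a few lines. This sketch assumes the simplest labelling scheme, one gold chunk id per question, and takes the retriever's ranked ids for each question as input; the function names are hypothetical.

```python
def hit_rate_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of questions whose gold chunk appears in the top k results."""
    if not gold:
        return 0.0
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(results: list[list[str]], gold: list[str]) -> float:
    """Average of 1 / rank of the gold chunk; a miss contributes 0."""
    if not gold:
        return 0.0
    total = 0.0
    for ranked, g in zip(results, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)  # ranks are 1-based
    return total / len(gold)


# Illustrative data: ranked chunk ids per question, and the labelled gold chunk.
results = [["c1", "c2", "c3"], ["c9", "c4", "c5"], ["c7", "c8", "c6"]]
gold = ["c1", "c4", "c0"]
print(hit_rate_at_k(results, gold, k=2))    # 2 of 3 questions hit
print(mean_reciprocal_rank(results, gold))  # (1 + 1/2 + 0) / 3 = 0.5
```

Run these on the output of the hybrid search stage and again after reranking; the gap between the two shows how much the reranker is contributing.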

Rules of thumb

Spend your time on chunking and retrieval before you touch the model or the prompt.
Always go hybrid — dense for meaning, sparse for exact tokens.
Retrieve ~50, rerank to ~5. The reranker is where precision comes from.
Force citations and allow "I don't know." Both turn hallucinations into detectable failures.
Track retrieval metrics independently from generation metrics, or you're debugging blind.
