RAG Without the Hallucinations: Retrieval Patterns That Hold Up
Retrieval-augmented generation is only as good as what you retrieve. The model isn't the hard part — chunking, hybrid search, and reranking are.
Most "the LLM hallucinated" complaints in a RAG system aren't model problems — they're retrieval problems. If the right chunk never makes it into the context window, no amount of prompt engineering saves you. The interesting engineering in RAG lives entirely in the retrieval pipeline.
1. Chunking decides your ceiling
Embed too-large chunks and a single vector blurs many ideas together; too-small and you lose the context that makes a passage meaningful. Start with semantic chunking on structural boundaries (headings, paragraphs) rather than a blind 512-token split, and keep a small overlap so a fact that straddles a boundary survives.
chunk = { id, text, embedding, metadata: { source, heading, doc_id } }
// overlap ~10-15% so boundary-spanning facts aren't cut in half
2. Pure vector search isn't enough
Dense embeddings are great at meaning, bad at exact tokens — product codes, error names, rare acronyms. Keyword search (BM25) is the opposite. Hybrid search runs both and fuses the rankings, usually with Reciprocal Rank Fusion.
score(doc) = Σ 1 / (k + rank_i(doc)) // RRF across vector + BM25 lists
// k≈60; no score normalisation needed
3. Retrieve wide, then rerank
Vector + BM25 give you recall; a cross-encoder reranker gives you precision. Pull the top 50 candidates, then have a reranker score each (query, chunk) pair directly and keep the top 5. This two-stage funnel is the single biggest quality lever in most RAG systems.
candidates = hybrid_search(query, k=50)
reranked = cross_encoder.rank(query, candidates)
context = reranked[:5]
4. Give the model an exit
Tell the model it's allowed to say "I don't know," and make it cite. A prompt that demands an answer from weak context manufactures one. Citations also give you a cheap eval signal: if the cited chunk doesn't contain the claim, you caught a hallucination.
System: Answer ONLY from the context. Cite chunk ids like [3].
If the context doesn't contain the answer, say you don't know.
5. Evaluate retrieval separately from generation
Split your metrics. Measure retrieval with hit-rate and MRR on a labelled question→chunk set. Measure generation with faithfulness (does the answer follow from the context) and answer-relevance. When quality drops, you'll know which half to fix instead of randomly swapping models.
Rules of thumb
- Spend your time on chunking and retrieval before you touch the model or the prompt.
- Always go hybrid — dense for meaning, sparse for exact tokens.
- Retrieve ~50, rerank to ~5. The reranker is where precision comes from.
- Force citations and allow "I don't know." Both turn hallucinations into detectable failures.
- Track retrieval metrics independently from generation metrics, or you're debugging blind.