RAG Without the Hallucinations: Retrieval Patterns That Hold Up

Retrieval-augmented generation is only as good as what you retrieve. The model isn't the hard part — chunking, hybrid search, and reranking are.

May 22, 2026 · 11 min read · AI · System Design

Most "the LLM hallucinated" complaints in a RAG system aren't model problems; they're retrieval problems. If the right chunk never makes it into the context window, no amount of prompt engineering saves you. Most of the interesting engineering in RAG lives in the retrieval pipeline.

1. Chunking decides your ceiling

Embed chunks that are too large and a single vector blurs many ideas together; make them too small and you lose the context that makes a passage meaningful. Start by chunking on structural boundaries (headings, paragraphs) rather than with a blind 512-token split, and keep a small overlap so a fact that straddles a boundary survives.

chunk = { id, text, embedding, metadata: { source, heading, doc_id } }
// overlap ~10-15% so boundary-spanning facts aren't cut in half
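A minimal sketch of structure-aware chunking with overlap, assuming paragraphs are separated by blank lines and that word counts are an acceptable stand-in for tokens. The names `Chunk`, `chunk_paragraphs`, `max_words`, and `overlap_ratio` are illustrative, not from any library, and embedding is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: int
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_paragraphs(text, max_words=200, overlap_ratio=0.12, source="unknown"):
    """Pack whole paragraphs into chunks of roughly max_words words.

    The last `overlap` words of each chunk are repeated at the start of the
    next one, so a fact that straddles a boundary appears intact in one of them.
    A single paragraph longer than max_words is kept whole (assumption).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    overlap = int(max_words * overlap_ratio)
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Flush before adding a paragraph that would push us past the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(Chunk(len(chunks), " ".join(current), {"source": source}))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(Chunk(len(chunks), " ".join(current), {"source": source}))
    return chunks
```

In a real pipeline you would likely count tokens with the embedding model's tokenizer and record the nearest heading in `metadata`; both are omitted here to keep the sketch short.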

2. Pure vector search isn't enough

Dense embeddings are good at meaning and bad at exact tokens such as product codes, error names, and rare acronyms. Keyword search (BM25) is the opposite. Hybrid search runs both and fuses the rankings, usually with Reciprocal Rank Fusion (RRF).

score(doc) = Σ_i  1 / (k + rank_i(doc))     // RRF across vector + BM25 lists
// rank_i(doc) is the 1-based position of doc in list i; lists that omit doc contribute 0
// k≈60; no score normalisation needed
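The fusion formula above can be sketched in a few lines of Python. This assumes each retriever returns a list of document ids ordered best first; the function name and the sample ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids (best first) into one ranking.

    Each list contributes 1 / (k + rank) for every doc it contains, with
    rank starting at 1. Docs missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical results from the two retrievers.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Because only ranks are used, the raw cosine similarities and BM25 scores never have to be put on a common scale, which is why documents found by both retrievers (`d1`, `d3`) rise to the top.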

3. Retrieve wide, then rerank

Vector + BM25 give you recall; a cross-encoder reranker gives you precision. Pull the top 50 candidates, then have a reranker score each (query, chunk) pair directly and keep the top 5. This two-stage funnel is often the single biggest quality lever in a RAG system.

candidates = hybrid_search(query, k=50)
reranked   = cross_encoder.rank(query, candidates)
context    = reranked[:5]
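A hedged sketch of the second stage, with the scorer passed in as a plain callable so the funnel logic can be tested without downloading a model. `rerank`, `score_pairs`, and `toy_overlap_scorer` are hypothetical names; the toy scorer only counts shared words and stands in for a real cross-encoder.

```python
def rerank(query, candidates, score_pairs, top_n=5):
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks.

    score_pairs takes a list of (query, chunk_text) tuples and returns one
    relevance score per pair, higher meaning more relevant.
    """
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def toy_overlap_scorer(pairs):
    # Stand-in for a cross-encoder: number of lowercase words shared by
    # the query and the chunk. Replace with a real model in practice.
    return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]


candidates = [
    "reset your password in settings",
    "billing cycles explained",
    "password reset email not arriving",
]
context = rerank("password reset email", candidates, toy_overlap_scorer, top_n=2)
```

With a real model, `score_pairs` could wrap something like the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of text pairs; that wiring is an assumption about your stack, not part of the sketch.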

4. Give the model an exit

Tell the model it's allowed to say "I don't know," and make it cite. A prompt that demands an answer from weak context manufactures one. Citations also give you a cheap eval signal: if the cited chunk doesn't contain the claim, you caught a hallucination.

System: Answer ONLY from the context. Cite chunk ids like [3].
If the context doesn't contain the answer, say you don't know.
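The citation check mentioned above can be approximated cheaply. The sketch below assumes citations look like `[3]` and sit inside the sentence they support, and it uses word overlap as a crude proxy for "the chunk contains the claim". It is a heuristic filter, not an entailment check, and `check_citations` and `min_overlap` are illustrative names.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
WORD = re.compile(r"[a-z0-9]+")


def check_citations(answer, chunks, min_overlap=0.5):
    """Return (sentence, chunk_id, reason) for citations that look unsupported.

    chunks maps chunk id (int) to chunk text.
    """
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        ids = [int(m) for m in CITATION.findall(sentence)]
        # Words of the claim itself, with the citation markers removed.
        claim_words = set(WORD.findall(CITATION.sub("", sentence).lower()))
        for chunk_id in ids:
            if chunk_id not in chunks:
                problems.append((sentence, chunk_id, "unknown chunk id"))
                continue
            chunk_words = set(WORD.findall(chunks[chunk_id].lower()))
            overlap = len(claim_words & chunk_words) / max(len(claim_words), 1)
            if overlap < min_overlap:
                problems.append((sentence, chunk_id, "low overlap with cited chunk"))
    return problems
```

Flagged sentences are candidates for review or for a stronger check (for example an LLM judge); paraphrased but correct claims can fall below the overlap threshold, so treat the output as a signal rather than a verdict.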

5. Evaluate retrieval separately from generation

Split your metrics. Measure retrieval with hit-rate and MRR on a labelled question→chunk set. Measure generation with faithfulness (does the answer follow from the context) and answer-relevance. When quality drops, you'll know which half to fix instead of randomly swapping models.
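Both retrieval metrics are simple to compute once you have a labelled set. A minimal sketch, assuming `results` maps each question to the retrieved chunk ids (best first) and `relevant` maps each question to the set of chunk ids labelled as correct; the function names and sample data are made up.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Fraction of questions with at least one relevant chunk in the top k."""
    hits = sum(1 for q, ids in results.items() if set(ids[:k]) & relevant[q])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the first relevant chunk (0 if none retrieved)."""
    total = 0.0
    for q, ids in results.items():
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)


# Hypothetical labelled set: q1 hits at rank 1, q2 at rank 2, q3 misses.
results = {
    "q1": ["c1", "c2", "c3"],
    "q2": ["c9", "c4", "c5"],
    "q3": ["c7", "c8", "c6"],
}
relevant = {"q1": {"c1"}, "q2": {"c4"}, "q3": {"c0"}}

hit = hit_rate_at_k(results, relevant, k=2)  # 2 of 3 questions hit
mrr = mean_reciprocal_rank(results, relevant)  # (1 + 1/2 + 0) / 3
```

Hit-rate tells you whether the right chunk is reachable at all; MRR tells you how far down it sits, which matters when only the top few chunks survive reranking.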

Rules of thumb

  • Spend your time on chunking and retrieval before you touch the model or the prompt.
  • Always go hybrid — dense for meaning, sparse for exact tokens.
  • Retrieve ~50, rerank to ~5. The reranker is where precision comes from.
  • Force citations and allow "I don't know." Both turn hallucinations into detectable failures.
  • Track retrieval metrics independently from generation metrics, or you're debugging blind.

Reader Discussion

6 replies

  1. Sania Patel · ML Engineer · Agrees (highlighted by author)

    "temperature is a flatness dial, not a creativity dial" — saving this. so many product folks ask me to "make the AI more creative" by cranking temperature and i can finally point them at a paragraph instead of a 20-min explainer.

    May 24, 2026 · 2 days later
  2. Felipe Castro · Backend · Asks

    Q on reasoning budgets — for a customer support chatbot, do you ever turn reasoning on, or is it always off? we A/B'd it and quality was identical at 4x the cost.

    May 29, 2026 · 1 week later
  3. Jorge Ramírez · Senior SWE · AI infra · From experience

    prompt caching reordering is THE highest-leverage change in any LLM app. moved our system prompt + tool defs to the front, kept dynamic stuff at the end — 73% cost reduction overnight. zero quality change. shipped on a Tuesday.

    May 25, 2026 · 3 days later
  4. Mai Đỗ 🇻🇳 SG · AI Engineer · Agrees

    the "context is a budget, not a buffet" passage matches what I find myself explaining to new devs every week. they stuff the full doc into the context and then complain "why is it so slow." RAG + summary almost always wins.

    May 26, 2026 · 4 days later
  5. Tobias Eriksson · Research Engineer · Pushback

    tiny addition — needle-in-a-haystack benchmarks are increasingly gamed by post-training. real-world long-context perf on multi-fact retrieval is still mediocre. trust evals on YOUR data, not vendor blog posts. (otherwise spot on.)

    May 28, 2026 · 6 days later
  6. Rachel Gold · Staff SRE · Agrees

    the on-call framing throughout this piece is what makes it land. too many infra articles assume you never get paged. those are written by people who never got paged.

    May 25, 2026 · 3 days later

Worked on something similar? Email ducminhldm@gmail.com — I read every one. The good ones become future posts.

Comments seeded · live discussion via email