Retrieval-augmented generation is the most common way enterprises put their own knowledge behind an LLM. It is also the most commonly broken. A demo that answers three curated questions perfectly can collapse in production — confidently citing the wrong policy, blending two contracts, or inventing a clause that doesn't exist. The model isn't lying; the retrieval layer is feeding it garbage.
Why naive RAG fails
The default tutorial pipeline — dump documents, split into fixed-size chunks, embed, fetch top-k by cosine similarity — works just well enough to demo and just badly enough to erode trust. The failure modes are predictable:
- Chunk boundaries cut meaning in half: A 500-token window splits a table from its header or a clause from its definition.
- Semantic search misses exact terms: Vector similarity is great for concepts and terrible for part numbers, statute references, and acronyms.
- Top-k is a blunt instrument: The right answer might be the 7th result, but you only passed the top 4.
- No grounding check: Nothing verifies that the generated answer is actually supported by the retrieved text.
The architecture that holds up
Trustworthy RAG is less about a clever model and more about a disciplined retrieval stack.
1. Chunk with structure, not character counts
Split on semantic boundaries — headings, sections, table rows — and attach metadata (source, section, date) to every chunk. Keep a small overlap so context isn't severed at the seam.
2. Hybrid search beats pure vectors
Combine dense vector search with keyword/BM25 search. The vector side catches paraphrase and concept; the keyword side catches the exact SKU or section number. Fuse the results.
3. Re-rank before you generate
Pull a wide candidate set (say, top 30), then use a cross-encoder re-ranker to score each candidate against the actual query. Pass only the genuinely relevant chunks to the model. This single step removes most hallucinations.
4. Force citations and verify grounding
Require the model to cite the chunk IDs it used, then programmatically check that the claims trace back to retrieved text. If support is missing, return "I don't have that information" instead of guessing. A system that knows when to stay silent is worth more than one that's confidently wrong.
Measure retrieval, not vibes
You cannot improve what you don't measure. Build a small evaluation set of real questions with known-correct source passages, then track retrieval recall (did we fetch the right chunk?) separately from answer faithfulness (did the model stick to it?). Most teams obsess over the model and ignore that 80% of their failures are retrieval misses.
Keep sensitive knowledge in-house
RAG means your most valuable documents — contracts, patient records, internal playbooks — are being embedded and queried constantly. Running the embedding model and vector store on infrastructure you control keeps that corpus inside your perimeter, which is exactly where regulated data needs to stay.
The takeaway
Good RAG is engineering, not magic. Structured chunking, hybrid retrieval, re-ranking, and grounding checks turn a flaky demo into a system your team will actually trust. The model gets the credit, but the retrieval layer does the work.
We design grounded retrieval systems on private infrastructure — so the answers are accurate, cited, and your data never leaves the building.
Stay ahead of the curve
Get our next deep-dive in your inbox