Sooner or later every client asks the same thing, in slightly different words: "Can we have a ChatGPT that knows our stuff?" Their handbook, their tickets, their product docs, their five years of project files. The honest, un-hyped answer is yes — and the technique has a three-letter name that's collected more buzzwords than it deserves: RAG.

Let's deflate it. RAG, short for retrieval-augmented generation, means: don't ask the model to remember your data; fetch the relevant bits at question time and hand them over with the question. The model doesn't "learn" anything. It reads what you give it and writes an answer grounded in that text. That's the whole idea. Everything else is plumbing, but the plumbing is where projects live or die.
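To make "fetch, then hand over" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the word-overlap retriever is a toy standing in for real search, and generate is a placeholder for whatever model call you use.

```python
from typing import Callable

def words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]

def answer(question: str, documents: list[str], generate: Callable[[str], str]) -> str:
    """Fetch relevant text at question time and hand it to the model with the question."""
    context = "\n".join(retrieve(question, documents))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

# Made-up documents, for illustration only.
docs = [
    "Refunds are accepted within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]
# Stand-in for a real model call: it simply echoes the prompt it was given.
print(answer("How many days do I have for refunds?", docs, generate=lambda p: p))
```

Note that nothing is stored in the model between calls: change a document and the next question sees the new text.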

Why the weekend demo lies

Here's the trap. You wire up a vector database, drop in a hundred documents, ask three questions, and it answers beautifully. You demo it Friday, the client is thrilled, and you quote a launch date. Three months later, real users are asking real questions, recall has quietly collapsed, the assistant confidently invents a refund policy that doesn't exist, and nobody can explain why.

The demo lies because a hundred clean documents and three softball questions hide every problem. Production exposes them all at once: messy PDFs, near-duplicate pages, exact part numbers nobody can spell, questions that span two documents, and a knowledge base that changes under you. RAG that works in production is not a different idea — it's the same idea with the failure modes engineered out.

The retrieval stack that survives contact with users

Generation gets the attention, but the answer is only ever as good as the text you retrieve. Get the wrong chunks and no model on earth saves you — it just hallucinates more fluently. So almost all of the real engineering sits in retrieval. Four moves do most of the work.

1. Chunk for meaning, not for tokens

You can't hand the model whole documents, so you split them into chunks. The naive move — cut every 500 tokens — slices sentences in half and strands a definition three chunks away from the question that needs it. Chunk on structure instead: headings, sections, list boundaries. Keep a chunk about one thing, keep some overlap between neighbours, and store metadata (source, section, date) alongside the text. Oversized chunks dilute the signal with unrelated material; tiny ones lose the context that makes them answerable.
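A rough sketch of structure-first chunking, under some loud assumptions: the input uses markdown-style "#" headings, size is counted in words rather than tokens, and the Chunk fields and default sizes are illustrative choices, not recommendations. Real documents (PDFs, code blocks containing "#") need a proper parser.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str   # file or URL the chunk came from
    section: str  # heading the chunk sits under
    date: str     # document date, useful for filtering and freshness checks

def chunk_by_headings(doc: str, source: str, date: str,
                      max_words: int = 120, overlap: int = 20) -> list[Chunk]:
    """Split on headings first, then window long sections with some overlap."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be smaller than max_words")
    chunks: list[Chunk] = []
    # Split just before every line that starts with '#', so a heading stays with its body.
    for block in re.split(r"(?m)^(?=#)", doc):
        if not block.strip():
            continue
        if block.startswith("#"):
            heading, _, body = block.partition("\n")
            section = heading.lstrip("#").strip()
        else:
            section, body = "", block  # text that appears before the first heading
        tokens = body.split()
        step = max_words - overlap
        for start in range(0, len(tokens), step):
            window = tokens[start:start + max_words]
            chunks.append(Chunk(" ".join(window), source, section, date))
            if start + max_words >= len(tokens):
                break  # this window already reached the end of the section
    return chunks

# Made-up document, with tiny limits so the overlap is visible.
doc = "# Refunds\n" + " ".join(f"w{i}" for i in range(10)) + "\n# Shipping\nShips in two days.\n"
chunks = chunk_by_headings(doc, source="handbook.md", date="2026-01-15", max_words=6, overlap=2)
for c in chunks:
    print(c.section, "|", c.text)
```

A chunk never crosses a heading here, so each one stays about one thing, and neighbours within a section share their boundary words.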

2. Hybrid search, not pure vectors

The default RAG tutorial uses only embeddings — semantic search that matches by meaning. Great for paraphrases, weak for the things businesses actually search for: invoice numbers, SKUs, error codes, surnames, an exact clause. For those you want old-fashioned keyword search — BM25. The production answer isn't "pick one"; it's run both and fuse the results. Hybrid search is the consensus default in 2026 because real questions need both halves.
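The article doesn't say how to fuse the two result lists. One common choice, swapped in here as an assumption, is reciprocal rank fusion (RRF): each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the usual default; the document ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]    # from the embedding search
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF only looks at ranks, so you don't have to reconcile BM25 scores with cosine similarities, which live on different scales. Documents found by both searches rise to the top.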

3. Rerank before you hand off

Hybrid search gets you a shortlist — say the top 50 candidates. They're roughly right, not precisely ordered. A cross-encoder reranker reads each candidate against the actual question and re-sorts by true relevance, so you pass the model the best 5–8 instead of a noisy 50. It costs you something — a reranking pass typically adds tens to a couple hundred milliseconds — but it's the single highest-leverage addition to most pipelines. Less noise in, fewer hallucinations out.
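The shape of the reranking step, sketched with a deliberately fake scorer. overlap_score below is a toy stand-in and not a cross-encoder; in a real pipeline you would pass in a function that calls one (for example a model that scores each question/passage pair). The function names and sample passages are assumptions for illustration.

```python
from typing import Callable

def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every candidate against the question and keep the best top_n."""
    scored = [(score(question, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]

def overlap_score(question: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of question words found in the passage."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

candidates = [
    "Annual plans renew every twelve months",
    "The refund window for annual plans is 30 days",
    "Office hours are nine to five",
]
best = rerank("refund window for annual plans", candidates, overlap_score, top_n=2)
print(best)
```

The point of the design is that the scorer sees the question and the passage together, which is what lets a real cross-encoder judge relevance more precisely than the first-stage search.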

4. Tell each chunk where it came from

A chunk pulled out of a 40-page document loses its context. "The rate increased by 3%": which rate, which year? Anthropic's contextual retrieval technique fixes this cheaply: before indexing, prepend a short, model-written note situating each chunk in its document. In their published benchmarks, contextual embeddings cut the top-20 retrieval failure rate by 35% (from 5.7% to 3.7%); combined with contextual BM25 the reduction was 49% (to 2.9%), and adding reranking on top brought the reduction to 67% (to 1.9%). That's a large quality gain for a cost you pay at indexing time, not on every query.
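A minimal sketch of the indexing-time step. The describe callable is a placeholder for a model call that writes the situating note; describe_stub fakes it by reusing the document's first line, and the sample report is made up. None of this is Anthropic's code, only the shape of the idea.

```python
from typing import Callable

def contextualize(chunk: str, document: str,
                  describe: Callable[[str, str], str]) -> str:
    """Prepend a short note that situates the chunk within its document."""
    context = describe(document, chunk).strip()
    return f"{context}\n\n{chunk}" if context else chunk

def describe_stub(document: str, chunk: str) -> str:
    """Hypothetical stand-in for a model call: uses the first line as the title."""
    title = document.strip().splitlines()[0]
    return f"From: {title}."

document = "ACME Q2 2023 financial report\nRevenue grew. The rate increased by 3%."
chunk = "The rate increased by 3%."
indexed_text = contextualize(chunk, document, describe_stub)
print(indexed_text)
```

The contextualized text is what you would embed and feed to BM25; keeping the original chunk alongside it lets you show users the untouched source wording.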

The answer is only ever as good as the text you retrieve. Most of RAG isn't AI — it's search done properly.

The step everyone skips: evaluation

This is the difference between a demo and a product, and it's the least glamorous part, so it's the first thing to get cut. Don't let it.

Build a golden set: 50 to 200 real question/answer pairs, drawn from actual user traffic, not invented at your desk. Every time you change a chunk size, swap an embedding model, or tweak a prompt, you score against the golden set and see whether quality went up or down — instead of guessing from three questions that happened to work.
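A golden-set run can be very small. The sketch below assumes the crudest possible grader (the expected text must appear in the answer) and two made-up pipelines faked with dictionaries; real setups often use a stricter or model-based grader, so treat the names and the grading rule as assumptions.

```python
from typing import Callable

def score(golden: list[tuple[str, str]], pipeline: Callable[[str], str]) -> float:
    """Fraction of golden questions whose answer contains the expected text."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(
        1 for question, expected in golden
        if expected.lower() in pipeline(question).lower()
    )
    return correct / len(golden)

# Made-up golden pairs; in practice these come from real user traffic.
golden = [
    ("How long is the refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
]
# Two fake pipeline versions, standing in for "before" and "after" a change.
before = {
    "How long is the refund window?": "Refunds are accepted within 30 days.",
    "Which plan includes SSO?": "I don't know.",
}
after = {
    "How long is the refund window?": "Refunds are accepted within 30 days.",
    "Which plan includes SSO?": "SSO is included in the Enterprise plan.",
}
baseline = score(golden, lambda q: before[q])
candidate = score(golden, lambda q: after[q])
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
```

Run it on every change and keep the numbers; a change that drops the score doesn't ship until someone can explain why.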

And measure the two halves separately. When an answer is wrong, there are only two suspects: either retrieval didn't fetch the right chunk (no model can fix that), or it did and generation still got it wrong (a prompt or grounding problem). Conflate them and every bug is a five-hour mystery; separate them and it's a five-minute triage. This split — retrieval quality vs. generation quality — is the most useful instrument you can build.
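The two-suspect triage fits in a few lines. This sketch assumes each golden question is labelled with the id of the chunk that contains its answer, and that you already know whether the final answer was correct; the chunk ids and results are made up.

```python
from collections import Counter

def triage(expected_chunk_id: str, retrieved_ids: list[str], answer_correct: bool) -> str:
    """Blame retrieval if the right chunk was never fetched, generation otherwise."""
    if answer_correct:
        return "ok"
    if expected_chunk_id not in retrieved_ids:
        return "retrieval"
    return "generation"

# (expected chunk, chunks actually retrieved, was the final answer correct?)
results = [
    ("refund-policy#2", ["refund-policy#2", "faq#1"], True),
    ("sla#4", ["faq#1", "pricing#3"], False),
    ("pricing#3", ["pricing#3", "faq#7"], False),
]
report = Counter(triage(*row) for row in results)
print(report)
```

A report dominated by "retrieval" sends you back to chunking and search; one dominated by "generation" sends you to the prompt.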

Keeping it honest in the EU

One thing the tutorials never mention: your search index is a copy of the client's data, sitting in a new place. That has consequences you own.

  • Access control follows the data. If a user can't see a document in the source system, retrieval must not surface its text either. Filter the index by the user's permissions — a copilot that leaks HR files across departments is a breach, not a feature.
  • Residency is a hosting decision. If the data has to stay in the EU, the embedding model, the vector store and the generation model all need to be in an EU region — or on the client's own infrastructure. Decide it before the first prompt, not after the pilot.
  • The GDPR still applies. PII in the documents is PII in the index. Lawful basis, retention and a data subject's right to deletion all reach into your retrieval store too.
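The access-control rule above can be sketched as a filter on retrieved chunks. It assumes each chunk was indexed with the set of groups allowed to read its source document; the class, field names, and sample data are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexedChunk:
    text: str
    allowed_groups: frozenset[str]  # copied from the source system at index time

def visible_to(user_groups: set[str], chunks: list[IndexedChunk]) -> list[IndexedChunk]:
    """Keep only chunks whose source document the user is allowed to read."""
    return [c for c in chunks if c.allowed_groups & user_groups]

chunks = [
    IndexedChunk("Holiday policy: 25 days per year.", frozenset({"all-staff"})),
    IndexedChunk("Salary bands for 2026.", frozenset({"hr"})),
]
visible = visible_to({"all-staff", "engineering"}, chunks)
print([c.text for c in visible])
```

In a real system it is usually safer to push this check into the search query itself as a metadata filter, so restricted text never leaves the store. Permissions also change, so the stored groups need refreshing along with the documents.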

None of this is exotic; it's the same discipline as any data project. We walk through it on our EU Ready page, and the model-hosting half of the decision is in our piece on picking the model stack for EU clients.

The boring stack we actually ship

Stripped of hype, a production copilot on company data is a short, dull list — and dull is the point:

  1. Structure-aware chunking with metadata and overlap.
  2. Hybrid retrieval — embeddings + BM25 — fused into one shortlist.
  3. A cross-encoder reranker down to the best handful of chunks.
  4. Contextual snippets so each chunk knows where it came from.
  5. A grounded prompt that cites its sources and is allowed to say "I don't know".
  6. A golden eval set and separate retrieval/generation scoring on every change.
  7. Re-indexing on document change, and access filtering on every query.
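Item 5 is the only part of the list that is pure prompt text, so here is one possible shape for it. The wording of the instructions and the function name are assumptions; tune both against your golden set.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Build a prompt from (source label, text) pairs with numbered, citable sources."""
    sources = "\n".join(
        f"[{i}] ({label}) {text}"
        for i, (label, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below.\n"
        "Cite the sources you used, like [1].\n"
        'If the sources do not contain the answer, reply "I don\'t know".\n\n'
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "How long is the refund window?",
    [("handbook.md, Refunds", "Refunds are accepted within 30 days of purchase.")],
)
print(prompt)
```

Numbering the sources gives the model something concrete to cite, and gives you something to check: a citation to [3] when only two sources were supplied is an easy bug to catch.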

No GraphRAG, no agent swarm, no eight-week research project — those have their place, but they're rarely the bottleneck. The bottleneck is search done properly and measured honestly. That's a build we can ship in weeks, white-label, under your name.

FAQ

Do we need to fine-tune the model on the client's data?

Almost never. Fine-tuning mainly teaches a model style and behaviour; it is an unreliable way to teach facts, and whatever it does absorb stays baked in until the next training run, so it goes stale and can't be selectively un-learned. For "answer questions on our documents", retrieval is the right tool: you fetch the relevant text at query time, so updates are just a re-index.

Pure vector search felt fine in the demo. Why add keyword search?

Embeddings are great at meaning but weak at exact tokens — product codes, error numbers, names, acronyms. BM25 keyword search nails those and misses the paraphrases embeddings catch. Running both and fusing the results (hybrid search) is the production default, because real questions need both.

How do we know it's actually working and not quietly degrading?

Build a golden set of 50–200 real question/answer pairs from actual user traffic and score every change against it. Measure retrieval and generation separately: if the right chunk wasn't fetched, no model fixes it; if it was fetched and the answer is still wrong, that's a prompt or grounding problem. Without this you're flying blind.

Is our company data safe if we use RAG?

Only if you design for it. Your index is a copy of the source data, so it inherits the same access rules — filter retrieval by the user's permissions, host the index and model in an EU region when residency matters, and keep PII handling under the GDPR. RAG doesn't change your data-protection duties; it moves the data somewhere new.

Why does the answer quote the old text of a document that was updated last month?

Stale index. The source changed but the index never re-processed it, so retrieval keeps returning the old text and the model answers confidently from it. Re-indexing on document change, rather than relying on a slow nightly cron alone, is part of the build, not an afterthought.
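One simple way to detect what needs re-indexing, sketched under the assumption that you store a content hash per document at index time. The function names and sample documents are made up; a real build would trigger this from change events in the source system where it can.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used to tell whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current: dict[str, str], indexed: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare live documents (id -> text) with stored fingerprints (id -> hash).

    Returns (ids to index or re-index, ids to delete from the index).
    """
    stale = sorted(d for d, text in current.items() if indexed.get(d) != fingerprint(text))
    removed = sorted(d for d in indexed if d not in current)
    return stale, removed

current = {"handbook": "Refund window: 30 days.", "pricing": "Pro plan: 49 EUR."}
indexed = {
    "handbook": fingerprint("Refund window: 14 days."),  # text has changed since indexing
    "pricing": fingerprint("Pro plan: 49 EUR."),         # unchanged
    "old-faq": fingerprint("Retired page."),             # deleted at the source
}
to_index, to_delete = plan_reindex(current, indexed)
print(to_index, to_delete)
```

The delete list matters as much as the re-index list: a page removed at the source should stop being retrievable too.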

Sources: Anthropic, Introducing Contextual Retrieval (failure-rate benchmarks) · PremAI, Building Production RAG: architecture, chunking, evaluation & monitoring (2026) · StackAI, RAG best practices: chunking, embeddings, reranking and hybrid search.

Got a client asking for a copilot on their own data? Plan a call — we'll build the retrieval stack and hand it over under your brand.