Why does your RAG fall apart the moment you ship it?

There's a very familiar story. A team builds a RAG demo — let's say the company's customer-info chatbot. They drop 10-15 PDFs into a vector DB, ask questions, get answers. They demo it to leadership. Leadership likes it. Green light.

Then they ship it. Not with 10 documents but with 10,000. And it all falls apart.

Answers suddenly cite the wrong document. A user asks about the 2024 policy and gets quoted the 2019 one. Users see each other's data. Latency climbs to 8 seconds. The bill is three times what they expected.

And the team is bewildered. "But it worked in the demo…"

The demo isn't production

Here's how I usually explain it: RAG is a very strange technology because the demo version is genuinely simple. A LangChain quickstart, half an hour, it works. That's why it's so misleading. Everyone assumes that what works on 10 documents will work on 10,000. It won't.

What works in the demo is 10 carefully curated PDFs that almost certainly contain the answer. Vector search can hardly pick wrong there: every chunk is relevant, because every chunk is about the same topic anyway.

On 10,000 documents the game is different. For a given question, the top 5 will typically contain one truly relevant chunk and four that just look semantically similar. The model draws on all five. Then it cites. Then it hallucinates.

And it's not because the vector DB is bad. It's because vector search is fast but noisy. The demo didn't show it because there was nothing to confuse it with. Production shows it because there is.

The first 3 things that will go wrong

I don't want to turn this into a RAG tutorial — we wrote up the details in the knowledge base piece. But there are three things that will show up, and that are always worth thinking about early.

Chunking. How you cut documents into pieces. Most teams cut on fixed character counts because that's what the quickstart does. This slices sentences, tables and code blocks in half. The model later doesn't understand what the chunk is about because the start or end is missing. Cut on semantic boundaries — paragraph, section, header. It's not hard. It just takes discipline.
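
A minimal sketch of what cutting on semantic boundaries could look like, assuming the documents have already been converted to Markdown-like text with blank lines between paragraphs. The function name `chunkByParagraph` and the `maxChars` default are illustrative assumptions, not values from the knowledge base piece.

```typescript
// Hypothetical helper: splits text on blank-line paragraph boundaries,
// keeps fenced code blocks whole, starts a new chunk at every Markdown
// header, and packs paragraphs together up to a soft size limit.
const FENCE_MARKER = "`".repeat(3); // three backticks

function chunkByParagraph(text: string, maxChars = 1200): string[] {
  // Pass 1: collect blocks separated by blank lines (outside code fences).
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of text.split(/\r?\n/)) {
    if (line.trimStart().startsWith(FENCE_MARKER)) inFence = !inFence;
    if (line.trim() === "" && !inFence) {
      if (current.length > 0) blocks.push(current.join("\n"));
      current = [];
    } else {
      current.push(line);
    }
  }
  if (current.length > 0) blocks.push(current.join("\n"));

  // Pass 2: pack blocks into chunks without ever splitting a block.
  const chunks: string[] = [];
  let buffer = "";
  for (const block of blocks) {
    const startsSection = /^#{1,6}\s/.test(block);
    const candidate = buffer === "" ? block : `${buffer}\n\n${block}`;
    if (buffer !== "" && (startsSection || candidate.length > maxChars)) {
      chunks.push(buffer);
      buffer = block;
    } else {
      buffer = candidate;
    }
  }
  if (buffer !== "") chunks.push(buffer);
  return chunks;
}
```

The limit is soft on purpose: a single block longer than `maxChars` stays whole rather than being cut mid-sentence. Tables and PDF layouts would need their own boundary detection, which this sketch does not attempt.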

The "tenant A sees tenant B's data" bug. The most common security hole in multi-tenant systems. The naive approach: query the top 10 chunks, then filter out anything that doesn't belong to the current tenant in JavaScript. Never do this. The filter has to run inside the vector DB, at query level, for two reasons. One: if the JS filter is ever skipped by accident, you have a data leak. Two: if all top 10 chunks belong to another tenant, you get an empty answer and don't know why.
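
To make the difference concrete, here is a small in-memory sketch. It is a stand-in for a real vector DB, where the same idea usually appears as a metadata filter passed along with the query. The `Chunk` shape and both function names are assumptions made for illustration.

```typescript
interface Chunk {
  id: string;
  tenantId: string;
  embedding: number[];
  text: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function rank(chunks: Chunk[], query: number[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Query-level filter: only the tenant's own chunks are ever ranked.
function searchForTenant(index: Chunk[], query: number[], tenantId: string, k: number): Chunk[] {
  return rank(index.filter((c) => c.tenantId === tenantId), query, k);
}

// Anti-pattern, shown only for comparison: rank everything, filter afterwards.
function searchThenFilter(index: Chunk[], query: number[], tenantId: string, k: number): Chunk[] {
  return rank(index, query, k).filter((c) => c.tenantId === tenantId);
}
```

If another tenant's chunks happen to score highest, `searchThenFilter` returns fewer than `k` results, possibly none, while `searchForTenant` still returns the best chunks the tenant actually owns. The leak risk is also structural: in the second function, foreign data has already left the store before the filter runs.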

Permission to say "I don't know". If your system prompt doesn't explicitly say "if the source doesn't contain the answer, say so", the model will almost always produce an answer anyway, even when the chunks don't contain one. Its default personality is "helpful assistant", and a helpful assistant answers. If you don't tell it that not answering is allowed, it rarely will.
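
One way to write that permission down, as a hedged sketch: the wording of the instructions, the `REFUSAL` string and the `buildMessages` helper are all illustrative assumptions, and the message shape simply mirrors the common system/user chat format rather than any specific SDK.

```typescript
interface RetrievedChunk {
  source: string;
  text: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// A fixed refusal string makes "I don't know" answers easy to count in logs.
const REFUSAL = "I don't know based on the provided sources.";

function buildMessages(question: string, chunks: RetrievedChunk[]): ChatMessage[] {
  const system = [
    "Answer only from the sources provided in the user message.",
    `If the sources do not contain the answer, reply exactly: "${REFUSAL}"`,
    "Do not guess and do not use outside knowledge.",
  ].join("\n");

  // Number the sources so the model can refer to them by label.
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.source}\n${c.text}`)
    .join("\n\n");

  const user = `Sources:\n${context === "" ? "(none)" : context}\n\nQuestion: ${question}`;
  return [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
}
```

The prompt alone does not guarantee a refusal, so it is worth including a few unanswerable questions in whatever test set you run against the system.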

Measurement

This is the piece I see built in least often. The team builds the RAG, ships it, and never measures it. "It works," until a customer complains. Then comes panic and troubleshooting, and no one understands why it broke, because there's no baseline. Broken compared to what?

The fix: 20-50 hand-written question-expected-answer pairs, run against the system every week or after every change. If the new chunking strategy drops Recall@5 from 0.84 to 0.71, you see it immediately and roll it back. Not three weeks later via a complaint.
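
A sketch of such a check, under stated assumptions: each test case also records the ID of the source that should be retrieved, Recall@5 is computed as the share of questions whose expected source appears in the top 5 results, and the retriever is synchronous for brevity (a real one would almost certainly be async). The names `EvalCase`, `recallAtK` and `hasRegressed`, and the tolerance value, are hypothetical.

```typescript
interface EvalCase {
  question: string;
  expectedSourceId: string; // assumption: labeled by hand alongside the expected answer
}

// Returns the IDs of the retrieved sources, best match first.
type Retriever = (question: string, k: number) => string[];

function recallAtK(cases: EvalCase[], retrieve: Retriever, k = 5): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const ids = retrieve(c.question, k).slice(0, k);
    if (ids.includes(c.expectedSourceId)) hits++;
  }
  return hits / cases.length;
}

// Flags a drop larger than the tolerance compared to the stored baseline.
function hasRegressed(baseline: number, current: number, tolerance = 0.02): boolean {
  return baseline - current > tolerance;
}
```

Run `recallAtK` over the hand-written set after every change, store the result, and let `hasRegressed` fail the release when the number drops. With the figures from the paragraph above, `hasRegressed(0.84, 0.71)` is true.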

It's not rocket science. A few hours of work at the start, and a few minutes per release after that. In return your system isn't running blind.

The double standard

And here's what, in my view, makes a lot of RAG projects collapse: the team measures by two different yardsticks.

They measure the demo by whether the answer "subjectively looks good". They measure production… not at all. But leadership's expectation is the same as if they did: 95% accurate answers, always.

So the team runs a system on real customers without being able to tell whether last week it was right 60% of the time or 90%. They only find out when someone asks. And by then it's too late.

A good RAG system isn't clever. It's disciplined. You measure every layer, you check every change, you log every error. That discipline is much easier to bake in early than to retrofit: two days of work at the start, two months if you add it later.

And the last 2-5% is always human-in-the-loop. Accept that. Anyone promising a 100% autonomous RAG system either doesn't understand it or is lying.

If you want to go deeper: we worked out the full architecture — chunking, hybrid search, re-ranking, query rewriting, eval suite, production gotchas, latency budget — in our knowledge base piece on RAG in Practice, with TypeScript code examples.
