A RAG demo takes 30 minutes. A production RAG system takes 3 months. The difference: chunking, re-ranking, evaluation and monitoring.
Intro: why "the LangChain quickstart" isn't enough
If you've ever built a RAG system, you know the script:
- You download the LangChain (or LlamaIndex) quickstart
- You ingest 10 PDFs into a vector DB
- You ask a question → you get an answer → it works
- You demo it to leadership → they like it
- You hook up 10,000 documents → it all falls apart
The problem: quickstart RAG works on a toy problem. Production RAG is a different beast: irrelevant chunks, long-context degradation, multi-tenant isolation, refresh cycles, monitoring, cost explosion.
This piece is about building a production RAG system — with concrete code, measurable best practices, and the mistakes we (and others) have already made.
Layers of the RAG architecture
Naive RAG looks like this:
Question → Embedding → Vector search → Top-K → LLM → Answer
Production RAG looks like this:
Question
↓
[Query rewriting / decomposition]
↓
[Hybrid search: vector + BM25]
↓
[Re-ranking (cross-encoder)]
↓
[Context assembly + deduplication]
↓
[LLM with structured prompt + citations]
↓
[Response validation + citation check]
↓
[Logging + evaluation]
Every layer is skippable — but every skip degrades quality. The question isn't whether you need it, but at which phase you introduce it.
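The "context assembly + deduplication" layer is the one step in the diagram that gets no code later in this piece, so here is a minimal sketch of what it might look like. It assumes the chunks arrive already sorted by relevance, treats two chunks as duplicates when their normalized text is identical, and approximates tokens as characters divided by four. The `RankedChunk` type and all function names are hypothetical.

```typescript
// Hypothetical shape; a real chunk type would also carry metadata.
type RankedChunk = { id: string; content: string };

// Rough token estimate (assumption: ~4 characters per token for English text).
// Swap in a real tokenizer for accurate budgeting.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Normalize whitespace and case so trivially different copies compare equal.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Keeps chunks in ranked order, drops normalized duplicates, and skips any
// chunk that would push the running total over the token budget.
function assembleContext(ranked: RankedChunk[], maxTokens: number): RankedChunk[] {
  const seen = new Set<string>();
  const selected: RankedChunk[] = [];
  let used = 0;

  for (const chunk of ranked) {
    const key = normalize(chunk.content);
    if (seen.has(key)) continue; // duplicate of a higher-ranked chunk
    const cost = approxTokens(chunk.content);
    if (used + cost > maxTokens) continue; // does not fit; a smaller chunk still might
    seen.add(key);
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

Exact-match deduplication only catches verbatim copies (the same boilerplate paragraph in ten PDFs). Near-duplicates need something fuzzier, such as comparing embeddings, which this sketch deliberately leaves out.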
Ingestion — the most underestimated part
Chunking — the most important decision
Chunk size decides system quality. Too small → context lost. Too large → relevance diluted.
Empirical sizing:
| Use case | Chunk size | Overlap |
|---|---|---|
| FAQ / short policy | 200-300 tokens | 30 tokens |
| Technical documentation | 400-600 tokens | 50 tokens |
| Long prose | 600-800 tokens | 100 tokens |
| Code | semantic (function / class) | 0 |
Bad chunking: fixed character count (e.g. 1000 chars). It cuts sentences and sections in half.
Good chunking: cuts on semantic boundaries (paragraph, section, header).
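import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  // Note: by default this splitter measures chunkSize and chunkOverlap in
  // characters, not tokens. Use a token-aware length function or splitter
  // if you want the sizes to match the token counts in the table above.
  chunkSize: 500,
  chunkOverlap: 50,
  // Tried in order: the splitter falls back to the next separator
  // only when a piece is still too large.
  separators: [
    "\n## ",  // markdown header
    "\n### ", // markdown subheader
    "\n\n",   // paragraph
    "\n",     // line
    ". ",     // sentence
    " ",      // word (last resort)
  ],
});

const chunks = await splitter.splitDocuments(documents);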
Metadata — the secret weapon
A chunk on its own is context-free. If the vector DB stores only the content, the model can't tell where a chunk came from, when it was written, or who owns it.
Minimum metadata:
type ChunkMetadata = {
  documentId: string;
  documentTitle: string;
  source: string; // URL or filename
  sectionPath: string[]; // ["Chapter 3", "3.2 Pricing"]
  pageNumber?: number;
  createdAt: Date;
  updatedAt: Date;
  tenantId: string; // MANDATORY for multi-tenant
  permissions?: string[]; // who can see it?
  language: string;
  documentType: "policy" | "faq" | "contract" | "email" | "other";
};
Metadata enables:
- Pre-search filtering (e.g. only the current tenant's documents)
- Source citation in the answer
- Freshness filtering (e.g. last 6 months only)
- Permission checks (who's allowed to see it)
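To make those four uses concrete, here is a minimal sketch of the visibility rule they imply, written as a plain predicate. In production the same conditions belong in the vector DB's query filter; a predicate like this mainly spells out the logic and can double as a defense-in-depth check on whatever the DB returns. `RequestContext`, its fields, and the "no `permissions` means visible to the whole tenant" rule are assumptions for illustration.

```typescript
// Subset of the ChunkMetadata fields that drive access and freshness decisions.
type FilterableMetadata = {
  tenantId: string;
  permissions?: string[]; // groups allowed to read; undefined = whole tenant (assumption)
  updatedAt: Date;
};

// Hypothetical request context; field names are illustrative.
type RequestContext = {
  tenantId: string;
  groups: string[];
  maxAgeDays?: number; // optional freshness window
  now: Date;
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isChunkVisible(meta: FilterableMetadata, ctx: RequestContext): boolean {
  // 1. Tenant isolation: never cross tenants.
  if (meta.tenantId !== ctx.tenantId) return false;

  // 2. Permissions: the caller must share at least one group with the chunk.
  if (meta.permissions !== undefined) {
    const allowed = meta.permissions.some((g) => ctx.groups.indexOf(g) !== -1);
    if (!allowed) return false;
  }

  // 3. Freshness: drop chunks older than the requested window.
  if (ctx.maxAgeDays !== undefined) {
    const ageMs = ctx.now.getTime() - meta.updatedAt.getTime();
    if (ageMs > ctx.maxAgeDays * MS_PER_DAY) return false;
  }
  return true;
}
```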
Embedding model selection
| Model | Dimensions | Cost (1M tokens) | When? |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Default, good price / value |
| text-embedding-3-large | 3072 | $0.13 | When you need high precision |
| bge-m3 (open source) | 1024 | self-host | Multilingual, on-prem |
| voyage-3 | 1024 | $0.06 | Specific domains (code, legal) |
For non-English content: multilingual models (text-embedding-3-*, bge-m3) work well. Avoid English-only models (all-MiniLM-L6-v2 is ~30% worse on non-English text).
Important: if you switch models, you have to re-embed the entire corpus. Plan for it.
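"Plan for it" mostly means two things: know roughly what a full re-embed costs, and know which vectors were produced by which model. A minimal sketch of both follows. The per-million prices come from the table above; the `StoredVector` shape and the idea of storing an `embeddingModel` field next to each vector are assumptions, not something the metadata type above already contains.

```typescript
// Cost of embedding a corpus once, given a per-million-token price.
function estimateEmbeddingCostUsd(totalTokens: number, pricePerMillionUsd: number): number {
  return (totalTokens / 1e6) * pricePerMillionUsd;
}

// Hypothetical record: which model produced each stored vector.
type StoredVector = { chunkId: string; embeddingModel: string };

// Chunks whose vectors came from a different model than the current one.
// Vectors from different models are not comparable, so these must be re-embedded.
function chunksToReembed(stored: StoredVector[], currentModel: string): string[] {
  return stored
    .filter((v) => v.embeddingModel !== currentModel)
    .map((v) => v.chunkId);
}

// Illustrative corpus (assumed numbers): 10,000 documents x 20 chunks x 500 tokens.
const corpusTokens = 10000 * 20 * 500; // 100M tokens
const smallCost = estimateEmbeddingCostUsd(corpusTokens, 0.02); // about $2
const largeCost = estimateEmbeddingCostUsd(corpusTokens, 0.13); // about $13
```

At this scale the API bill is rarely the problem. The re-index time, and serving two indexes side by side during the switch, usually matter more.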
Retrieval — more than cosine similarity
Hybrid search: vector + keyword
Pure vector search is bad when:
- Exact phrasing matters (product code, error code, name)
- There are rare technical terms (the model doesn't know them well)
- The question is short and specific
Solution: hybrid search = vector + BM25 (keyword). Combine results with Reciprocal Rank Fusion (RRF):
function reciprocalRankFusion(
  results: { id: string; rank: number }[][],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const resultList of results) {
    for (const item of resultList) {
      const current = scores.get(item.id) || 0;
      scores.set(item.id, current + 1 / (k + item.rank));
    }
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage:
const vectorResults = await vectorDb.search(queryEmbedding, { topK: 20 });
const keywordResults = await elasticsearch.search(query, { size: 20 });
const merged = reciprocalRankFusion([
  vectorResults.map((r, i) => ({ id: r.id, rank: i })),
  keywordResults.map((r, i) => ({ id: r.id, rank: i })),
]);
The pgvector + tsvector combination solves it inside Postgres — no separate Elasticsearch needed.
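As a rough illustration of the Postgres-only route, the sketch below builds one parameterized query that runs both searches and fuses them with RRF in SQL. The schema is hypothetical: a `chunks` table with `id`, `tenant_id`, a `content_tsv` tsvector column and an `embedding` vector column. `<=>` is pgvector's cosine-distance operator; `plainto_tsquery` and `ts_rank_cd` are standard Postgres full-text functions. Ranks here start at 1 because they come from `ROW_NUMBER()`.

```typescript
// Parameterized query in the { text, values } shape that node-postgres accepts.
type SqlQuery = { text: string; values: (string | number)[] };

// Hypothetical schema: chunks(id, tenant_id, content_tsv tsvector, embedding vector).
function buildHybridQuery(
  queryEmbedding: number[],
  queryText: string,
  tenantId: string,
  limit = 20
): SqlQuery {
  const text = `
WITH vector_hits AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rnk
  FROM chunks
  WHERE tenant_id = $3
  ORDER BY embedding <=> $1::vector
  LIMIT $4
),
keyword_hits AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, q) DESC) AS rnk
  FROM chunks, plainto_tsquery('english', $2) AS q
  WHERE tenant_id = $3 AND content_tsv @@ q
  ORDER BY ts_rank_cd(content_tsv, q) DESC
  LIMIT $4
)
SELECT id, SUM(1.0 / (60 + rnk)) AS score
FROM (
  SELECT id, rnk FROM vector_hits
  UNION ALL
  SELECT id, rnk FROM keyword_hits
) AS hits
GROUP BY id
ORDER BY score DESC
LIMIT $4`;

  // pgvector accepts a vector as a text literal like '[0.1,0.2,...]'.
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  return { text, values: [vectorLiteral, queryText, tenantId, limit] };
}
```

The tenant filter sits inside both CTEs on purpose, so isolation happens in the database rather than after the fact. The `'english'` text-search configuration is an assumption; pick one that matches your corpus language.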
Re-ranking — the quality jump
Vector search is fast but noisy. The top-20 results often contain 5-10 irrelevant chunks. The fix: a re-ranker model.
The re-ranker is a cross-encoder: it processes the question and each chunk together and outputs a relevance score. Slower (50-200ms), but significantly more accurate.
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerank(query: string, candidates: Chunk[]): Promise<Chunk[]> {
  const response = await cohere.rerank({
    model: "rerank-multilingual-v3.0",
    query,
    documents: candidates.map((c) => c.content),
    topN: 5,
  });
  // Results come back sorted by relevance; index points into the input array
  return response.results.map((r) => candidates[r.index]);
}

// Pipeline:
const top20 = await hybridSearch(query); // fast, noisy
const top5 = await rerank(query, top20); // slow, accurate
const answer = await llm(query, top5); // only top-5 reaches the LLM
Measured experience: without re-ranking, top-5 relevance ratio is ~60%; with re-ranking, ~85-90%. This directly reduces hallucination.
Query rewriting
User questions are poorly formed: short, conversational, context-dependent.
"What's the status?" ← unsearchable
"How is the Kovács project doing now?" ← searchable
Solution: use an LLM to rewrite the question based on the conversation history:
async function rewriteQuery(
  history: Message[],
  userQuery: string
): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // a cheap model is enough
    messages: [{
      role: "system",
      content: `Generate 2-3 independent search queries from the
conversation and the user's last message.
Each query should be self-contained, interpretable without context.
Return a JSON object: {"queries": ["...", "..."]}`
    }, {
      role: "user",
      content: `History:\n${formatHistory(history)}\n\nNew message: ${userQuery}`
    }],
    response_format: { type: "json_object" }
  });
  // content can be null; fall back to the original query if nothing usable comes back
  const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
  return Array.isArray(parsed.queries) && parsed.queries.length > 0
    ? parsed.queries
    : [userQuery];
}

// Multi-query retrieval
const queries = await rewriteQuery(history, userInput);
const allResults = await Promise.all(queries.map(q => hybridSearch(q)));
const merged = reciprocalRankFusion(
  allResults.map(r => r.map((c, i) => ({ id: c.id, rank: i })))
);
Generation — the prompt as a contract
You can't just glue retrieved chunks to the end of the prompt. The prompt is a contract with the model.
A good RAG system prompt
const RAG_SYSTEM_PROMPT = `You are an expert assistant.
Answer ONLY based on the provided SOURCES. Rules:
1. If the sources don't contain the answer, say:
"I have no information on this in the documentation."
Do NOT make up an answer.
2. For every factual claim, attach a source citation: [Source: <id>]
3. If sources contradict each other, flag it:
"The sources disagree: [Source A] says X,
while [Source B] says Y."
4. Do NOT rely on your training knowledge. Only on sources.
5. If the question needs clarification, ask back.`;
const userPrompt = `SOURCES:
${chunks.map((c, i) => `
[Source ${c.id} | ${c.source} | ${c.sectionPath.join(" > ")}]
${c.content}
`).join("\n---\n")}
QUESTION: ${userQuery}`;
Concrete tricks:
- Put sources first, question last (defends against lost-in-the-middle)
- Source IDs should be machine-parseable (e.g. [Source: abc123])
- Explicitly permit the "no information" answer; otherwise the model will hallucinate one
Citation validation
The model can cite a non-existent source. Validate:
async function validateAndStripCitations(
  answer: string,
  validSourceIds: Set<string>
): Promise<{ answer: string; warnings: string[] }> {
  // Matches the [Source: <id>] format required by the system prompt
  const citationRegex = /\[Source:\s*([a-zA-Z0-9_-]+)\]/g;
  const warnings: string[] = [];
  const cleaned = answer.replace(citationRegex, (match, id) => {
    if (!validSourceIds.has(id)) {
      warnings.push(`Hallucinated source ID: ${id}`);
      return ""; // strip or retry
    }
    return match;
  });
  return { answer: cleaned, warnings };
}
If warnings are frequent → retry generation with a stricter prompt.
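One way to wire that retry, sketched per answer rather than per "frequent" trend: if an answer cites unknown IDs, regenerate once with the list of valid IDs appended to the system prompt. The `Generate` callback stands in for whatever LLM call you use; it and the wording of the stricter instruction are assumptions.

```typescript
// Stand-in for the real LLM call: takes a system prompt, returns the answer text.
type Generate = (systemPrompt: string) => Promise<string>;

// Returns every cited ID that is not in the set of retrieved source IDs.
function findInvalidCitations(answer: string, validIds: Set<string>): string[] {
  const re = /\[Source:\s*([a-zA-Z0-9_-]+)\]/g; // same format as the system prompt
  const invalid: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(answer)) !== null) {
    if (!validIds.has(m[1])) invalid.push(m[1]);
  }
  return invalid;
}

// Generates an answer; on hallucinated citations, retries with a stricter prompt.
// After maxAttempts the last answer is returned together with its invalid IDs,
// so the caller can strip them or fall back to "no information".
async function generateWithCitationRetry(
  generate: Generate,
  basePrompt: string,
  validIds: Set<string>,
  maxAttempts = 2
): Promise<{ answer: string; attempts: number; invalid: string[] }> {
  let prompt = basePrompt;
  let answer = "";
  let invalid: string[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    answer = await generate(prompt);
    invalid = findInvalidCitations(answer, validIds);
    if (invalid.length === 0) return { answer, attempts: attempt, invalid };
    // Stricter prompt: spell out exactly which IDs exist.
    prompt = basePrompt + "\nCite ONLY these source IDs: " + Array.from(validIds).join(", ") + ".";
  }
  return { answer, attempts: maxAttempts, invalid };
}
```

Keep `maxAttempts` low. Every retry is a full LLM call, so this trades latency and cost for fewer bad citations.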
Evaluation — without it, you're flying blind
Most teams don't measure RAG quality. "It works" — until a customer complains. Then they don't know why it broke.
The eval dataset
Build 20-100 question-answer pairs by hand, categorized:
type EvalCase = {
  id: string;
  question: string;
  expectedAnswer: string;
  expectedSources: string[]; // which chunks should come back
  category: "factual" | "comparative" | "procedural" | "edge_case";
  difficulty: "easy" | "medium" | "hard";
};
Retrieval metrics
function calculateRecallAtK(
  retrieved: string[],
  expected: string[],
  k: number
): number {
  // No expected sources (e.g. an "unanswerable" edge case): avoid 0/0 = NaN
  if (expected.length === 0) return 1;
  const top = new Set(retrieved.slice(0, k));
  const hits = expected.filter(id => top.has(id)).length;
  return hits / expected.length;
}

// Per-case measurement:
for (const testCase of evalDataset) {
  const retrieved = await hybridSearch(testCase.question);
  const recall5 = calculateRecallAtK(
    retrieved.map(r => r.id),
    testCase.expectedSources,
    5
  );
  console.log(`${testCase.id}: Recall@5 = ${recall5}`);
}
Target: Recall@5 > 0.85. Below that, the issue isn't generation — retrieval isn't finding relevant content.
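A single average can hide a weak spot: 0.9 overall may be 1.0 on easy factual questions and 0.4 on edge cases. A small sketch of breaking the number down by the `category` field of the eval cases follows; the `CaseResult` type and function name are hypothetical.

```typescript
// One row per evaluated case: its category and its Recall@K.
type CaseResult = { category: string; recall: number };

// Mean Recall@K per category, so weak categories stand out.
function meanRecallByCategory(results: CaseResult[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of results) {
    const entry = sums.get(r.category) ?? { total: 0, count: 0 };
    entry.total += r.recall;
    entry.count += 1;
    sums.set(r.category, entry);
  }
  const means = new Map<string, number>();
  sums.forEach((v, category) => means.set(category, v.total / v.count));
  return means;
}
```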
Generation metrics — LLM-as-a-judge
You evaluate answer quality with an LLM:
async function evaluateAnswer(
  question: string,
  expectedAnswer: string,
  actualAnswer: string,
  sources: string[]
): Promise<EvalScore> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "system",
      content: `Score the answer on (1-5):
- faithfulness: relies only on sources?
- relevance: answers the question?
- completeness: contains all needed info?
- correctness: matches expected answer?
JSON: {faithfulness, relevance, completeness, correctness, reasoning}`
    }, {
      role: "user",
      content: `Question: ${question}
Expected: ${expectedAnswer}
Actual: ${actualAnswer}
Sources: ${sources.join("\n")}`
    }],
    response_format: { type: "json_object" },
    temperature: 0 // keep the judge as repeatable as possible
  });
  // content can be null; an empty object then surfaces as missing scores
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
Industry tools: Ragas, TruLens, DeepEval. Don't roll your own unless you have to.
CI/CD integration
Run every RAG change (new chunking, new embedder, new prompt) through the eval suite:
npm run rag:eval
# Recall@5: 0.87 (was 0.84) ✓
# Faithfulness: 4.6/5 (was 4.5) ✓
# Cost per query: $0.012 (was $0.011) ⚠
Regressions surface immediately — not 2 weeks later in a customer complaint.
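The gate itself can be very small. Below is a minimal sketch that compares the current run against a stored baseline and returns a list of failures; a CI step would exit non-zero when the list is not empty. The metric names mirror the output above, while the tolerance values are assumptions you would tune to the noise of your own eval suite.

```typescript
type EvalMetrics = { recallAt5: number; faithfulness: number; costPerQueryUsd: number };

// Assumed tolerances: how much worse a metric may get before the gate fails.
type Tolerances = { recallDrop: number; faithfulnessDrop: number; costRatio: number };

const DEFAULT_TOLERANCES: Tolerances = {
  recallDrop: 0.02,      // absolute drop in Recall@5
  faithfulnessDrop: 0.1, // absolute drop on the 1-5 judge scale
  costRatio: 1.2,        // current cost may be at most 1.2x baseline
};

// Returns human-readable failures; an empty array means the change may ship.
function checkRegression(
  baseline: EvalMetrics,
  current: EvalMetrics,
  tol: Tolerances = DEFAULT_TOLERANCES
): string[] {
  const failures: string[] = [];
  if (current.recallAt5 < baseline.recallAt5 - tol.recallDrop) {
    failures.push(`Recall@5 dropped: ${baseline.recallAt5} -> ${current.recallAt5}`);
  }
  if (current.faithfulness < baseline.faithfulness - tol.faithfulnessDrop) {
    failures.push(`Faithfulness dropped: ${baseline.faithfulness} -> ${current.faithfulness}`);
  }
  if (current.costPerQueryUsd > baseline.costPerQueryUsd * tol.costRatio) {
    failures.push(`Cost per query rose: ${baseline.costPerQueryUsd} -> ${current.costPerQueryUsd}`);
  }
  return failures;
}
```

With the numbers from the sample output (0.84 to 0.87, 4.5 to 4.6, $0.011 to $0.012) this gate passes: quality went up and the cost increase stays under the assumed 20% ceiling.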
Production gotchas
Stale data
The vector DB doesn't refresh itself. You need:
- Webhook from the source system (CMS, Confluence, Drive) → re-embed
- Scheduled re-index (daily / weekly)
- Soft delete: when a document changes or is removed, retire its old chunks too; overwriting alone leaves stale chunks behind
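For the scheduled re-index, the core is a diff between what the source system has and what the index has. A minimal sketch follows, assuming both sides can list document IDs with an `updatedAt` timestamp (the same field the chunk metadata carries). `DocVersion`, `ReindexPlan` and `planReindex` are hypothetical names.

```typescript
type DocVersion = { documentId: string; updatedAt: Date };

type ReindexPlan = {
  toEmbed: string[];  // new or changed documents: remove old chunks, then re-embed
  toDelete: string[]; // documents that no longer exist in the source
};

// Compares the source system's view with the index's view.
function planReindex(source: DocVersion[], indexed: DocVersion[]): ReindexPlan {
  const indexedAt = new Map<string, number>();
  for (const d of indexed) indexedAt.set(d.documentId, d.updatedAt.getTime());

  const sourceIds = new Set<string>();
  const toEmbed: string[] = [];
  for (const d of source) {
    sourceIds.add(d.documentId);
    const known = indexedAt.get(d.documentId);
    // Not indexed yet, or the source copy is newer than the indexed one.
    if (known === undefined || d.updatedAt.getTime() > known) toEmbed.push(d.documentId);
  }

  const toDelete = indexed
    .filter((d) => !sourceIds.has(d.documentId))
    .map((d) => d.documentId);

  return { toEmbed, toDelete };
}
```

Timestamps are the cheapest signal, but they depend on the source system updating them reliably. Where it doesn't, comparing a content hash per document is the usual substitute.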
Multi-tenant isolation
Never let tenant A query tenant B's data:
// BAD: filter in application code, after the search
const allChunks = await vectorDb.search(query, { topK: 10 });
const filtered = allChunks.filter(c => c.tenantId === userTenantId);

// GOOD: filter at query level (in the DB)
const chunks = await vectorDb.search(query, {
  topK: 10,
  filter: { tenantId: userTenantId }
});
Filtering after the search is a data-leak risk: one missed filter and another tenant's chunks reach the prompt. And if the top-K all belong to a different tenant, you get an empty result and don't know why.
Cost monitoring
In RAG, costs run away easily:
- Embedding (one-off, but in bulk)
- Vector DB hosting
- LLM calls (multiplied by context size!)
- Re-ranker calls
Rule of thumb: log token counts for every user query. Review the most expensive 1% of queries weekly — that's where 80% of the spend lives.
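A minimal sketch of that weekly review: price each logged query, sort, and look at the top slice and its share of total spend, which also lets you check the 80% figure against your own data. The `Pricing` values are whatever your provider currently charges; nothing here assumes specific prices, and all type and function names are hypothetical.

```typescript
// One log row per user query (LLM tokens only; add re-ranker cost if you log it).
type QueryUsage = { queryId: string; promptTokens: number; completionTokens: number };

// Per-million-token prices; fill in from your provider's price list.
type Pricing = { inputPerMillionUsd: number; outputPerMillionUsd: number };

function queryCostUsd(u: QueryUsage, p: Pricing): number {
  return (u.promptTokens / 1e6) * p.inputPerMillionUsd
    + (u.completionTokens / 1e6) * p.outputPerMillionUsd;
}

// Returns the most expensive `fraction` of queries (at least one)
// and the share of total spend they account for.
function expensiveTail(
  usages: QueryUsage[],
  pricing: Pricing,
  fraction = 0.01
): { top: { queryId: string; costUsd: number }[]; shareOfSpend: number } {
  const costed = usages.map((u) => ({ queryId: u.queryId, costUsd: queryCostUsd(u, pricing) }));
  costed.sort((a, b) => b.costUsd - a.costUsd);

  const n = Math.max(1, Math.ceil(costed.length * fraction));
  const top = costed.slice(0, n);

  const total = costed.reduce((sum, c) => sum + c.costUsd, 0);
  const topTotal = top.reduce((sum, c) => sum + c.costUsd, 0);
  return { top, shareOfSpend: total > 0 ? topTotal / total : 0 };
}
```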
Latency budget
| Step | Typical latency |
|---|---|
| Query embedding | 50-100ms |
| Vector search (10K chunks) | 20-50ms |
| Vector search (10M chunks) | 100-300ms |
| BM25 | 30-100ms |
| Re-ranking (top-20) | 100-300ms |
| LLM generation (streamed) | 500ms - 3s |
Total: 1-4s P95. If that doesn't fit, stream (SSE) to the frontend.
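The table only helps if you measure your own pipeline against it. A minimal sketch of per-stage timing plus a nearest-rank percentile for computing P95 from collected samples follows. `timed` and `percentile` are hypothetical helpers, and `Date.now()` gives millisecond resolution, which is enough at these magnitudes.

```typescript
type StageTiming = { stage: string; ms: number };

// Runs one pipeline stage and records how long it took, even if it throws.
async function timed<T>(
  stage: string,
  timings: StageTiming[],
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    timings.push({ stage, ms: Date.now() - start });
  }
}

// Nearest-rank percentile: p is an integer percent, e.g. 95 for P95.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) return 0;
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(0, rank - 1)];
}
```

Usage would look like `const hits = await timed("rerank", timings, () => rerank(query, top20));`, with the `timings` array logged per request and `percentile(totals, 95)` computed over a day's worth of totals.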
When NOT to use RAG
RAG isn't good for everything. Don't use it for:
- Creative tasks (marketing copy, brainstorming) — factuality isn't the point
- Consistent-style output (e.g. brand voice) — that needs fine-tuning
- Real-time aggregation ("how much revenue did we make today?") — needs DB query / SQL agent
- Complex multi-step workflows — needs an agent, not RAG
- Structured data (tables, lists) — text-to-SQL is better
RAG works best in the sweet spot where:
- You have well-structured textual documents
- The user asks (doesn't create)
- The answer needs to be source-backed
Summary: 8 takeaways
- Naive RAG is fine for a demo, not for production — chunking, hybrid search, re-ranking and eval are all required.
- Chunking is the foundation — cut on semantic boundaries, not fixed character counts. 400-600 tokens is the default.
- Metadata > content — citation, filtering, permissions, freshness all hinge on it.
- Hybrid search (vector + BM25) + re-ranker = ~85% top-5 relevance. That's the bar.
- Query rewriting is mandatory for context-aware conversations — without it, retrieval dies by message 3.
- The prompt is a contract — permit "no information", require citations, validate source IDs.
- An eval suite is mandatory — 20-100 questions, Recall@K, LLM-as-a-judge. Without it, you're blind.
- Multi-tenant isolation at the DB level, not in application code after retrieval. This isn't optimization; it's security.
A good RAG system isn't clever; it's disciplined. Measure every layer, log every error, gate every change behind eval. Do that and a demo RAG that's right about 80% of the time becomes a 95-98% production RAG.
The last 2-5% is always human-in-the-loop. Accept that.