RAG in Practice — Production-Grade Retrieval Augmented Generation

A RAG demo takes 30 minutes. A production RAG system takes 3 months. The difference: chunking, re-ranking, evaluation and monitoring.

Intro: why "the LangChain quickstart" isn't enough

If you've ever built a RAG system, you know the script:

You download the LangChain (or LlamaIndex) quickstart
You ingest 10 PDFs into a vector DB
You ask a question → you get an answer → it works
You demo it to leadership → they like it
You hook up 10,000 documents → it all falls apart

The problem: quickstart RAG works on a toy problem. Production RAG is a different beast: irrelevant chunks, long-context degradation, multi-tenant isolation, refresh cycles, monitoring, cost explosion.

This piece is about building a production RAG system — with concrete code, measurable best practices, and the mistakes we (and others) have already made.

Layers of the RAG architecture

Naive RAG looks like this:

Question → Embedding → Vector search → Top-K → LLM → Answer

Production RAG looks like this:

Question
  ↓
[Query rewriting / decomposition]
  ↓
[Hybrid search: vector + BM25]
  ↓
[Re-ranking (cross-encoder)]
  ↓
[Context assembly + deduplication]
  ↓
[LLM with structured prompt + citations]
  ↓
[Response validation + citation check]
  ↓
[Logging + evaluation]

Every layer is skippable — but every skip degrades quality. The question isn't whether you need it, but at which phase you introduce it.

Ingestion — the most underestimated part

Chunking — the most important decision

Chunk size decides system quality. Too small → context lost. Too large → relevance diluted.

Empirical sizing:

Use case	Chunk size	Overlap
FAQ / short policy	200-300 tokens	30 tokens
Technical documentation	400-600 tokens	50 tokens
Long prose	600-800 tokens	100 tokens
Code	semantic (function / class)	0

Bad chunking: fixed character count (e.g. 1000 chars). It cuts sentences and sections in half.

Good chunking: cuts on semantic boundaries (paragraph, section, header).

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
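  // Note: sizes are counted in characters by default, not tokens.
  // To size in tokens, pass a token-counting `lengthFunction`.
  chunkSize: 500,
  chunkOverlap: 50,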
  separators: [
    "\n## ",      // markdown header
    "\n### ",     // markdown subheader
    "\n\n",       // paragraph
    "\n",         // line
    ". ",         // sentence
    " ",          // word (last resort)
  ],
});

const chunks = await splitter.splitDocuments(documents);
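To make "cut on semantic boundaries" concrete without any library, here is a minimal sketch that packs whole paragraphs into chunks under a token budget. `estimateTokens` and `packParagraphs` are hypothetical helpers, and the 4-characters-per-token ratio is a rough assumption for English text, not a real tokenizer.

```typescript
// Rough token estimate: ~4 characters per token (an assumption for English text;
// use a real tokenizer when the numbers matter).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Packs whole paragraphs into chunks that stay under maxTokens.
// A paragraph is never cut in half; a single oversized paragraph
// becomes its own chunk. The "\n\n" joiners are not counted in the budget.
function packParagraphs(text: string, maxTokens: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const paragraph of paragraphs) {
    const tokens = estimateTokens(paragraph);
    // Flush the running chunk before it would exceed the budget.
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push(current.join("\n\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(paragraph);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

This sketch has no overlap and no fallback for oversized paragraphs; a recursive splitter adds both by descending to finer separators.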

Metadata — the secret weapon

A chunk on its own is context-free. If the vector DB stores only content, the model doesn't know where the chunk came from, when it was written, or who owns it.

Minimum metadata:

type ChunkMetadata = {
  documentId: string;
  documentTitle: string;
  source: string;          // URL or filename
  sectionPath: string[];   // ["Chapter 3", "3.2 Pricing"]
  pageNumber?: number;
  createdAt: Date;
  updatedAt: Date;
  tenantId: string;        // MANDATORY for multi-tenant
  permissions?: string[];  // who can see it?
  language: string;
  documentType: "policy" | "faq" | "contract" | "email" | "other";
};

Metadata enables:

Pre-search filtering (e.g. only the current tenant's documents)
Source citation in the answer
Freshness filtering (e.g. last 6 months only)
Permission checks (who's allowed to see it)
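The three filters above (tenant, permissions, freshness) boil down to one predicate over the metadata. The sketch below spells out that logic in memory; in production the same conditions belong in the vector DB's filter clause. `VisibilityMeta`, `RequestContext` and `isVisible` are hypothetical names, and treating `permissions` as a list of role names is an assumption.

```typescript
type VisibilityMeta = {
  tenantId: string;
  updatedAt: Date;
  permissions?: string[]; // assumed to hold role names; absent or empty means tenant-wide
};

type RequestContext = {
  tenantId: string;
  roles: string[];
  maxAgeDays?: number; // freshness window; undefined means no limit
  now: Date;
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isVisible(meta: VisibilityMeta, ctx: RequestContext): boolean {
  // 1. Tenant isolation always comes first.
  if (meta.tenantId !== ctx.tenantId) return false;

  // 2. Permission check: the caller needs at least one matching role.
  if (meta.permissions && meta.permissions.length > 0) {
    const allowed = meta.permissions.some((p) => ctx.roles.includes(p));
    if (!allowed) return false;
  }

  // 3. Freshness: drop chunks older than the requested window.
  if (ctx.maxAgeDays !== undefined) {
    const ageMs = ctx.now.getTime() - meta.updatedAt.getTime();
    if (ageMs > ctx.maxAgeDays * MS_PER_DAY) return false;
  }
  return true;
}
```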

Embedding model selection

Model	Dimensions	Cost (1M tokens)	When?
`text-embedding-3-small`	1536	$0.02	Default, good price / value
`text-embedding-3-large`	3072	$0.13	When you need high precision
`bge-m3` (open source)	1024	self-host	Multilingual, on-prem
`voyage-3`	1024	$0.06	Specific domains (code, legal)

For non-English content: multilingual models (text-embedding-3-*, bge-m3) work well. Avoid English-only models (all-MiniLM-L6-v2 is ~30% worse on non-English text).

Important: if you switch models, you have to re-embed the entire corpus. Plan for it.
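"Plan for it" can start with two small pieces: stamp every stored vector with the model that produced it, and estimate the re-embedding bill up front. This is a sketch under assumptions: `StoredVector`, `findStaleVectors` and `estimateReembedCostUsd` are hypothetical names, and the prices are the per-1M-token figures from the table above.

```typescript
type StoredVector = {
  chunkId: string;
  embeddingModel: string; // e.g. "text-embedding-3-small"
};

// Returns the chunk IDs whose vectors were produced by a different model.
// Vectors from different models are not comparable, so all of these need re-embedding.
function findStaleVectors(vectors: StoredVector[], currentModel: string): string[] {
  return vectors
    .filter((v) => v.embeddingModel !== currentModel)
    .map((v) => v.chunkId);
}

// API cost of embedding the whole corpus once.
function estimateReembedCostUsd(
  chunkCount: number,
  avgTokensPerChunk: number,
  pricePerMillionTokensUsd: number
): number {
  const totalTokens = chunkCount * avgTokensPerChunk;
  return (totalTokens / 1e6) * pricePerMillionTokensUsd;
}
```

For example, 100,000 chunks at 500 tokens each is 50M tokens: about $1.00 at $0.02 per 1M tokens and about $6.50 at $0.13 per 1M tokens.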

Retrieval — more than cosine similarity

Hybrid search: vector + keyword

Pure vector search is bad when:

Exact phrasing matters (product code, error code, name)
There are rare technical terms (the model doesn't know them well)
The question is short and specific

Solution: hybrid search = vector + BM25 (keyword). Combine results with Reciprocal Rank Fusion (RRF):

function reciprocalRankFusion(
  results: { id: string; rank: number }[][],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();

  for (const resultList of results) {
    for (const item of resultList) {
      const current = scores.get(item.id) || 0;
      scores.set(item.id, current + 1 / (k + item.rank));
    }
  }

  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage:
const vectorResults = await vectorDb.search(queryEmbedding, { topK: 20 });
const keywordResults = await elasticsearch.search(query, { size: 20 });

const merged = reciprocalRankFusion([
  vectorResults.map((r, i) => ({ id: r.id, rank: i + 1 })),   // RRF ranks are 1-based
  keywordResults.map((r, i) => ({ id: r.id, rank: i + 1 })),
]);
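For reference, the score that the function accumulates can be written out as follows, in the standard formulation where ranks start at 1 and the sum runs only over the result lists that contain document d:

```latex
\mathrm{RRF}(d) = \sum_{i \,:\, d \in L_i} \frac{1}{k + \mathrm{rank}_i(d)}

% Worked example with k = 60:
% d is 1st in the vector list and 3rd in the keyword list
\mathrm{RRF}(d) = \frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.01639 + 0.01587 = 0.03226

% d' is 1st in the vector list only
\mathrm{RRF}(d') = \frac{1}{60 + 1} \approx 0.01639
```

A document that both retrievers agree on outranks one that only a single retriever puts at the top, which is the behavior you want from the fusion.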

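The pgvector + tsvector combination covers both inside Postgres, with no separate Elasticsearch needed. Note that Postgres full-text ranking (ts_rank) is not BM25, but it fills the same keyword-matching role in the fusion.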
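As a rough sketch of what that can look like, the query below runs both searches and fuses them with RRF in a single statement. Everything schema-related is an assumption: a `chunks` table with an `embedding` column (pgvector), a precomputed `content_tsv` column (tsvector), a `tenant_id` column, and the `'simple'` text search configuration. `<=>` is pgvector's cosine distance operator.

```typescript
// Parameters: $1 = query embedding as a pgvector literal, $2 = query text, $3 = tenant ID.
const HYBRID_SEARCH_SQL = `
WITH vector_hits AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rnk
  FROM chunks
  WHERE tenant_id = $3
  ORDER BY embedding <=> $1::vector
  LIMIT 20
),
keyword_hits AS (
  SELECT id, ROW_NUMBER() OVER (
    ORDER BY ts_rank(content_tsv, plainto_tsquery('simple', $2)) DESC
  ) AS rnk
  FROM chunks
  WHERE tenant_id = $3
    AND content_tsv @@ plainto_tsquery('simple', $2)
  ORDER BY ts_rank(content_tsv, plainto_tsquery('simple', $2)) DESC
  LIMIT 20
)
SELECT id, SUM(1.0 / (60 + rnk)) AS rrf_score
FROM (
  SELECT id, rnk FROM vector_hits
  UNION ALL
  SELECT id, rnk FROM keyword_hits
) AS hits
GROUP BY id
ORDER BY rrf_score DESC
LIMIT 20;
`;

// pgvector accepts vectors in the text form "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}
```

You would pass `[toVectorLiteral(queryEmbedding), query, tenantId]` as the parameters through whatever Postgres client you use.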

Re-ranking — the quality jump

Vector search is fast but noisy. The top-20 results often contain 5-10 irrelevant chunks. The fix: a re-ranker model.

The re-ranker is a cross-encoder: it processes the question and each chunk together and outputs a relevance score. Slower (50-200ms), but significantly more accurate.

import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerank(query: string, candidates: Chunk[]): Promise<Chunk[]> {
  const response = await cohere.rerank({
    model: "rerank-multilingual-v3.0",
    query,
    documents: candidates.map((c) => c.content),
    topN: 5,
  });

  return response.results.map((r) => candidates[r.index]);
}

// Pipeline:
const top20 = await hybridSearch(query);          // fast, noisy
const top5 = await rerank(query, top20);          // slow, accurate
const answer = await llm(query, top5);            // only top-5 reaches the LLM

In our measurements, the share of relevant chunks in the top 5 is ~60% without re-ranking and ~85-90% with it. Fewer irrelevant chunks in the prompt directly reduces hallucination.
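The "top-5 relevance ratio" is precision@5: of the chunks handed to the LLM, how many are actually relevant. A minimal sketch, assuming you have human-labeled relevant chunk IDs per question (`precisionAtK` is a hypothetical helper name):

```typescript
// Share of the top-k retrieved chunks that are labeled relevant.
// Divides by the number of chunks actually returned (at most k),
// so a short result list is not penalized for being short.
function precisionAtK(
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number
): number {
  const top = retrievedIds.slice(0, k);
  if (top.length === 0) return 0;
  const hits = top.filter((id) => relevantIds.has(id)).length;
  return hits / top.length;
}
```

Run it on the same questions before and after the re-ranker to see the jump on your own data instead of trusting someone else's numbers.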

Query rewriting

User questions are poorly formed: short, conversational, context-dependent.

"What's the status?"                      ← unsearchable
"How is the Kovács project doing now?"    ← searchable

Solution: use an LLM to rewrite the question based on the conversation history:

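import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// formatHistory: your own helper that renders the history as plain text
async function rewriteQuery(
  history: Message[],
  userQuery: string
): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // a cheap model is enough
    messages: [{
      role: "system",
      content: `Generate 2-3 independent search queries from the
        conversation and the user's last message.
        Each query should be self-contained, interpretable without context.
        Return a JSON object: {"queries": ["...", "..."]}`
    }, {
      role: "user",
      content: `History:\n${formatHistory(history)}\n\nNew message: ${userQuery}`
    }],
    response_format: { type: "json_object" }
  });

  // content can be null; fall back to the original query if nothing usable comes back
  const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
  return Array.isArray(parsed.queries) && parsed.queries.length > 0
    ? parsed.queries
    : [userQuery];
}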

// Multi-query retrieval
const queries = await rewriteQuery(history, userInput);
const allResults = await Promise.all(queries.map(q => hybridSearch(q)));
const merged = reciprocalRankFusion(
  allResults.map(r => r.map((c, i) => ({ id: c.id, rank: i + 1 }))) // 1-based ranks
);

Generation — the prompt as a contract

You can't just glue retrieved chunks to the end of the prompt. The prompt is a contract with the model.

A good RAG system prompt

const RAG_SYSTEM_PROMPT = `You are an expert assistant.
Answer ONLY based on the provided SOURCES. Rules:

1. If the sources don't contain the answer, say:
   "I have no information on this in the documentation."
   Do NOT make up an answer.

2. For every factual claim, attach a source citation: [Source: <id>]

3. If sources contradict each other, flag it:
   "The sources disagree: [Source A] says X,
    while [Source B] says Y."

4. Do NOT rely on your training knowledge. Only on sources.

5. If the question needs clarification, ask back.`;

const userPrompt = `SOURCES:
${chunks.map((c, i) => `
[Source ${c.id} | ${c.source} | ${c.sectionPath.join(" > ")}]
${c.content}
`).join("\n---\n")}

QUESTION: ${userQuery}`;

Concrete tricks:

Put sources first, question last (defends against lost-in-the-middle)
Source IDs should be machine-parseable and match the format the prompt asks for (e.g. [Source: abc123])
Explicitly permit the "no information" answer — otherwise it'll hallucinate

Citation validation

The model can cite a non-existent source. Validate:

async function validateAndStripCitations(
  answer: string,
  validSourceIds: Set<string>
): Promise<{ answer: string; warnings: string[] }> {
  const citationRegex = /\[Source:\s*([a-zA-Z0-9_-]+)\]/g;
  const warnings: string[] = [];

  const cleaned = answer.replace(citationRegex, (match, id) => {
    if (!validSourceIds.has(id)) {
      warnings.push(`Hallucinated source ID: ${id}`);
      return ""; // strip or retry
    }
    return match;
  });

  return { answer: cleaned, warnings };
}

If warnings are frequent → retry generation with a stricter prompt.
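"Frequent" needs a number before it can trigger anything. A minimal sketch that turns validation results into a rate and a decision; `hallucinatedCitationRate` and `shouldTightenPrompt` are hypothetical helpers, and the 5% default threshold is an assumption to tune on your own traffic.

```typescript
type ValidationResult = { answer: string; warnings: string[] };

// Share of answers that contained at least one invalid source ID.
function hallucinatedCitationRate(results: ValidationResult[]): number {
  if (results.length === 0) return 0;
  const flagged = results.filter((r) => r.warnings.length > 0).length;
  return flagged / results.length;
}

// True when the rate over the given window exceeds the threshold.
function shouldTightenPrompt(
  results: ValidationResult[],
  threshold = 0.05
): boolean {
  return hallucinatedCitationRate(results) > threshold;
}
```

Feed it a rolling window (say, the last few hundred answers) rather than all-time data, so a prompt change shows up quickly.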

Evaluation

Most teams don't measure RAG quality. "It works" — until a customer complains. Then they don't know why it broke.

The eval dataset

Build 20-100 question-answer pairs by hand, categorized:

type EvalCase = {
  id: string;
  question: string;
  expectedAnswer: string;
  expectedSources: string[];     // which chunks should come back
  category: "factual" | "comparative" | "procedural" | "edge_case";
  difficulty: "easy" | "medium" | "hard";
};

Retrieval metrics

function calculateRecallAtK(
  retrieved: string[],
  expected: string[],
  k: number
): number {
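  // A case with no expected sources would divide by zero; treat it as full recall.
  if (expected.length === 0) return 1;
  const top = new Set(retrieved.slice(0, k));
  const hits = expected.filter(id => top.has(id)).length;
  return hits / expected.length;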
}

// Per-case measurement:
for (const testCase of evalDataset) {
  const retrieved = await hybridSearch(testCase.question);
  const recall5 = calculateRecallAtK(
    retrieved.map(r => r.id),
    testCase.expectedSources,
    5
  );
  console.log(`${testCase.id}: Recall@5 = ${recall5}`);
}

Target: Recall@5 > 0.85. Below that, the issue isn't generation — retrieval isn't finding relevant content.
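A single average can hide a category that retrieval handles badly. The sketch below aggregates per-case recall by the `category` field of the eval cases and lists the categories that miss the target. `RecallResult`, `meanRecallByCategory` and `categoriesBelowTarget` are hypothetical names; the 0.85 default is the target stated above.

```typescript
type RecallResult = { caseId: string; category: string; recallAtK: number };

// Mean recall per eval category.
function meanRecallByCategory(results: RecallResult[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of results) {
    const entry = sums.get(r.category) ?? { total: 0, count: 0 };
    entry.total += r.recallAtK;
    entry.count += 1;
    sums.set(r.category, entry);
  }
  const means = new Map<string, number>();
  sums.forEach((entry, category) => {
    means.set(category, entry.total / entry.count);
  });
  return means;
}

// Categories whose mean recall falls below the target.
function categoriesBelowTarget(results: RecallResult[], target = 0.85): string[] {
  const below: string[] = [];
  meanRecallByCategory(results).forEach((mean, category) => {
    if (mean < target) below.push(category);
  });
  return below;
}
```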

Generation metrics — LLM-as-a-judge

You evaluate answer quality with an LLM:

async function evaluateAnswer(
  question: string,
  expectedAnswer: string,
  actualAnswer: string,
  sources: string[]
): Promise<EvalScore> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "system",
      content: `Score the answer on (1-5):
        - faithfulness: relies only on sources?
        - relevance: answers the question?
        - completeness: contains all needed info?
        - correctness: matches expected answer?
        JSON: {faithfulness, relevance, completeness, correctness, reasoning}`
    }, {
      role: "user",
      content: `Question: ${question}
        Expected: ${expectedAnswer}
        Actual: ${actualAnswer}
        Sources: ${sources.join("\n")}`
    }],
    response_format: { type: "json_object" },
    temperature: 0
  });

  // content can be null; parse an empty object rather than throwing on null
  return JSON.parse(response.choices[0].message.content ?? "{}");
}

Industry tools: Ragas, TruLens, DeepEval. Don't roll your own unless you have to.

CI/CD integration

Run every RAG change (new chunking, new embedder, new prompt) through the eval suite:

npm run rag:eval
# Recall@5: 0.87 (was 0.84) ✓
# Faithfulness: 4.6/5 (was 4.5) ✓
# Cost per query: $0.012 (was $0.011) ⚠

Regressions surface immediately — not 2 weeks later in a customer complaint.
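The gate itself can be a plain comparison against the last accepted baseline. A minimal sketch: `EvalMetrics`, `GateTolerance` and `checkRegression` are hypothetical names, and the tolerance defaults are assumptions, not recommendations.

```typescript
type EvalMetrics = {
  recallAt5: number;       // 0..1
  faithfulness: number;    // 1..5, from the LLM judge
  costPerQueryUsd: number;
};

type GateTolerance = {
  recallDrop: number;        // allowed absolute drop in Recall@5
  faithfulnessDrop: number;  // allowed absolute drop on the 1-5 scale
  costIncreaseRatio: number; // allowed cost multiplier vs. baseline
};

type GateResult = { passed: boolean; failures: string[] };

function checkRegression(
  baseline: EvalMetrics,
  current: EvalMetrics,
  tolerance: GateTolerance = { recallDrop: 0.02, faithfulnessDrop: 0.1, costIncreaseRatio: 1.2 }
): GateResult {
  const failures: string[] = [];
  if (current.recallAt5 < baseline.recallAt5 - tolerance.recallDrop) {
    failures.push(`Recall@5 dropped: ${baseline.recallAt5} -> ${current.recallAt5}`);
  }
  if (current.faithfulness < baseline.faithfulness - tolerance.faithfulnessDrop) {
    failures.push(`Faithfulness dropped: ${baseline.faithfulness} -> ${current.faithfulness}`);
  }
  if (current.costPerQueryUsd > baseline.costPerQueryUsd * tolerance.costIncreaseRatio) {
    failures.push(`Cost per query rose: ${baseline.costPerQueryUsd} -> ${current.costPerQueryUsd}`);
  }
  return { passed: failures.length === 0, failures };
}
```

With the numbers from the sample output above (0.84 to 0.87, 4.5 to 4.6, $0.011 to $0.012) the gate passes; the cost increase stays under the 20% allowance. In CI, exit with a non-zero code when `passed` is false.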

Production gotchas

Stale data

The vector DB doesn't refresh itself. You need:

Webhook from the source system (CMS, Confluence, Drive) → re-embed
Scheduled re-index (daily / weekly)
Soft delete: retire a document's old chunks explicitly; overwriting alone leaves orphaned chunks in the index
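Whether triggered by a webhook or a schedule, the refresh comes down to diffing what the source system has against what the index has. A minimal sketch, assuming you store a content hash per document at ingestion time (`IndexedDoc`, `ReindexPlan` and `planReindex` are hypothetical names):

```typescript
type IndexedDoc = { documentId: string; contentHash: string };

type ReindexPlan = {
  toEmbed: string[];      // new or changed documents
  toSoftDelete: string[]; // documents that no longer exist in the source
  unchanged: string[];
};

function planReindex(indexed: IndexedDoc[], source: IndexedDoc[]): ReindexPlan {
  const indexedHashes = new Map<string, string>();
  for (const doc of indexed) indexedHashes.set(doc.documentId, doc.contentHash);

  const sourceIds = new Set<string>();
  const plan: ReindexPlan = { toEmbed: [], toSoftDelete: [], unchanged: [] };

  for (const doc of source) {
    sourceIds.add(doc.documentId);
    if (indexedHashes.get(doc.documentId) === doc.contentHash) {
      plan.unchanged.push(doc.documentId);
    } else {
      // New document, or the content hash changed since the last index run.
      plan.toEmbed.push(doc.documentId);
    }
  }
  for (const doc of indexed) {
    if (!sourceIds.has(doc.documentId)) plan.toSoftDelete.push(doc.documentId);
  }
  return plan;
}
```

For a changed document, retire its old chunks once the new ones are in place, so a shorter new version does not leave stale chunks behind.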

Multi-tenant isolation

Never let tenant A query tenant B's data:

// BAD: search across all tenants, then filter in application code
const allChunks = await vectorDb.search(queryEmbedding, { topK: 10 });
const filtered = allChunks.filter(c => c.tenantId === userTenantId);

// GOOD: filter at query level (in the DB)
const chunks = await vectorDb.search(queryEmbedding, {
  topK: 10,
  filter: { tenantId: userTenantId }
});

Filtering after the search is a data-leak risk, because other tenants' chunks have already left the database. And if the top-K all belong to a different tenant, you get an empty answer and don't know why.

Cost monitoring

In RAG, costs run away easily:

Embedding (one-off, but in bulk)
Vector DB hosting
LLM calls (multiplied by context size!)
Re-ranker calls

Rule of thumb: log token counts for every user query. Review the most expensive 1% of queries weekly — that's where 80% of the spend lives.
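A sketch of what that logging can feed into: a per-query cost from the logged token counts, and a helper that pulls out the most expensive slice for the weekly review. `QueryUsage`, `PriceSheet`, `queryCostUsd` and `mostExpensiveQueries` are hypothetical names; plug in your providers' actual prices.

```typescript
type QueryUsage = {
  queryId: string;
  promptTokens: number;     // includes the retrieved context
  completionTokens: number;
  rerankCalls: number;
};

type PriceSheet = {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
  perRerankCallUsd: number;
};

function queryCostUsd(usage: QueryUsage, prices: PriceSheet): number {
  return (
    (usage.promptTokens / 1e6) * prices.inputPerMillionUsd +
    (usage.completionTokens / 1e6) * prices.outputPerMillionUsd +
    usage.rerankCalls * prices.perRerankCallUsd
  );
}

// Returns the top `fraction` of queries by cost (at least one, if any exist).
function mostExpensiveQueries(
  usages: QueryUsage[],
  prices: PriceSheet,
  fraction = 0.01
): QueryUsage[] {
  const sorted = [...usages].sort(
    (a, b) => queryCostUsd(b, prices) - queryCostUsd(a, prices)
  );
  const count = Math.max(1, Math.ceil(sorted.length * fraction));
  return sorted.slice(0, count);
}
```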

Latency budget

Step	Typical latency
Query embedding	50-100ms
Vector search (10K chunks)	20-50ms
Vector search (10M chunks)	100-300ms
BM25	30-100ms
Re-ranking (top-20)	100-300ms
LLM generation (streamed)	500ms - 3s

Total: 1-4s P95. If that doesn't fit, stream (SSE) to the frontend.
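The total follows from adding up the table, with one assumption worth making explicit: BM25 does not need the query embedding, so it can run in parallel with the embedding + vector search path. A minimal sketch (`StepLatenciesMs` and `totalLatencyMs` are hypothetical names):

```typescript
type StepLatenciesMs = {
  queryEmbedding: number;
  vectorSearch: number;
  bm25: number;
  rerank: number;
  llmGeneration: number;
};

// End-to-end latency, assuming BM25 runs in parallel with
// (query embedding -> vector search) and everything else is sequential.
function totalLatencyMs(s: StepLatenciesMs): number {
  const retrieval = Math.max(s.queryEmbedding + s.vectorSearch, s.bm25);
  return retrieval + s.rerank + s.llmGeneration;
}

// Lower and upper ends of the ranges in the table above.
const bestCase: StepLatenciesMs = {
  queryEmbedding: 50, vectorSearch: 20, bm25: 30, rerank: 100, llmGeneration: 500,
};
const worstCase: StepLatenciesMs = {
  queryEmbedding: 100, vectorSearch: 300, bm25: 100, rerank: 300, llmGeneration: 3000,
};
```

With these inputs the best case comes to 670ms and the worst case (10M chunks) to 3,700ms, which is consistent with the 1-4s figure. LLM generation dominates either way, which is why streaming helps perceived latency the most.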

When NOT to use RAG

RAG isn't good for everything. Don't use it for:

Creative tasks (marketing copy, brainstorming) — factuality isn't the point
Consistent-style output (e.g. brand voice) — that needs fine-tuning
Real-time aggregation ("how much revenue did we make today?") — needs DB query / SQL agent
Complex multi-step workflows — needs an agent, not RAG
Structured data (tables, lists) — text-to-SQL is better

RAG works best in the sweet spot where:

You have well-structured textual documents
The user asks (doesn't create)
The answer needs to be source-backed

Summary: 8 takeaways

Naive RAG is fine for a demo, not for production — chunking, hybrid search, re-ranking and eval are all required.
Chunking is the foundation — cut on semantic boundaries, not fixed character counts. 400-600 tokens is the default.
Metadata > content — citation, filtering, permissions, freshness all hinge on it.
Hybrid search (vector + BM25) + re-ranker = ~85% top-5 relevance. That's the bar.
Query rewriting is mandatory for context-aware conversations — without it, retrieval dies by message 3.
The prompt is a contract — permit "no information", require citations, validate source IDs.
An eval suite is mandatory — 20-100 questions, Recall@K, LLM-as-a-judge. Without it, you're blind.
Enforce multi-tenant isolation in the DB query, not in a post-search filter. This isn't optimization; it's security.

A good RAG system isn't clever; it's disciplined. Measure every layer, log every error, gate every change behind eval. Do that and an 80%-accurate demo RAG becomes a 95-98%-accurate production RAG.

The last 2-5% is always human-in-the-loop. Accept that.
