Whitepaper · Hallucination · RAG · LLM · Prompt engineering · Structured output · Tool use · Validation · Human-in-the-loop

AI hallucination mitigation — why it happens and how to handle it

Ádám Zsolt & Airon
11 min read

AI doesn't lie. It just doesn't know that it doesn't know. And that's much more dangerous.

What is hallucination, really?

AI hallucination happens when a language model (LLM) confidently states something that isn't true. It doesn't throw an error or a warning — it simply invents an answer that is grammatically perfect, stylistically convincing and factually wrong.

Real-world examples:

  • A lawyer used ChatGPT to draft a court filing → cited 6 case precedents that don't exist (Mata v. Avianca, 2023)
  • A healthcare chatbot confidently recommended an incorrect drug dosage
  • An enterprise AI assistant cited a company policy that never existed

Hallucination is not a bug — it's a natural consequence of how the model works. If you don't understand why it happens, you won't be able to manage it.

Why do models hallucinate?

Language models don't "know" — they predict

An LLM is not a knowledge database. It's a probabilistic text continuation engine: given a context, it assigns a probability to every possible next token and picks a likely one. If you ask "Who wrote War and Peace?", it doesn't look the answer up in a list; it reconstructs from statistical patterns what the most likely answer to such a question would be.

If the fact being asked about appeared often in the training data → usually a correct answer. If it appeared rarely or never → the model fills the gap with plausible but false content.

The four main causes

Missing or conflicting training data: The model never encountered the answer, or it saw conflicting information (e.g. a book attributed to two different authors). It still answers, because producing a continuation is its job.

Temperature too high: The temperature parameter controls how readily sampling deviates from the most likely token. A higher value gives more varied, creative output, but also more hallucination.

Long context degradation: In very long contexts (100K+ tokens), the model uses information less reliably. The "lost in the middle" effect: details at the beginning and end of the context tend to be picked up, while details in the middle are more often missed or misread.

Prompt ambiguity: If the question isn't clear, the model "picks" an interpretation. If it picks the wrong one → confident but irrelevant answer.
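To make the temperature effect concrete, here is a minimal sketch of how temperature reshapes a next-token distribution. The candidate tokens and their logits are invented for illustration; a real model scores tens of thousands of tokens, and providers may implement sampling differently.

```typescript
// Convert raw model scores (logits) into probabilities at a given temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new Error("temperature must be > 0 (treat 0 as greedy argmax instead)");
  }
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Hypothetical logits for three candidate continuations of
// "War and Peace was written by ..."
const candidates = ["Tolstoy", "Dostoevsky", "Chekhov"];
const logits = [4.0, 2.0, 1.0];

const cold = softmaxWithTemperature(logits, 0.2); // almost all mass on "Tolstoy"
const hot = softmaxWithTemperature(logits, 1.5);  // wrong authors get real probability

candidates.forEach((token, i) => {
  console.log(token, cold[i].toFixed(4), hot[i].toFixed(4));
});
```

With these made-up numbers, the low-temperature distribution puts more than 99.9% of the probability on the top candidate, while at 1.5 the two wrong candidates together receive more than a quarter of it. That is the mechanism behind "more creative, but more prone to hallucination".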

The "illusion of knowledge"

The most dangerous part: the LLM doesn't know that it doesn't know. It has no internal confidence meter for factuality. So:

  • ❌ It doesn't say "I don't know" on its own (unless trained to)
  • ❌ It doesn't signal uncertainty
  • ❌ It doesn't distinguish memorized facts from invented composites

This is a technical limit — not bad intent.

The 5 types of hallucination

Not every hallucination is the same, so neither is the mitigation.

| Type | What happens? | Example | How to handle |
|---|---|---|---|
| Factual | A specific fact is wrong | "Budapest is the capital of Poland" | RAG, validation |
| Source fabrication | Invented citation | Non-existent book / case | Mandatory source links |
| Logical | Wrong inference from correct data | Math error | Chain-of-thought, calculator tool |
| Instruction | Doesn't do what you asked | "Return JSON only" → starts with prose | Structured output, Zod / Pydantic |
| Context drift | Misquotes earlier conversation | "As you said, X..." (you didn't) | Shorter context, summary |

Practical mitigation techniques

RAG — Retrieval Augmented Generation

The most common and most effective mitigation: don't let the model "remember" — give it sources.

How it works:

User: "What does the 2024 leave policy say about home office?"
   ↓
Vector DB query (based on embedding of the question)
   ↓
Top-5 relevant document chunks returned
   ↓
Prompt: "Based on these documents, answer: [chunks] ... Question: ..."
   ↓
LLM answer with source citation

Code example (simplified):

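import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// `vectorDb` is assumed to be your vector store client, exposing search({ embedding, topK, minScore })

async function ragQuery(question: string) {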
  // 1. Embed the question
  const queryEmbedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question
  });

  // 2. Find relevant chunks
  const relevantDocs = await vectorDb.search({
    embedding: queryEmbedding.data[0].embedding,
    topK: 5,
    minScore: 0.75 // low score → don't even answer
  });

  // 3. If nothing relevant → don't hallucinate
  if (relevantDocs.length === 0) {
    return "I couldn't find an answer in the documentation.";
  }

  // 4. Structured prompt with sources
  const context = relevantDocs
    .map((d, i) => `[Source ${i+1}: ${d.source}]\n${d.content}`)
    .join("\n\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "system",
      content: `Answer only based on the provided sources.
        If the sources don't contain the answer, say: "I have no information on this".
        After every claim, cite the source as [Source N].`
    }, {
      role: "user",
      content: `Sources:\n${context}\n\nQuestion: ${question}`
    }],
    temperature: 0.1
  });

  return response.choices[0].message.content;
}

Best practices:

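  • Minimum score threshold: if top-1 relevance is below the threshold, don't answer (0.75 is a starting point; calibrate it for your embedding model and data)
  • Mandatory source citation in the system prompt
  • Chunk size: 200-500 tokens is a reasonable starting range (not too short, not too long); tune it on your own documents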
  • Hybrid search: vector + keyword combined (BM25 + cosine)
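One way to combine the keyword and vector rankings is reciprocal rank fusion (RRF). The article does not say which fusion method it has in mind, so treat this as one possible choice rather than the recommended one. The `RankedHit` shape, the document ids and the scores are all hypothetical; `k = 60` is a commonly used default.

```typescript
interface RankedHit {
  id: string;
  score: number; // backend-specific: a BM25 score or a cosine similarity
}

// Reciprocal rank fusion: each result list contributes 1 / (k + rank) per document.
// Only ranks are used, so BM25 and cosine scores don't need to share a scale.
function reciprocalRankFusion(lists: RankedHit[][], k = 60): RankedHit[] {
  const fused = new Map<string, number>();
  for (const list of lists) {
    const sorted = [...list].sort((a, b) => b.score - a.score);
    sorted.forEach((hit, index) => {
      const rank = index + 1;
      fused.set(hit.id, (fused.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(fused.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical results for the leave-policy question
const keywordHits: RankedHit[] = [
  { id: "leave-policy-2024", score: 12.4 },
  { id: "travel-policy", score: 7.1 },
];
const vectorHits: RankedHit[] = [
  { id: "remote-work-faq", score: 0.83 },
  { id: "leave-policy-2024", score: 0.81 },
];

const fusedHits = reciprocalRankFusion([keywordHits, vectorHits]);
console.log(fusedHits.map((hit) => hit.id));
```

A document that ranks well in both lists ("leave-policy-2024" here) ends up ahead of documents that appear in only one. Note that the fused score is no longer a similarity, so a minimum relevance threshold still has to be applied to the vector scores before fusion.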

Structured output — force the model into a shape

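When the model has to return a concrete structure, instruction-type hallucinations (prose instead of JSON, invented fields) largely disappear. The values inside the structure can still be wrong, so a schema complements content validation rather than replacing it.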

Example with Zod + OpenAI structured output:

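import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const InvoiceSchema = z.object({
  invoiceNumber: z.string(),
  totalAmount: z.number(),
  currency: z.enum(["HUF", "EUR", "USD"]),
  items: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unitPrice: z.number()
  })),
  // Critical: signal uncertainty
  confidence: z.enum(["high", "medium", "low"]),
  // Structured Outputs expects every field to be present, so use nullable instead of optional
  uncertainFields: z.array(z.string()).nullable()
});

// invoiceText: the raw invoice text to extract from (assumed to be defined elsewhere).
// On older SDK versions this helper lives at openai.beta.chat.completions.parse.
const response = await openai.chat.completions.parse({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "Extract the invoice data. Only what you can clearly see." },
    { role: "user", content: invoiceText }
  ],
  response_format: zodResponseFormat(InvoiceSchema, "invoice")
});

// The schema-validated object, or null if the model refused
const invoice = response.choices[0].message.parsed;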

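The confidence and uncertainFields fields give the model a place to flag uncertainty. The self-reported value is not a calibrated probability, so use it as a triage signal: if confidence === "low" or uncertainFields is non-empty → manual review.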
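The routing step can be sketched as follows. Because self-reported confidence is only a hint, the sketch adds an independent business-logic check: the line items should add up to the extracted total. That check assumes invoices without tax or discount lines, which will not hold for every format. The `ExtractedInvoice` type mirrors the schema above, and `routeInvoice` is a hypothetical helper.

```typescript
interface InvoiceItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface ExtractedInvoice {
  invoiceNumber: string;
  totalAmount: number;
  currency: "HUF" | "EUR" | "USD";
  items: InvoiceItem[];
  confidence: "high" | "medium" | "low";
  uncertainFields?: string[] | null;
}

type Route =
  | { action: "accept" }
  | { action: "manual_review"; reasons: string[] };

function routeInvoice(invoice: ExtractedInvoice, tolerance = 0.01): Route {
  const reasons: string[] = [];

  if (invoice.confidence === "low") {
    reasons.push("model reported low confidence");
  }
  const uncertain = invoice.uncertainFields ?? [];
  if (uncertain.length > 0) {
    reasons.push(`uncertain fields: ${uncertain.join(", ")}`);
  }

  // Independent check that does not rely on the model's self-assessment
  // (assumes the total is the plain sum of the line items).
  const itemsTotal = invoice.items.reduce(
    (sum, item) => sum + item.quantity * item.unitPrice,
    0,
  );
  if (Math.abs(itemsTotal - invoice.totalAmount) > tolerance) {
    reasons.push(`items sum to ${itemsTotal}, total says ${invoice.totalAmount}`);
  }

  return reasons.length === 0
    ? { action: "accept" }
    : { action: "manual_review", reasons };
}

// The model claims high confidence, but the numbers do not add up.
const suspicious = routeInvoice({
  invoiceNumber: "INV-001",
  totalAmount: 150,
  currency: "EUR",
  items: [{ description: "Consulting", quantity: 2, unitPrice: 50 }],
  confidence: "high",
});
```

In this example the invoice is sent to manual review even though the model reported high confidence, which is the point of checking something the model cannot talk its way around.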

Chain-of-Thought and self-critique

Chain-of-Thought (CoT): ask the model to think step by step.

Bad prompt: "How many apples are left if you sell 3 of 10 twice?"
Good prompt: "Think step by step. 1) How many apples to start?
              2) How many after the first sale? 3) How many after the second?"

Self-critique: 2-step generation:

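// generate(prompt) is assumed to be a helper that wraps one chat completion call and returns the text

// Step 1: answer
const answer = await generate(question);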

// Step 2: self-critique
const critique = await generate(`
  The following is the answer to the question:
  Question: ${question}
  Answer: ${answer}

  Examine it critically:
  1. Are there any unsupported claims?
  2. Are there logical errors?
  3. Are there fabricated facts or citations?

  Return a corrected answer with only verified information.
`);

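Self-critique can cut hallucination noticeably (figures of 30-50% are often quoted, but the effect depends heavily on task and model, so measure it on your own evaluation set), at the cost of at least 2x token spend, since the second call re-sends the question and the answer.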

Tool use — calculator, search, database

The model is bad at math, bad at dates, bad at real-time data. Give it tools.

const tools = [
  {
    type: "function",
    function: {
      name: "calculate",
      description: "Evaluate a math expression",
      parameters: { /* ... */ }
    }
  },
  {
    type: "function",
    function: {
      name: "search_database",
      description: "Database search for customer data",
      parameters: { /* ... */ }
    }
  },
  {
    type: "function",
    function: {
      name: "web_search",
      description: "Search for up-to-date information",
      parameters: { /* ... */ }
    }
  }
];

The model issues a tool_call → you execute it → result returns to the model. The data is real, the model only interprets.
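The "you execute it" step can be sketched like this. The `ToolCall` shape is a simplified stand-in for a function tool call (a name plus JSON-encoded arguments), and the `expression` argument is an assumption, since the tool parameters above are elided. The calculator is a small hand-written parser so that model-supplied text is never passed to eval().

```typescript
// Simplified stand-in for a function tool call: a name plus JSON-encoded arguments.
interface ToolCall {
  name: string;
  arguments: string;
}

// Minimal arithmetic evaluator: numbers, + - * /, parentheses, unary minus.
function calculate(expression: string): number {
  const tokens: string[] = expression.match(/\d+(\.\d+)?|[-+*/()]/g) ?? [];
  if (tokens.join("") !== expression.replace(/\s+/g, "")) {
    throw new Error(`Unsupported characters in expression: ${expression}`);
  }
  let pos = 0;

  function parseFactor(): number {
    const token = tokens[pos++];
    if (token === "(") {
      const inner = parseExpression();
      if (tokens[pos++] !== ")") throw new Error("Missing closing parenthesis");
      return inner;
    }
    if (token === "-") return -parseFactor();
    const value = Number(token);
    if (token === undefined || Number.isNaN(value)) throw new Error("Expected a number");
    return value;
  }

  function parseTerm(): number {
    let value = parseFactor();
    while (tokens[pos] === "*" || tokens[pos] === "/") {
      const op = tokens[pos++];
      const right = parseFactor();
      value = op === "*" ? value * right : value / right;
    }
    return value;
  }

  function parseExpression(): number {
    let value = parseTerm();
    while (tokens[pos] === "+" || tokens[pos] === "-") {
      const op = tokens[pos++];
      const right = parseTerm();
      value = op === "+" ? value + right : value - right;
    }
    return value;
  }

  const result = parseExpression();
  if (pos !== tokens.length) throw new Error("Unexpected trailing input");
  return result;
}

// Runs one tool call and returns the string that goes back to the model.
function executeToolCall(call: ToolCall): string {
  const args = JSON.parse(call.arguments) as { expression?: unknown };
  switch (call.name) {
    case "calculate":
      if (typeof args.expression !== "string") {
        return JSON.stringify({ error: "expression must be a string" });
      }
      return JSON.stringify({ result: calculate(args.expression) });
    default:
      // Unknown tool names are reported back instead of guessed at.
      return JSON.stringify({ error: `Unknown tool: ${call.name}` });
  }
}
```

For the apple question from the chain-of-thought section, the model would request calculate with the expression "10 - 3 - 3" and receive the result from your code instead of producing the arithmetic itself.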

Temperature and sampling

For factual tasks:

{
  temperature: 0.1,    // low creativity
  top_p: 0.95,         // nucleus sampling: keep the top 95% of probability mass
  presence_penalty: 0,
  frequency_penalty: 0
}

For creative tasks (marketing copy, brainstorm):

{
  temperature: 0.8,
  top_p: 0.95
}

Never use high temperature for factual answers.

Detection — how do you spot a hallucination?

Automated validation

Source check: if the model cites a source, automatically verify it exists:

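interface Source {
  source: string;
  content: string;
}

// Checks that every [Source N] marker points at a source that was actually provided.
// It does not check that the cited source supports the claim.
async function validateCitations(answer: string, sources: Source[]) {
  const citationPattern = /\[Source (\d+)\]/g;
  const citations = [...answer.matchAll(citationPattern)];

  for (const match of citations) {
    const sourceIndex = parseInt(match[1], 10) - 1;
    // [Source 0] is as invalid as an index past the end of the list
    if (sourceIndex < 0 || sourceIndex >= sources.length) {
      throw new Error(`Hallucinated citation: ${match[0]}`);
    }
  }
}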
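The check above catches citations that point nowhere. It does not catch claims that carry no citation at all, although the system prompt asks for one after every claim. A minimal sketch of that complementary check follows. The sentence splitting is deliberately naive (it breaks on decimals and abbreviations), and `findUncitedSentences` is a hypothetical helper name.

```typescript
// Returns the sentences that carry no [Source N] marker.
// A marker placed right after the closing punctuation still counts for that sentence.
function findUncitedSentences(answer: string): string[] {
  const sentences: string[] =
    answer.match(/[^.!?]+(?:[.!?]+|$)(?:\s*\[Source \d+\])*/g) ?? [];
  return sentences
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0 && !/\[Source \d+\]/.test(sentence));
}

const sampleAnswer =
  "Home office is allowed two days a week [Source 1]. " +
  "Requests need manager approval. " +
  "Unused leave carries over until March [Source 2].";

const uncited = findUncitedSentences(sampleAnswer);
console.log(uncited);
```

Uncited sentences are not necessarily wrong, but in a RAG setup they are the ones the model most likely produced from memory, so they are good candidates for a warning, a re-ask, or removal.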

Schema validation: if you expect JSON, validate it:

try {
  const parsed = InvoiceSchema.parse(JSON.parse(response));
} catch (e) {
  // Hallucinated / invalid structure
  retry();
}
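The retry() call above is left open. One way to fill it in is a bounded retry loop that feeds the validation error back into the next attempt and gives up after a few tries. This is a sketch under assumptions: `generate` stands for whatever function calls your model (it receives the previous error as optional feedback), and the validator here checks a single field by hand where a real system would use the full schema.

```typescript
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

async function generateValidated<T>(
  generate: (feedback?: string) => Promise<string>,
  validate: (raw: string) => Validation<T>,
  maxAttempts = 3,
): Promise<T | null> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(feedback);
    const result = validate(raw);
    if (result.ok) return result.value;
    // Pass the validation error on so the next attempt can correct it.
    feedback = result.error;
  }
  // Out of attempts: the caller escalates (e.g. manual review) instead of looping forever.
  return null;
}

// Hand-rolled validator for illustration; swap in your schema's parser.
function validateInvoiceJson(raw: string): Validation<{ invoiceNumber: string }> {
  try {
    const data = JSON.parse(raw);
    if (typeof data?.invoiceNumber !== "string") {
      return { ok: false, error: "invoiceNumber must be a string" };
    }
    return { ok: true, value: { invoiceNumber: data.invoiceNumber } };
  } catch {
    return { ok: false, error: "response was not valid JSON" };
  }
}
```

The attempt limit matters: an unbounded retry on a prompt the model cannot satisfy burns tokens without ever producing a valid answer.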

Confidence measurement

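Logprobs: the logprobs parameter returns the log-probability the model assigned to each generated token. This measures how confident the model was in its wording, not whether the statement is true, so treat it as a proxy.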

const response = await openai.chat.completions.create({
  // ...
  logprobs: true,
  top_logprobs: 5
});

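// logprobs can be null, so fall back to an empty list
const tokenLogprobs = response.choices[0].logprobs?.content ?? [];
const avgLogprob = tokenLogprobs.length > 0
  ? tokenLogprobs.reduce((sum, t) => sum + t.logprob, 0) / tokenLogprobs.length
  : -Infinity; // no logprobs → treat as low confidence

if (avgLogprob < -1.5) { // illustrative threshold; calibrate on your own data
  // Low confidence → manual review or re-ask
}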
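To read these numbers: exp(average logprob) is the geometric mean of the per-token probabilities, so an average of -1.5 corresponds to roughly exp(-1.5) ≈ 0.22 per token. An average can also hide a single weak token, which is often the one carrying the fact. The sketch below reports both. The `TokenLogprob` shape keeps only the two fields used here, and the sample values are invented.

```typescript
interface TokenLogprob {
  token: string;
  logprob: number;
}

function summarizeConfidence(tokens: TokenLogprob[]) {
  if (tokens.length === 0) return null;
  const avgLogprob = tokens.reduce((sum, t) => sum + t.logprob, 0) / tokens.length;
  const weakest = tokens.reduce((min, t) => (t.logprob < min.logprob ? t : min));
  return {
    avgLogprob,
    // exp(average logprob) = geometric mean of the per-token probabilities
    meanTokenProbability: Math.exp(avgLogprob),
    weakestToken: weakest.token,
    weakestTokenProbability: Math.exp(weakest.logprob),
  };
}

// Invented values: fluent filler tokens, one shaky fact token.
const confidenceSummary = summarizeConfidence([
  { token: "The", logprob: -0.05 },
  { token: " capital", logprob: -0.1 },
  { token: " is", logprob: -0.02 },
  { token: " Warsaw", logprob: -2.3 },
]);
console.log(confidenceSummary);
```

In this made-up example the average logprob is about -0.62, comfortably above a -1.5 threshold, while the one token that states the fact has a probability of only about 0.10. Checking the weakest token (or the tokens inside named entities and numbers) alongside the average gives a more useful signal.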

LLM-as-a-judge

Have another LLM (or the same one in a separate call) judge the answer:

Prompt: "For the question-answer pair below:
        - Is the answer factually correct? (1-5)
        - Does it contain unsupported claims? (yes/no)
        - Are there contradictions? (yes/no)
        Respond as JSON."

Not perfect (the judge can hallucinate too), but it catches a lot.
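Since the judge can hallucinate too, its reply deserves the same distrust as any other model output. Below is a minimal sketch of parsing the verdict and failing closed when it is malformed. The field names (`factualScore`, `unsupportedClaims`, `contradictions`) are assumptions: the prompt above only asks for JSON, so the prompt would have to name these keys and ask for true/false values explicitly.

```typescript
interface JudgeVerdict {
  factualScore: number; // 1-5
  unsupportedClaims: boolean;
  contradictions: boolean;
}

function parseJudgeVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const fields = data as Record<string, unknown>;
  const score = fields["factualScore"];
  const unsupported = fields["unsupportedClaims"];
  const contradictions = fields["contradictions"];

  if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) {
    return null;
  }
  if (typeof unsupported !== "boolean" || typeof contradictions !== "boolean") {
    return null;
  }
  return { factualScore: score, unsupportedClaims: unsupported, contradictions };
}

function passesJudge(raw: string, minScore = 4): boolean {
  const verdict = parseJudgeVerdict(raw);
  // Fail closed: an unparseable or out-of-range verdict counts as a failed check.
  if (verdict === null) return false;
  return (
    verdict.factualScore >= minScore &&
    !verdict.unsupportedClaims &&
    !verdict.contradictions
  );
}
```

The minimum score of 4 is an arbitrary starting point; where to set it depends on the cost of an error in your use case.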

Production checklist

Before shipping the AI feature, check:

  • Is RAG in place where factual answers are required?
  • Is a minimum relevance score configured?
  • Does the system prompt instruct the model to say "I don't know"?
  • Is source citation mandatory?
  • Is structured output used where structure matters?
  • Is temperature low (0.0-0.3) for factual cases?
  • Are tools used where math, dates, or real-time data matter?
  • Is validation running on the output (schema, citation, business logic)?
  • Monitoring: are low-confidence cases logged?
  • Human-in-the-loop on critical decisions (medicine, law, finance)?
  • Disclaimer: does the user know the answer was generated by AI?

Business risk management

Beyond the technical mitigation, business decisions matter too:

Risk zones

| Use case | Risk | Strategy |
|---|---|---|
| Marketing copy generation | Low | LLM autonomous, human review before publish |
| Internal customer-info chatbot | Medium | RAG + source citation + "uncertain → handoff to human" |
| Legal / medical advice | High | Only with a human expert, never autonomous |
| Financial transaction decision | Critical | AI suggests, human decides, audit log |

The "90% accuracy" trap

If the AI is right 90% of the time, that can be excellent or catastrophic, depending on the use case. For a customer-service chatbot, 10% errors may be tolerable. For a drug dosage suggestion, they never are.

The question is: what is the cost of an error?

  • If low → you can grant autonomy
  • If high → human-in-the-loop is mandatory
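The same question as back-of-the-envelope arithmetic: expected error cost = volume × error rate × cost per error. Every number below is invented for illustration, and real error costs (especially harm to a patient) rarely reduce to a single figure, so read this as a way to compare orders of magnitude rather than a decision rule.

```typescript
interface UseCase {
  label: string;
  requestsPerMonth: number;
  errorRate: number;    // fraction of wrong answers, e.g. 0.1 for "90% accuracy"
  costPerError: number; // estimated cost of one wrong answer, in your currency
}

function expectedMonthlyErrorCost(useCase: UseCase): number {
  return useCase.requestsPerMonth * useCase.errorRate * useCase.costPerError;
}

// Hypothetical rule: require a human in the loop once the expected
// error cost exceeds what the business is willing to absorb.
function needsHumanInTheLoop(useCase: UseCase, monthlyBudget: number): boolean {
  return expectedMonthlyErrorCost(useCase) > monthlyBudget;
}

const supportBot: UseCase = {
  label: "Customer-service chatbot",
  requestsPerMonth: 10000,
  errorRate: 0.1,
  costPerError: 2,
};
const dosageAssistant: UseCase = {
  label: "Drug dosage suggestion",
  requestsPerMonth: 1000,
  errorRate: 0.1,
  costPerError: 50000,
};

for (const useCase of [supportBot, dosageAssistant]) {
  console.log(
    useCase.label,
    Math.round(expectedMonthlyErrorCost(useCase)),
    needsHumanInTheLoop(useCase, 5000),
  );
}
```

Both systems are "90% accurate", yet with these made-up figures the expected monthly error cost is about 2,000 for the chatbot and about 5,000,000 for the dosage assistant. The accuracy number alone says nothing until it is multiplied by the cost of being wrong.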

Summary: 7 takeaways

  1. Hallucination is not a bug — it's a natural consequence of the architecture. The model predicts, it doesn't know.
  2. 5 types: factual, source fabrication, logical, instruction, context drift. Each needs different mitigation.
  3. RAG is the most effective — give the model sources, don't let it remember. Minimum score, hybrid search, mandatory citation.
  4. Structured output — when forced into a shape, the model hallucinates less. Zod / Pydantic schema, confidence field.
  5. Tool use for math, dates, real data. Never let the model do arithmetic on its own.
  6. Temperature 0.0-0.3 for factual cases. Creativity and factuality pull in opposite directions.
  7. Human-in-the-loop for critical decisions. The 90% accuracy trap — the cost of the 10% decides.

Hallucination can't be fully eliminated, but with the right architecture it can be pushed down to a low level (on the order of 1-2% in well-scoped use cases; measure it on your own evaluation set). The difference between "an AI feature demo" and "an enterprise-ready AI system" is not the model, it's the validation layer built around it.

The model is a creative child. You are the responsible adult next to it.

Building a hallucination-resistant AI system?

In a 60-minute consultation we review your use case and risk level, and outline a RAG + validation architecture that's defensible in your context.
