
AI observability — LLM monitoring, tracing, debugging

Ádám Zsolt & Airon
14 min read

In classic software, when something breaks, it throws an error. With LLMs, when something breaks, the model confidently gives you a wrong answer. You can only catch that with observability.

Why isn't Datadog enough?

Classic software observability rests on three pillars: logs, metrics, traces. That's what the APM industry was built on (Datadog, New Relic, Dynatrace), and the questions are always the same: "where is it slow?", "where does it fail?", "where does the request break?".

In LLM-based systems those pillars stay — but they aren't enough. Because the LLM:

  • doesn't throw an exception when the answer is wrong,
  • doesn't return HTTP 500 when it hallucinates,
  • doesn't write to the log that "I wasn't confident",
  • and sometimes answers correctly, sometimes not — for the exact same prompt.

Classic monitoring happily sees that the request finished in 1.8s with 200 OK, consuming 4500 tokens. What it doesn't see is that the response was nonsensical, hallucinated, or had lost the conversation context.

That's what LLM observability is for — a new layer on top of the classic three pillars that also measures the quality of the answer and the behavior of the model. This article is about what to measure, with which tools, and what strategic decisions need to be made when you build observability for a production LLM system.


The five dimensions of LLM observability

The classic 3 pillars (logs, metrics, traces) need to be expanded to five:

Dimension     | What does it measure?       | Example tools
Logs          | What happened?              | Datadog, ELK
Metrics       | How much? How often?        | Prometheus, Grafana
Traces        | How did the request flow?   | OpenTelemetry, Jaeger
Evaluations   | Is the answer good?         | Langfuse, Helicone, Arize, Phoenix
User feedback | Is it useful to the user?   | Built-in thumbs up/down, NPS

The last two are the real novelty — and that's exactly what's missing at most teams.


Tracing: every LLM call becomes visible

The "LLM call tree"

A modern AI system can make 20–50 LLM or tool calls under a single user request. A simplified example with six calls:

User: "Book a table for 4 tomorrow evening"
  ↓ (Trace ID: abc123)
  ├─ [LLM] Query understanding (gpt-4o-mini, 320ms, 145 tokens)
  ├─ [Tool] check_user_history (DB, 25ms)
  ├─ [LLM] Plan generation (gpt-4o, 850ms, 1200 tokens)
  ├─ [Tool] search_availability (API, 180ms)
  ├─ [LLM] Reply generation (gpt-4o, 1.2s, 800 tokens)
  └─ [Tool] send_confirmation (SMS, 95ms)
Total: 2.67s, 2145 tokens, $0.034

If you log only the final answer, you have no idea where things went wrong. If you trace every step, you can see the cause concretely: for example, that search_availability waited 5 seconds for a slow upstream API, which is why the whole request took 7 seconds, rather than the model being slow.
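Because every step carries its own duration and token count, finding the slow step can be automated. A minimal sketch, assuming a hypothetical `Step` record and `summarizeTrace` helper (neither comes from a tracing library) and steps that run sequentially; the numbers are the ones from the call tree above:

```typescript
// Hypothetical span record; real tracing backends expose richer objects.
interface Step {
  kind: "llm" | "tool";
  name: string;
  durationMs: number;
  tokens?: number; // only LLM steps consume tokens
}

interface TraceSummary {
  totalMs: number;
  totalTokens: number;
  slowest: Step;
}

// Sum latency and tokens, and pick the slowest step of a sequential trace.
function summarizeTrace(steps: Step[]): TraceSummary {
  if (steps.length === 0) throw new Error("trace has no steps");
  let totalMs = 0;
  let totalTokens = 0;
  let slowest = steps[0];
  for (const step of steps) {
    totalMs += step.durationMs;
    totalTokens += step.tokens ?? 0;
    if (step.durationMs > slowest.durationMs) slowest = step;
  }
  return { totalMs, totalTokens, slowest };
}

const bookingTrace: Step[] = [
  { kind: "llm", name: "query_understanding", durationMs: 320, tokens: 145 },
  { kind: "tool", name: "check_user_history", durationMs: 25 },
  { kind: "llm", name: "plan_generation", durationMs: 850, tokens: 1200 },
  { kind: "tool", name: "search_availability", durationMs: 180 },
  { kind: "llm", name: "reply_generation", durationMs: 1200, tokens: 800 },
  { kind: "tool", name: "send_confirmation", durationMs: 95 },
];

const summary = summarizeTrace(bookingTrace);
console.log(summary.totalMs, summary.totalTokens, summary.slowest.name);
```

On the healthy trace the slowest step is the reply generation; on the degraded one the same function would point at the tool call instead.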

OpenTelemetry + LLM-specific attributes

Best practice: use the OpenTelemetry standard with LLM-specific attributes. That way your traces aren't stored in a custom in-house format but can be read by any OTel-compatible tool.

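import { trace, SpanStatusCode } from "@opentelemetry/api";
import OpenAI from "openai";

// Message, countTokens and calculateCost are application-level helpers defined elsewhere.
const tracer = trace.getTracer("ai-agent");
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function callLLM(messages: Message[], model: string) {
  return tracer.startActiveSpan("llm.completion", async (span) => {
    // Request-side attributes, recorded before the call so they survive a failure.
    span.setAttributes({
      "llm.vendor": "openai",
      "llm.model": model,
      "llm.input.messages_count": messages.length,
      "llm.input.tokens": countTokens(messages),
      "llm.request.temperature": 0.2,
      "llm.request.max_tokens": 2000,
    });

    try {
      const response = await openai.chat.completions.create({ /* ... */ });

      // Response-side attributes: token usage, finish reason and derived cost.
      span.setAttributes({
        "llm.output.tokens": response.usage.completion_tokens,
        "llm.output.total_tokens": response.usage.total_tokens,
        "llm.output.finish_reason": response.choices[0].finish_reason,
        "llm.cost.usd": calculateCost(model, response.usage),
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (e) {
      // A caught value is `unknown` in strict TypeScript, so normalize it first.
      const err = e instanceof Error ? e : new Error(String(e));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw e;
    } finally {
      // startActiveSpan does not end the span for you.
      span.end();
    }
  });
}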

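The OpenTelemetry GenAI semantic conventions that took shape in 2024–25 give you standardized attribute names under the gen_ai.* prefix (for example gen_ai.request.model and gen_ai.usage.input_tokens); the llm.* names in the example above are illustrative. The conventions are worth following because vendor tooling builds on top of them, and that way you don't get locked into a proprietary format.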


The 12 KPIs you need to track continuously

You don't have to introduce all of them at once — but over the long run these are the numbers that tell you whether the system is alive or dying.

Performance KPIs

1. Latency (P50, P95, P99): per LLM call and per end-to-end user request. A wide gap between P50 and P95 means a minority of requests are far slower than the typical one, even when the median looks fine.

2. Time to First Token (TTFT) — critical for streaming. Under 600 ms feels "instant", over 1.5s already feels "slow".

3. Tokens per second (TPS) — the speed of generation after TTFT. 50+ TPS is a comfortable reading experience.
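The three performance numbers can be derived from a handful of timestamps. A minimal sketch, assuming a hypothetical `StreamTiming` record captured around each streamed completion; the percentile uses the nearest-rank definition, which is only one of several in common use:

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical timing record for one streamed completion (milliseconds).
interface StreamTiming {
  requestSentAt: number;
  firstTokenAt: number;
  lastTokenAt: number;
  outputTokens: number;
}

function timeToFirstTokenMs(t: StreamTiming): number {
  return t.firstTokenAt - t.requestSentAt;
}

// Generation speed measured from the first token to the last one.
function tokensPerSecond(t: StreamTiming): number {
  const generationMs = t.lastTokenAt - t.firstTokenAt;
  if (generationMs <= 0) return 0;
  return t.outputTokens / (generationMs / 1000);
}

const latenciesMs = [120, 150, 180, 200, 220, 250, 300, 400, 900, 2500];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);

const timing: StreamTiming = {
  requestSentAt: 0,
  firstTokenAt: 450,
  lastTokenAt: 4450,
  outputTokens: 200,
};

console.log({ p50, p95, ttftMs: timeToFirstTokenMs(timing), tps: tokensPerSecond(timing) });
```

In the sample data the median looks healthy while P95 is more than ten times higher, which is exactly the tail that an average would hide.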

Cost KPIs

4. Cost per user request — average and P95, with a weekly trend. It tends to creep up silently.

5. Cost per business outcome — e.g. "cost per successful booking", "cost per qualified lead". Much more informative than per-token cost: this is the one the CFO actually understands.

6. Token usage breakdown — input vs. output, cached vs. uncached (use prompt caching!), per model (how much goes to gpt-4o vs. gpt-4o-mini).
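The cost KPIs reduce to simple arithmetic once token usage is recorded per request. A minimal sketch; the `ModelPrice` numbers are made up for illustration and are not any vendor's real price list, and the field names are hypothetical:

```typescript
// Per-million-token prices in USD. Substitute your vendor's current price list.
interface ModelPrice {
  inputPerM: number;
  cachedInputPerM: number;
  outputPerM: number;
}

interface Usage {
  inputTokens: number;       // total input tokens, including cached ones
  cachedInputTokens: number; // subset of inputTokens served from the prompt cache
  outputTokens: number;
}

// KPI 4: cost of a single request, with cached input billed at its own rate.
function requestCostUsd(price: ModelPrice, usage: Usage): number {
  const uncachedInput = usage.inputTokens - usage.cachedInputTokens;
  const totalMicroUnits =
    uncachedInput * price.inputPerM +
    usage.cachedInputTokens * price.cachedInputPerM +
    usage.outputTokens * price.outputPerM;
  return totalMicroUnits / 1_000_000;
}

// KPI 5: total spend divided by the number of successful business outcomes.
function costPerOutcomeUsd(requestCosts: number[], successfulOutcomes: number): number {
  if (successfulOutcomes === 0) return Infinity;
  const total = requestCosts.reduce((sum, cost) => sum + cost, 0);
  return total / successfulOutcomes;
}

// Illustrative prices only.
const examplePrice: ModelPrice = { inputPerM: 2, cachedInputPerM: 1, outputPerM: 8 };
const exampleUsage: Usage = { inputTokens: 1000, cachedInputTokens: 400, outputTokens: 500 };

const oneRequest = requestCostUsd(examplePrice, exampleUsage);
const perBooking = costPerOutcomeUsd([0.5, 0.25, 0.25], 4);
console.log(oneRequest, perBooking);
```

Dividing by outcomes rather than by requests is what makes retries, abandoned sessions and failed tool calls show up in the number.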

Quality KPIs

7. Faithfulness (for RAG) — does the answer rely only on the context, or does it "make things up"? Measured by LLM-as-a-judge.

8. Hallucination rate — in how many answers do we see unsupported claims? Measurable via citation validation.

9. Schema/format compliance — what percentage of structured output is valid? Target: 99%+.

10. Tool call success rate — what percentage of tool calls succeed? Failures split into: bad parameters (LLM's fault) vs. real errors (tool's fault).
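Two of the quality KPIs need no judge model at all. A minimal sketch of schema compliance (KPI 9) and the tool-call failure split (KPI 10); the required keys and the `ToolOutcome` labels are hypothetical, and a real system would likely validate against a full schema rather than a key list:

```typescript
// True when the raw model output parses as a JSON object containing every required key.
function isCompliant(raw: string, requiredKeys: string[]): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return false;
    }
    return requiredKeys.every((key) => key in parsed);
  } catch {
    return false; // not valid JSON at all
  }
}

function complianceRate(outputs: string[], requiredKeys: string[]): number {
  if (outputs.length === 0) return 1;
  const ok = outputs.filter((o) => isCompliant(o, requiredKeys)).length;
  return ok / outputs.length;
}

// Hypothetical outcome labels: the model's fault vs. the tool's fault.
type ToolOutcome = "ok" | "bad_parameters" | "tool_error";

function toolCallBreakdown(outcomes: ToolOutcome[]): Record<ToolOutcome, number> {
  const counts: Record<ToolOutcome, number> = { ok: 0, bad_parameters: 0, tool_error: 0 };
  for (const outcome of outcomes) counts[outcome] += 1;
  return counts;
}

const modelOutputs = [
  '{"date":"2025-01-01","guests":4}',
  '{"date":"2025-01-01"}',
  "not json",
  "[1,2]",
];
const rate = complianceRate(modelOutputs, ["date", "guests"]);
const breakdown = toolCallBreakdown(["ok", "ok", "bad_parameters", "tool_error", "ok"]);
console.log(rate, breakdown);
```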

User KPIs

11. User feedback rate (thumbs up / thumbs down): what percentage of users give explicit feedback? A low rate (1–3%) is normal, but the downvotes are a gold mine, because those are exactly the moments where something hurt the user.

12. Task completion rate — how many sessions end successfully (booking, purchase, resolved question)? This is the ultimate business KPI; everything else is just a proxy.
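Both user KPIs fall out of a per-session record. A minimal sketch, assuming a hypothetical `Session` shape; how "task completed" is decided is product-specific and is taken as given here:

```typescript
interface Session {
  id: string;
  feedback?: "up" | "down"; // absent when the user gave no explicit feedback
  taskCompleted: boolean;
}

interface UserKpis {
  feedbackRate: number;
  taskCompletionRate: number;
  downvotedSessionIds: string[]; // the sessions worth reading first
}

function userKpis(sessions: Session[]): UserKpis {
  const total = sessions.length;
  const withFeedback = sessions.filter((s) => s.feedback !== undefined);
  const completed = sessions.filter((s) => s.taskCompleted);
  return {
    feedbackRate: total === 0 ? 0 : withFeedback.length / total,
    taskCompletionRate: total === 0 ? 0 : completed.length / total,
    downvotedSessionIds: withFeedback.filter((s) => s.feedback === "down").map((s) => s.id),
  };
}

const kpis = userKpis([
  { id: "s1", feedback: "up", taskCompleted: true },
  { id: "s2", taskCompleted: true },
  { id: "s3", feedback: "down", taskCompleted: false },
  { id: "s4", taskCompleted: false },
]);
console.log(kpis);
```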


Tool choice: build, buy, or hybrid?

You don't have to build from scratch — the market is saturated and there's something for everyone.

Platform                   | Strength                                   | When?
Langfuse                   | Open source, self-hostable, solid tracing  | Privacy-sensitive, EU data sovereignty
Helicone                   | Simple proxy-based, fast integration       | Quick win, small team
Arize Phoenix              | Open source, ML and LLM together           | Mixed ML + LLM stack
LangSmith                  | LangChain-native, deep evaluation          | If you're on LangChain
Datadog LLM Observability  | Unified APM + LLM                          | If you're already on Datadog
Honeycomb / Grafana + OTel | Standard OTel-based                        | If you want a custom solution

The strategic lines:

  • Buy (SaaS, e.g. LangSmith): fast start, little DevOps. It has a price tag, but you're buying time.
  • Self-host open source (Langfuse, Phoenix): data sovereignty and cost control. In exchange, someone has to maintain it.
  • Build on OTel: if you already have an observability stack and only need the LLM layer. Worth it if the team has OTel experience.

The worst decision is "we'll do it later". Six months in you'll be missing 100,000 traces, and after the fact nobody will be able to reconstruct why Q1 customer experience went south.


Online vs. offline evaluation

Two things that often get confused:

Offline eval runs during development. Fixed dataset (50–500 question-answer pairs), runs in CI/CD, detects regressions. The question: "is the new prompt better than the old one?"

Online eval runs in production. On live traffic, real-time, on a sampling basis (every call would be too expensive). It detects drift. The question: "is live answer quality holding up?"

The two don't substitute for each other — they complement. Offline eval catches pre-deploy regressions; online eval catches post-deploy drift (new user behavior, new edge cases, the vendor silently updating models).
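An offline regression gate can be very small. A minimal sketch, assuming exact-match scoring and plain functions standing in for the old and new prompt versions; in a real suite `AnswerFn` would call the model and the scorer would usually be fuzzier. All names here are hypothetical:

```typescript
interface EvalCase {
  question: string;
  expected: string;
}

// The system under test: maps a question to an answer.
type AnswerFn = (question: string) => string;

const normalize = (s: string): string => s.trim().toLowerCase();

// Fraction of cases where the answer matches the expected string exactly.
function passRate(cases: EvalCase[], answer: AnswerFn): number {
  if (cases.length === 0) throw new Error("empty eval dataset");
  const passed = cases.filter(
    (c) => normalize(answer(c.question)) === normalize(c.expected),
  ).length;
  return passed / cases.length;
}

// CI gate: reject the candidate if it scores worse than the baseline
// by more than the allowed tolerance.
function passesRegressionGate(baseline: number, candidate: number, tolerance = 0.02): boolean {
  return candidate >= baseline - tolerance;
}

const dataset: EvalCase[] = [
  { question: "capital of France?", expected: "Paris" },
  { question: "2 + 2?", expected: "4" },
  { question: "opposite of hot?", expected: "cold" },
  { question: "first month?", expected: "January" },
];

// Stubs standing in for two prompt versions.
const oldAnswers = new Map([
  ["capital of France?", "Paris"],
  ["2 + 2?", "4"],
  ["opposite of hot?", "cold"],
]);
const newAnswers = new Map([
  ["capital of France?", "paris"],
  ["2 + 2?", "four"],
  ["first month?", "January"],
]);
const oldPrompt: AnswerFn = (q) => oldAnswers.get(q) ?? "unknown";
const newPrompt: AnswerFn = (q) => newAnswers.get(q) ?? "unknown";

const baselineRate = passRate(dataset, oldPrompt);
const candidateRate = passRate(dataset, newPrompt);
const gateOk = passesRegressionGate(baselineRate, candidateRate);
console.log({ baselineRate, candidateRate, gateOk });
```

Here the new prompt fixes one case but breaks two others, so the gate rejects it even though a spot check on the fixed case would have looked like an improvement.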

LLM-as-a-judge in production

// Sample 1% of production traffic for judging
if (Math.random() < 0.01) {
  const judgeScore = await tracer.startActiveSpan(
    "llm.eval.judge",
    async (span) => {
      try {
        return await llm.generate({
          model: "gpt-4o-mini", // cheaper judge
          messages: [{
            role: "system",
            content: `Rate the answer on a 1-5 scale:
              - relevance
              - faithfulness (does it stick to the source?)
              - completeness
              Respond in JSON.`
          }, {
            role: "user",
            content: JSON.stringify({ query, context, response })
          }],
          response_format: { type: "json_object" }
        });
      } finally {
        // startActiveSpan does not end the span automatically
        span.end();
      }
    }
  );

  await metrics.recordEvalScore(traceId, judgeScore);
}

A caveat: LLM-as-a-judge isn't perfect either; its scores show about 80–90% correlation with human raters. It has biases of its own and tends to "forgive itself", rating output from its own model family more generously. Calibrate it periodically against a human-reviewed sample rather than treating its scores as ground truth.
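Calibration can start as a comparison of judge scores and human scores on the same answers. A minimal sketch using Pearson correlation plus the mean score difference; the sample scores are invented for illustration, and other agreement measures would be equally reasonable:

```typescript
// Pearson correlation coefficient between two equal-length samples.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length samples with at least 2 points");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) throw new Error("a sample has zero variance");
  return cov / Math.sqrt(varX * varY);
}

// Positive value: the judge scores higher than humans on average (leniency).
function meanBias(judge: number[], human: number[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("need two equal-length, non-empty samples");
  }
  const totalDiff = judge.reduce((sum, score, i) => sum + (score - human[i]), 0);
  return totalDiff / judge.length;
}

// Invented 1-5 scores for the same five answers.
const humanScores = [1, 2, 3, 4, 5];
const judgeScores = [2, 3, 4, 5, 5];

const correlation = pearson(judgeScores, humanScores);
const bias = meanBias(judgeScores, humanScores);
console.log(correlation.toFixed(3), bias);
```

A high correlation with a clearly positive bias, as in this sample, is the "forgiving" pattern: the judge ranks answers like a human would but grades them more generously.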


Debugging — when something falls apart

Three scenarios straight from production:

"Quality suddenly dropped"

Could be a silent vendor model update: vendors, OpenAI included, can repoint an unpinned model alias to a newer snapshot without any change on your side. Could be new user behavior (a new prompt pattern you weren't ready for). Or it can be RAG corpus drift: new documents were ingested and are crowding out the older ones in top-K retrieval.

Debugging path: review traces from the degradation window → diff old (good) and new (bad) answers → re-run the eval suite (does it still pass?) → check the vendor changelog.

"One particular user or question always fails"

An edge case that wasn't in the eval suite. Could be a specific prompt injection or unusual input — more rarely a permission/context leak.

Debugging path: reproduce with the same session ID → review the full trace (which step failed?) → replay prompt and context offline → expand the eval suite with this case so it doesn't recur.

"Cost exploded"

New user behavior with bigger token needs; a failing retry loop (broken request restarting multiple times); a new feature using larger context; or simply the cache hit rate dropped.

Debugging path: review the top-1% most expensive traces → token trend per endpoint → retry counter (any infinite loop?) → measure cache hit rate.
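The first and third steps of that path are easy to script against exported trace data. A minimal sketch, assuming a hypothetical `TraceCost` record with a retry counter; the retry threshold of 3 is an arbitrary illustration:

```typescript
interface TraceCost {
  traceId: string;
  costUsd: number;
  retries: number;
}

// The most expensive `percent` of traces, most expensive first (at least one).
function topExpensive(traces: TraceCost[], percent = 1): TraceCost[] {
  const count = Math.max(1, Math.ceil((traces.length * percent) / 100));
  return [...traces].sort((a, b) => b.costUsd - a.costUsd).slice(0, count);
}

// Traces whose retry counter suggests a retry loop.
function suspectedRetryLoops(traces: TraceCost[], maxRetries = 3): TraceCost[] {
  return traces.filter((t) => t.retries > maxRetries);
}

// 200 ordinary traces plus two outliers.
const traces: TraceCost[] = Array.from({ length: 200 }, (_, i) => ({
  traceId: `t-${i}`,
  costUsd: 0.01,
  retries: 0,
}));
traces[17] = { traceId: "t-17", costUsd: 1.2, retries: 9 };
traces[120] = { traceId: "t-120", costUsd: 0.8, retries: 1 };

const expensive = topExpensive(traces);
const loops = suspectedRetryLoops(traces);
console.log(expensive.map((t) => t.traceId), loops.map((t) => t.traceId));
```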


Strategic decisions — what should you pick?

What should we log?

Logging everything is expensive and privacy-sensitive. Logging nothing is a dead end. A recipe that works in practice:

  • Always log the metadata: trace ID, user ID, latency, cost, model.
  • Sample (1–10%) the full prompt + response.
  • Never log raw: PII, card numbers, passwords — redact first.
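The recipe above can be sketched as a single function that always returns the metadata and attaches the redacted payload only for sampled traces. The regular expressions are deliberately crude illustrations, not a complete PII detector, and every name here is hypothetical. Sampling is derived from a hash of the trace ID so that all calls within one trace get the same decision:

```typescript
// Crude patterns for illustration only; real redaction needs a proper PII library.
const CARD_NUMBER = /\b\d(?:[ -]?\d){12,18}\b/g;
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function redact(text: string): string {
  return text.replace(CARD_NUMBER, "[CARD]").replace(EMAIL, "[EMAIL]");
}

// Deterministic sampling: FNV-1a hash of the trace ID mapped to [0, 1).
function shouldLogFullPayload(traceId: string, sampleRate: number): boolean {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000 < sampleRate;
}

interface CallMetadata {
  traceId: string;
  userId: string;
  latencyMs: number;
  costUsd: number;
  model: string;
}

interface LogRecord extends CallMetadata {
  prompt?: string;
  response?: string;
}

// Metadata is always logged; prompt and response only when sampled, and only redacted.
function buildLogRecord(
  meta: CallMetadata,
  prompt: string,
  response: string,
  sampleRate: number,
): LogRecord {
  if (!shouldLogFullPayload(meta.traceId, sampleRate)) return { ...meta };
  return { ...meta, prompt: redact(prompt), response: redact(response) };
}

const meta: CallMetadata = {
  traceId: "abc123",
  userId: "u-42",
  latencyMs: 850,
  costUsd: 0.012,
  model: "gpt-4o",
};
const sampled = buildLogRecord(meta, "Card 4111 1111 1111 1111, mail anna@example.com", "Done.", 1);
const unsampled = buildLogRecord(meta, "Card 4111 1111 1111 1111", "Done.", 0);
console.log(sampled.prompt, unsampled.prompt);
```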

Where do we store it?

  • Hot storage (last 30 days): fast queries, debugging.
  • Cold storage (1 year): audit, compliance, retrospective analysis.
  • Discard after a year: GDPR and cost.

Check what your tool already covers before you roll your own: Langfuse and Phoenix, for example, ship data-retention settings.
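If you do implement the tiers yourself, the policy is a function of record age. A minimal sketch using the 30-day and one-year boundaries from the list above; the `storageTier` name and the 365-day reading of "a year" are assumptions:

```typescript
type Tier = "hot" | "cold" | "discard";

const DAY_MS = 24 * 60 * 60 * 1000;

// Classify a trace record by age: hot for 30 days, cold up to a year, then discard.
function storageTier(createdAt: Date, now: Date): Tier {
  const ageDays = (now.getTime() - createdAt.getTime()) / DAY_MS;
  if (ageDays <= 30) return "hot";
  if (ageDays <= 365) return "cold";
  return "discard";
}

const today = new Date(Date.UTC(2025, 5, 30)); // 30 June 2025
const recent = storageTier(new Date(Date.UTC(2025, 5, 20)), today);
const older = storageTier(new Date(Date.UTC(2025, 0, 1)), today);
const expired = storageTier(new Date(Date.UTC(2024, 0, 1)), today);
console.log(recent, older, expired);
```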

Who can see it?

  • Engineering: full access to tracing, debugging.
  • Product: aggregated KPIs, user feedback.
  • Compliance: audit log, PII leak detection.
  • Customer support: only the relevant user's sessions.

Role-based access isn't cosmetic. AI traces often contain sensitive content (what the user asked), so an engineering tool that anyone can browse eventually becomes a GDPR incident.
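The access matrix above can be expressed as data plus one check. A minimal sketch; the role and resource names are hypothetical labels for the four groups listed, and the "relevant user" rule for support is modeled as a list of user IDs assigned to the agent:

```typescript
type Role = "engineering" | "product" | "compliance" | "support";
type Resource =
  | "full_trace"
  | "aggregated_kpis"
  | "user_feedback"
  | "audit_log"
  | "pii_reports"
  | "user_sessions";

const PERMISSIONS: Record<Role, Resource[]> = {
  engineering: ["full_trace", "user_sessions"],
  product: ["aggregated_kpis", "user_feedback"],
  compliance: ["audit_log", "pii_reports"],
  support: ["user_sessions"],
};

interface AccessRequest {
  role: Role;
  resource: Resource;
  sessionUserId?: string;     // owner of the session being opened
  assignedUserIds?: string[]; // users a support agent is currently handling
}

function canAccess(req: AccessRequest): boolean {
  if (!PERMISSIONS[req.role].includes(req.resource)) return false;
  // Support may only open sessions of users assigned to them.
  if (req.role === "support") {
    return (
      req.sessionUserId !== undefined &&
      (req.assignedUserIds ?? []).includes(req.sessionUserId)
    );
  }
  return true;
}

const supportOwnUser = canAccess({
  role: "support",
  resource: "user_sessions",
  sessionUserId: "u1",
  assignedUserIds: ["u1"],
});
const supportOtherUser = canAccess({
  role: "support",
  resource: "user_sessions",
  sessionUserId: "u2",
  assignedUserIds: ["u1"],
});
const productTrace = canAccess({ role: "product", resource: "full_trace" });
console.log(supportOwnUser, supportOtherUser, productTrace);
```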

When should the alert fire?

Classic alerting: error rate, latency. LLM-specific alerts look at different things:

Alert                       | Threshold            | Action
Hallucination rate spiked   | > 5% (1h average)    | On-call engineer
Cost per request            | +50% vs. 7 days ago  | Finance + eng review
Schema compliance           | < 95% sustained      | Bad prompt deploy suspected
Negative user feedback rate | > 10% last 24h       | Product review
Vendor model latency        | 2x sustained         | Vendor incident suspected
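The thresholds in the table translate directly into rules over a KPI snapshot. A minimal sketch, assuming a hypothetical `KpiSnapshot` that your metrics pipeline fills in; "sustained" cannot be judged from a single snapshot, so a real alerting system would require the condition to hold across several consecutive evaluations:

```typescript
interface KpiSnapshot {
  hallucinationRate1h: number;     // fraction, 1h average
  costPerRequestUsd: number;
  costPerRequestUsd7dAgo: number;
  schemaCompliance: number;        // fraction of valid structured outputs
  negativeFeedbackRate24h: number; // fraction of feedback that is negative
  vendorLatencyMs: number;
  vendorLatencyBaselineMs: number;
}

interface AlertRule {
  name: string;
  action: string;
  fires: (s: KpiSnapshot) => boolean;
}

const RULES: AlertRule[] = [
  { name: "Hallucination rate spiked", action: "On-call engineer",
    fires: (s) => s.hallucinationRate1h > 0.05 },
  { name: "Cost per request", action: "Finance + eng review",
    fires: (s) => s.costPerRequestUsd >= s.costPerRequestUsd7dAgo * 1.5 },
  { name: "Schema compliance", action: "Bad prompt deploy suspected",
    fires: (s) => s.schemaCompliance < 0.95 },
  { name: "Negative user feedback rate", action: "Product review",
    fires: (s) => s.negativeFeedbackRate24h > 0.1 },
  { name: "Vendor model latency", action: "Vendor incident suspected",
    fires: (s) => s.vendorLatencyMs >= s.vendorLatencyBaselineMs * 2 },
];

function firingAlerts(snapshot: KpiSnapshot): AlertRule[] {
  return RULES.filter((rule) => rule.fires(snapshot));
}

const snapshot: KpiSnapshot = {
  hallucinationRate1h: 0.07,
  costPerRequestUsd: 0.03,
  costPerRequestUsd7dAgo: 0.03,
  schemaCompliance: 0.99,
  negativeFeedbackRate24h: 0.04,
  vendorLatencyMs: 900,
  vendorLatencyBaselineMs: 800,
};

const firing = firingAlerts(snapshot);
console.log(firing.map((rule) => `${rule.name} -> ${rule.action}`));
```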

The roadmap — how to build it up gradually

You don't have to roll out the entire stack on day one. A realistic timeline:

Month 1: tracing foundations

  • Pick an LLM observability vendor or self-host Langfuse
  • Trace every LLM / tool call (OTel)
  • Basic dashboard: latency, cost, error rate

Month 2: evaluation foundations

  • Offline eval suite (50–100 questions)
  • CI/CD integration — runs on every prompt change
  • Online sampling-based LLM-as-a-judge

Month 3: user feedback + KPIs

  • A feedback button on every answer
  • Define task completion KPIs
  • Cost per business outcome report

Months 4–6: alerting + drift detection

  • Alerting on the critical KPIs
  • Weekly review meeting on the top-1% worst sessions
  • Weekly eval trend

Month 6+: continuous improvement

  • Categorize "failed" sessions
  • Expand the eval suite from real failures
  • A/B testing infrastructure (prompt versions, model versions)

The most common mistakes — worth avoiding

"Observability later." Logs can't be reconstructed retroactively. You need it from day one — same as tests.

Only technical KPIs. A 200 ms latency is worthless if the answer is wrong. Measure quality too, otherwise you'll have a beautifully shiny dashboard while your users walk away.

Too much PII in the logs. A GDPR audit is embarrassing when the logs contain 100,000 customer transactions in the clear. Redact input and output before you log them.

LLM-as-a-judge taken naively. The judge can be wrong too. Calibrate against a human sample; don't treat it as ground truth — otherwise the "eval score" becomes a nice, reassuring, but lying number.

No user feedback. Thumbs up/down is the cheapest and best signal. One button. Don't skip it.

Alert fatigue. Too many alerts → nobody reacts. Prioritization and noise reduction are required, otherwise the on-call engineer will mute the whole thing within a week.


Summary: 7 takeaways

  1. Classic observability isn't enough — the LLM doesn't throw errors when it answers badly. A separate layer is needed.

  2. Five dimensions: logs, metrics, traces + evaluations + user feedback. The last two are the novelty.

  3. OpenTelemetry + GenAI conventions: standards-based tracing that any OTel-compatible tool can read.

  4. Twelve KPIs in four groups: performance, cost, quality, user. Don't just measure latency.

  5. Offline + online eval together: regression testing in CI/CD, drift detection in production.

  6. Vendor choice: Langfuse (self-host), LangSmith (LangChain), Datadog (if you're already there). Don't build from scratch.

  7. Required from day one — you can't reconstruct it later. Every deferred observability decision is more work, not less.

Good LLM observability makes the invisible visible: hallucination, context loss, cost spikes, quality drift. Without it you're flying blind — and you only find out something's wrong when a customer complains.

A mature AI team spends more on observability than on LLM calls themselves. It's worth it. Because an unmeasured system degrades gradually, without you noticing — and once it's built up, reversing that degradation costs multiples of what observability would have.


Want to see what an LLM observability stack looks like inside AIMY? Get in touch — we'll walk you through the live dashboard and the tracing patterns.