A dashboard can tell you the AI system got slower.

It can tell you tokens went up. It can tell you errors spiked. It can tell you a model endpoint had a bad afternoon.

Useful, yes.

But when a customer asks why the system gave a bad answer, a graph is not enough. The team needs the run.

LLM observability replayable trail

The unit of debugging is the answer

In ordinary backend work, we debug requests.

For LLM systems, the request is only the outer shell. The thing you need to inspect is the answer path:

  • what was the user trying to do?
  • which prompt version ran?
  • what context was retrieved?
  • which sources were ignored?
  • which tools were called?
  • what did the model see before it answered?
  • which eval or guardrail fired?
  • did a human approve, edit, or reject it?

Without that trail, every incident becomes a guessing exercise.

Maybe retrieval pulled the wrong document. Maybe the prompt made refusal too weak. Maybe the model followed a malicious instruction inside a page. Maybe the tool returned stale data. Maybe the answer was fine, but the UI removed the citation that made it defensible.

You will not know from a p95 chart.

Tracing is not only for agent nerds

Tracing can sound like an agent-framework feature. It is bigger than that.

OpenAI’s Agents SDK has tracing concepts for agent runs. LangSmith is built around traces, runs, datasets, and evals. The specific tool matters less than the principle: a production AI system should leave enough evidence to reconstruct what happened.

That evidence needs to include the boring details:

EvidenceWhy it matters
prompt versiona copy change can change behavior
model versionproviders change models and routing
retrieved chunksbad context often creates bad answers
source IDsreviewers need to verify claims
tool inputs and outputsaction requires accountability
eval resultsquality needs a recorded signal
human decisionreview is part of the system

The goal is not surveillance theatre. The goal is operational memory.

Redaction has to be designed in

There is a real trap here. The more useful the trace, the more likely it is to contain sensitive data.

That is especially true in financial services, healthcare, legal workflows, HR, and enterprise support. Prompts may include customer details. Retrieved documents may include internal policy. Tool calls may include account identifiers. A raw trace can become a data leak with a nice UI.

So observability needs a data policy, not just an SDK.

Decide what gets stored, what gets hashed, what gets redacted, who can see raw traces, and how long they live. If the AI feature touches regulated data, this is not a later cleanup task. It is part of the launch design.

Evals belong beside traces

A trace tells you what happened.

An eval tells you whether the system did something acceptable.

For production work, I like evals that sit close to the run:

  • did the answer cite an approved source?
  • did it avoid restricted claims?
  • did it ask for clarification when required?
  • did it refuse a prompt-injection attempt?
  • did it call only approved tools?
  • did it stay within the task boundary?

These checks do not replace human review. They make review less blind.

The best eval cases often come from ugly production moments. A bad answer, a confusing support ticket, a prompt injection, a hallucinated policy, a tool call that should have required approval. Turn those into replayable tests.

That is how the system learns from pain.

Observability changes team behavior

Good observability does more than help after incidents. It changes how people build.

When engineers know the trace will show retrieved sources, they take retrieval quality more seriously. When tool calls are visible, permission design gets less hand-wavy. When prompt versions are recorded, prompt edits stop being invisible production changes. When eval failures are attached to releases, the team has a reason to discuss quality before users do.

That is the culture shift I want.

AI systems should not be trusted because they sound confident. They should earn trust by leaving evidence.

A simple observability bar

For a serious LLM feature, I would not ship without this:

  1. prompt and model version recorded
  2. retrieval source IDs recorded
  3. tool calls recorded with inputs, outputs, and approval state
  4. eval checks attached to the run
  5. human edits or overrides recorded where relevant
  6. sensitive fields redacted by design
  7. a way to replay representative failures
  8. a weekly review of sampled traces during rollout

That is enough to start.

If the system cannot leave this trail, keep the blast radius small. Use it internally. Keep a human in the loop. Do not pretend the dashboard is an operating model.

A chart tells you something moved.

A replayable trail tells you what the system did.

Sources worth reading