LLM observability is not a dashboard. It is a replayable trail.
A latency chart will not explain why an AI answer was wrong. Production LLM systems need traces, sources, tool calls, prompt versions, eval results, and human decisions.
Topic archive
17 essays tagged LLMOps. Practical notes on what happens after the demo: prompts, tools, review packets, evals, rollback, and production ownership.
A Claude Code diff is not enough evidence for production review. Ask for the objective, permission boundary, tool trace, tests, failures, cost, and rollback path before approving agent work.

Claude Code: Building Production Agents That Actually Scale is now live on Amazon Kindle. Here is who it is for and why I wrote it.
If a Claude Code agent can change production-shaped code, the prompt should say how to undo the work. Rollback is not paperwork after the diff. It is part of the task boundary.
Claude Code cost problems usually start before the model call: vague tasks, wide-open tools, repeated repo exploration, and no stop rule. Treat spend as a workflow bug, not just a pricing problem.
Production Claude Code evals should not begin with abstract benchmarks. Start with the agent runs that scared you, reduce them into replayable cases, and use them to tune permissions, prompts, tools, and review gates.
Claude Code permission modes can look safer than they are. The real production risk lives in tool scope: paths, network access, secrets, deploy files, and what reviewers actually approve.