Claude Code evals should start with the run that scared you
The best Claude Code eval is not a tidy benchmark. It is the uncomfortable run your team does not want to repeat, captured as a replayable production control.
Topic archive
17 essays tagged LLMOps. Practical notes on what happens after the demo: prompts, tools, review packets, evals, rollback, and production ownership.
Before giving Claude Code wider access, define what each run may read, edit, call, spend, and merge. A permission budget keeps agent speed inside a reviewable boundary.
MCP tools make Claude Code far more useful, but broad access turns a weak prompt into a production risk. Treat every tool as blast radius, not convenience.
Claude Code can waste more than tokens when it keeps retrying a weak task. Production teams need budgets, stop rules, and evidence before another agent attempt is allowed.
Claude Code can produce a working patch and still leave the next human with a weak handoff. Production teams need run records that show scope, evidence, risk, and rollback before review turns into archaeology.

I published Securing Enterprise AI Agents, a practical book on bounded AI autonomy, AgentSecOps, MCP security, RAG governance, identity, evals, policy, and evidence.
Human review matters, but it cannot fix every bad Claude Code boundary after the run. Production teams need scoped permissions, MCP limits, hard stops, and evidence before widening access.
Claude Code patches can look ready before they are reviewable. Production teams need a run record with task boundaries, commands, checks, risks, and rollback notes.
Claude Code gets safer when permissions are treated like a budget: scoped files, allowed tools, spend limits, stop rules, review packets, and rollback notes before wider autonomy.
The AI POC is not the hard part anymore. The hard part is turning a promising demo into a service with ownership, evals, traces, cost controls, and a rollback path.