Diagram showing a bad Claude Code run becoming a replay case, an eval, a control change, and a safer next run

Claude Code evals should start with the run that scared you

The best Claude Code eval is not a tidy benchmark. It is the uncomfortable run your team does not want to repeat, captured as a replayable production control.

May 20, 2026 · 8 min · 1604 words · Thomas De Vos
Diagram showing the operating gap between an AI POC and production AI

From AI POC to production: the part teams keep skipping

The AI POC is not the hard part anymore. The hard part is turning a promising demo into a service with ownership, evals, traces, cost controls, and a rollback path.

May 12, 2026 · 7 min · 1336 words · Thomas De Vos
Diagram showing metric-only LLM observability versus a replayable production AI trace

LLM observability is not a dashboard. It is a replayable trail.

A latency chart will not explain why an AI answer was wrong. Production LLM systems need traces, sources, tool calls, prompt versions, eval results, and human decisions.

May 10, 2026 · 4 min · 812 words · Thomas De Vos
Diagram showing a Claude Code run stopping after repeated failures and producing a review packet instead of looping blindly

Claude Code needs a stop rule before more autonomy

Claude Code gets risky when a failed run keeps retrying without a stop rule. Use failure budgets, review packets, evals, and rollback notes before giving agents more autonomy.

May 7, 2026 · 7 min · 1364 words · Thomas De Vos
Diagram showing a Claude Code team adoption runbook with task contract, scoped permissions, review packet, evals, and rollback

Claude Code team adoption needs a seatbelt runbook

Claude Code gets risky when teams roll it out through enthusiasm instead of a runbook. Start with task contracts, scoped permissions, review packets, evals, and rollback before widening autonomy.

May 6, 2026 · 7 min · 1327 words · Thomas De Vos
Claude Code evaluation loop showing capture, reduce, test, and change steps for failed agent runs

Claude Code Evals Should Start With Bad Runs

Production Claude Code evals should not begin with abstract benchmarks. Start with the agent runs that scared you, reduce them into replayable cases, and use them to tune permissions, prompts, tools, and review gates.

April 29, 2026 · 5 min · 1020 words · Thomas De Vos