Claude Code evaluation loop showing capture, reduce, test, and change steps for failed agent runs

Claude Code Evals Should Start With Bad Runs

Production Claude Code evals should not begin with abstract benchmarks. Start with the agent runs that scared you, reduce them into replayable cases, and use them to tune permissions, prompts, tools, and review gates.

April 29, 2026 · 5 min · 983 words · Thomas De Vos
Read Claude Code Evals Should Start With Bad Runs