Claude Code Evals Should Start With Bad Runs
Production Claude Code evals should not begin with abstract benchmarks. Start with the agent runs that scared you, reduce them into replayable cases, and use them to tune permissions, prompts, tools, and review gates.