Claude Code output is not evidence

Claude Code can produce a patch that looks ready before the work is actually reviewable.

That sounds like a small distinction. It is not.

Ready-looking output has a diff, a confident summary, and maybe a passing test run. Reviewable work has a trail. It shows what the agent believed the task was, what it refused to touch, which files changed, what commands ran, which checks were skipped, and where the rollback handle is.

Most agent review pain comes from mixing those two things up.

The diff answers the wrong question

A diff answers “what changed?” That matters, but it is too narrow for agent work.

With a human PR, you can ask the developer what they meant. You can ask why they ignored a path, changed a default, or chose one test over another. The answer might be messy, but there is a person there who remembers the trade-off.

An agent run is different. Once the context scrolls away, the team is left with the patch, the summary, and whatever evidence was captured during the run. If that evidence is thin, the reviewer has to reconstruct the work after the fact.

That is where the cost hides.

The agent saves time producing the patch, then the human spends the saved time figuring out whether the patch can be trusted. I have seen the same pattern in ordinary engineering work, long before coding agents: the expensive part is not always writing the change. It is recovering the reasoning when the change lands without enough context.

Claude Code makes that loop faster.

Confident summaries are weak evidence

A summary like “updated the auth flow and all tests pass” feels useful. It may even be true. It still leaves the reviewer with the hard questions:

  • Which auth path changed?
  • Which files did the agent decide not to touch?
  • Which tests ran, and which obvious checks were skipped?
  • Did the agent widen scope during the run?
  • Did it change a default, permission, prompt, schema, or dependency?
  • What assumption would hurt us most if it were wrong?
  • How would we back this out tonight?

Those questions are not bureaucracy. They are the difference between a patch that merely compiles and a patch that can survive contact with production.

A passing test is similar. Good signal, not enough by itself. A thin test suite can go green while the agent changes a boundary nobody meant to change. A unit test can pass while the integration path is broken. A local run can pass while CI has different flags, different data, or a different model configuration.

I like green tests. I do not like pretending they explain the work.

The run record is the review artifact

For production Claude Code work, I want the run record treated as part of the deliverable.

At minimum, it should answer:

  1. What was the task boundary?
  2. What changed and why?
  3. What commands ran?
  4. Which checks passed?
  5. Which checks were skipped?
  6. What risks or assumptions remain?
  7. How do we roll back?
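To make "no run record, no merge" concrete, here is a minimal sketch of that gate as code. The field names and gate logic are my own invention for illustration, not a Claude Code feature:

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Minimal run record; fields mirror the seven questions above."""
    task_boundary: str = ""
    change_summary: str = ""
    commands_run: list = field(default_factory=list)
    checks_passed: list = field(default_factory=list)
    checks_skipped: list = field(default_factory=list)  # each entry should carry a reason
    risks: list = field(default_factory=list)
    rollback: str = ""

def missing_answers(record: RunRecord) -> list:
    """Return the unanswered questions; an empty list means the merge gate is open."""
    missing = []
    if not record.task_boundary:
        missing.append("task boundary")
    if not record.change_summary:
        missing.append("change summary")
    if not record.commands_run:
        missing.append("commands run")
    if not record.rollback:
        missing.append("rollback note")
    return missing

# A record without a rollback note fails the gate.
incomplete = RunRecord(
    task_boundary="fix auth redirect",
    change_summary="one handler changed",
    commands_run=["pytest"],
)
print(missing_answers(incomplete))  # ['rollback note']
```

The point of the sketch is the shape, not the tooling: any CI step or PR template that refuses to proceed on an empty rollback field enforces the same habit.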

That record can live in the PR description, a generated review packet, a CI artifact, or the agent’s final response copied into the ticket. I care less about the storage format than the habit.

No run record, no merge.

That rule sounds strict until you compare it with how teams already treat production changes. We ask humans for tickets, tests, changelogs, approvals, incident notes, and rollback plans because software changes outlive the mood of the person who made them. Agent changes should not get a lower bar because the demo looked smooth.

A practical template for Claude Code reviews

Here is the version I would put into a team runbook or PR template.

Claude Code run record

Task boundary:
- What the agent was asked to change.
- Files, directories, tools, and MCP servers allowed for this run.

Change summary:
- Files changed and why.
- Behavior changed, including defaults or permissions.

Commands and checks:
- Commands run.
- Tests passed.
- Checks skipped, with reason.

Risk note:
- Assumptions made.
- Areas not inspected.
- Anything security, data, auth, payment, or infrastructure related.

Rollback note:
- How to revert the change.
- Any migration, data, or config caveats.

The template is deliberately plain. Production agent work does not need theatre. It needs enough boring detail for a tired reviewer to make a good decision.

If your team is starting to use Claude Code on real repositories, the short version is in this checklist: Claude Code production checklist.

The rollback note changes the work

The rollback note is the part people skip.

That is exactly why it belongs in the run record.

When an agent has to explain how to reverse its change, several useful things happen. Broad edits become more obvious. Hidden coupling gets named. Database and config changes stop hiding inside a neat summary. The reviewer can see whether the agent understood the blast radius or only the happy path.

This is especially important once MCP tools enter the workflow. A coding agent that can edit files is one risk. A coding agent that can also query systems, create tickets, call internal tools, or trigger build actions has a wider operating envelope. The run record should show not only what changed in code, but what the agent did around the code.

If the rollback story is vague, I would rather narrow the next run than widen autonomy.
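A crude way to surface a vague rollback story early is to scan the change summary for operations that a plain revert will not undo. The keyword list below is illustrative and deliberately incomplete; a real team would maintain its own:

```python
# Signals that usually mean "git revert" is not enough on its own.
# Illustrative list only; extend it with your own one-way operations.
ONE_WAY_SIGNALS = (
    "migration",
    "backfill",
    "drop column",
    "schema",
    "secret rotation",
    "data deletion",
)

def rollback_caveats(change_summary: str) -> list:
    """Return the one-way signals present in a change summary (case-insensitive)."""
    text = change_summary.lower()
    return [signal for signal in ONE_WAY_SIGNALS if signal in text]

print(rollback_caveats(
    "Adds login throttling and a schema migration for the attempts table"
))  # ['migration', 'schema']
```

Any hit from a check like this is a prompt for a written caveat in the rollback note, not an automatic block.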

Use evidence to adjust autonomy

I do not think team adoption should treat autonomy as a permanent setting.

A strong run record earns trust for the workflow. Narrow task, scoped files, clear commands, meaningful checks, useful risk note, clean rollback path. Maybe that kind of run gets a little more room next time.

A weak run record should shrink the workflow. Broad diff, skipped checks, unclear assumptions, no rollback note. That is not a reason to cheer the speed of the agent. It is a reason to reduce write scope, tighten permissions, or split the task into smaller runs.

That feedback loop matters more than a clever prompt.

Production Claude Code usage should look like this:

task contract -> scoped run -> patch -> run record -> human review -> permission adjustment
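The permission-adjustment step at the end of that loop can be sketched as a simple policy. The quality flags and scoring thresholds here are invented for illustration; they are not a Claude Code setting:

```python
def next_run_scope(record_quality: dict, current_scope: int) -> int:
    """
    Adjust the write scope for the next run based on run-record quality.
    record_quality holds boolean flags; scope is an abstract breadth number
    (e.g. how many directories the agent may edit). Thresholds are illustrative.
    """
    strong = all([
        record_quality.get("narrow_task", False),
        record_quality.get("checks_meaningful", False),
        record_quality.get("rollback_clear", False),
    ])
    if strong:
        return current_scope + 1           # a strong record earns a little more room
    return max(1, current_scope // 2)      # a weak record shrinks the workflow

# A run with unclear checks and no rollback note halves the scope.
print(next_run_scope({"narrow_task": True}, 8))  # 4
```

The asymmetry is deliberate: trust grows slowly and shrinks quickly, which matches how teams already treat access to production.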

The operating model is not glamorous. Good. Glamour is overrated near production systems.

The sale is discipline

There is a temptation to sell coding agents as pure acceleration. Faster tickets. Faster refactors. Faster tests. Faster everything.

Speed is useful, but speed without evidence just moves the work into review and incident response.

The teams I trust with Claude Code will be the ones that can answer boring questions quickly. What did the agent touch? What did it avoid? Which tools did it call? Which checks proved something? What should a human inspect? How do we reverse this if it goes wrong?

That is the work after the demo.

If you are trying to move from impressive Claude Code runs to production-ready team workflows, I wrote the book for that exact gap: Claude Code: From Vibe Coding to Production. It covers task contracts, run records, review packets, permissions, MCP blast radius, evals, observability, rollback, and safe autonomy.

Build reviewable Claude Code workflows.
Read the book: Claude Code: From Vibe Coding to Production on Amazon Kindle.
Want the operating checklist first? Start with the free Claude Code production checklist.