Before you merge Claude Code’s work, ask for the receipt

Passing tests are a useful signal. They are not the receipt.

That difference matters once Claude Code starts working on real repositories. A run can pass the local test suite and still leave the reviewer with the wrong question: “Do I trust this?”

The answer should not depend on vibes. It should depend on evidence.

For production work, I want every serious Claude Code run to end with a short review packet. Nothing fancy. Just enough for a human to see the shape of the work without replaying the whole session from memory.

Claude Code review packet evidence diagram

A good packet answers these questions:

  • What was the task contract?
  • Which files and tools were in scope?
  • What changed?
  • What evidence supports the change?
  • Where did the agent hit boundary pressure?
  • What risks remain?
  • What is the rollback path?

Most teams already do a weaker version of this in pull request descriptions. The problem is timing. If the packet is written after the fact, it can become a neat story about a messy run. The better approach is to tell Claude Code at the start that the review packet is part of the deliverable.

That changes the run.

The agent has to keep track of evidence as it works. It has to admit when the fix depends on a file it was not supposed to touch. It has to separate “I changed this” from “I assume this is safe”. It has to leave the reviewer with something testable.

The diff can look cleaner than the run

This is where agent risk often hides.

The final diff might look reasonable. The path to the diff might not. Maybe Claude Code changed tests to fit the implementation. Maybe it inspected config it did not need. Maybe it found an auth edge case and quietly avoided mentioning it. Maybe it fixed the visible bug while moving the failure into a queue, job, billing path, or migration.

None of those problems always show up in the patch.

They show up in the receipt.

For human code review, we often assume the reviewer can infer the story from the diff, the commit message, and the tests. That assumption was already weak for complicated systems. With coding agents it gets worse, because the work may include tool calls, retries, hidden context, generated tests, MCP access, and dead-end attempts that never appear in the final patch.

A production reviewer should not have to guess which of those happened.

The review packet template

Here is the version I would start with.

Task contract:
What Claude Code was asked to do.

Allowed scope:
Files, directories, commands, tools, and data sources it could use.

Files inspected:
What it read and why.

Files changed:
What it edited and the reason for each change.

Commands run:
Tests, linters, builds, searches, scripts, and any failures.

Evidence:
The concrete proof that the change works.

Boundary pressure:
Places where the task appeared to touch auth, billing, data, MCP tools,
infra, migrations, deployment, or another forbidden area.

Remaining risk:
What the agent could not prove.

Rollback note:
How to undo the change quickly if production disagrees with the tests.

This looks plain because it should. Production controls do not need theatre. They need to survive Tuesday afternoon when the team is busy and the patch looks fine at first glance.

The two fields I care about most are boundary pressure and rollback.

Boundary pressure tells you where the agent wanted to cross the line. Rollback tells you whether the team has a way back if the change is wrong. If either field is empty on a non-trivial change, I would slow down.

Ask for the packet before the run starts

Do not wait until the pull request is ready.

If you ask Claude Code for a summary after the work is done, it will usually give you a clean narrative. Sometimes that is useful. Sometimes it is too clean. The rough edges have already been sanded off.

Ask before the run starts instead:

At the end of this task, produce a review packet with:
- task contract
- files inspected and changed
- commands run
- evidence
- boundary pressure
- remaining risks
- rollback note

Collect the evidence while you work. If you need to cross the allowed scope,
stop and explain the pressure before continuing.

That one instruction changes the incentives. Claude Code now has to notice the boundary while it is working, not after it has already found a route around it.

This is especially important with MCP tools. A tool that reads a ticket is not the same risk as a tool that queries a live system. A docs search is not the same as access to incident notes. A local test run is not the same as a deployment command. The model may treat them all as steps toward the answer. Your workflow should not.

What reviewers should look for

A review packet is only useful if people read it with a little suspicion.

I would train reviewers to ask:

  • Did the run stay inside the agreed scope?
  • Did Claude Code inspect files for named reasons, or just gather context broadly?
  • Did tests prove behavior, or did they bless the patch?
  • Were any tests changed? If so, why?
  • Did MCP or external tool access widen the blast radius?
  • Did the agent report a boundary it should have stopped at?
  • Is the rollback path realistic, or is it just “revert the PR” pasted into a field?

The rollback question is worth extra attention. In small changes, reverting the PR may be enough. In production systems it often is not. A migration may need a down path. A config change may need a staged rollback. A queue or billing change may need a data repair note. A feature flag may need an owner.

If Claude Code cannot explain the rollback path, the reviewer has found useful uncertainty.

A small example

Suppose the task is to fix an invoice export bug.

A weak prompt says:

Fix the failing invoice export test.

A production prompt says:

Fix the failing invoice export test.

Allowed reads: invoice export module, invoice export tests, ticket PAY-1842.
Allowed edits: invoice export module and tests only.
Allowed commands: npm test -- invoice-export.
Allowed tools: read-only ticket lookup.
Stop before touching auth, billing config, customer data, migrations,
deployment, or other services.
End with a review packet: scope, files inspected, files changed,
commands run, evidence, boundary pressure, remaining risk, rollback note.

The second prompt is less exciting. Good.

If the agent discovers the bug may depend on billing config, it should stop and say so. That is not failure. That is the workflow doing its job. The task has changed, so a human needs to decide whether to widen the scope.

This is the same discipline I learned from years of financial-services engineering: the dangerous changes are not always the biggest diffs. They are often the small changes that quietly cross a control boundary.

The review packet is not bureaucracy

I can already hear the objection: this sounds like process.

It is process. The useful kind.

The goal is not to slow every Claude Code run to a crawl. Tiny local changes do not need a ceremony. But when the agent touches production-adjacent code, calls tools, edits tests, changes behavior, or works near customer impact, the team needs a receipt.

Otherwise “review” becomes a thin layer over trust. The reviewer sees a green test run and a plausible explanation, then approves a change whose actual route is unclear.

That is fine for demos. It is not enough for teams that want to use Claude Code every week without creating a mess they cannot reconstruct.

Make the receipt part of the operating loop

A useful Claude Code workflow has more shape than prompt, patch, merge.

It is:

task contract -> bounded run -> evidence -> review packet -> human decision -> rollback-ready merge

That loop is slower than blind automation. It is also much faster than explaining a production incident from a chat transcript, half a diff, and a test suite that no longer proves what you thought it proved.

If Claude Code is already touching real code in your team, start by tightening the evidence you expect from each run. The free Claude Code production checklist covers the broader control loop: permissions, tool access, evals, observability, review, and rollback.

For the full operating model, I wrote the Claude Code book around this exact problem: moving from impressive individual runs to a system a team can trust.

Working with Claude Code in production?
Get the book and the checklist before you widen the agent's permissions. Start with the free production checklist, or go straight to the Claude Code book. The landing page points to the Amazon Kindle edition as the main purchase option.