The most dangerous Claude Code review is the one that starts with a neat-looking diff and no story.

A diff tells you what changed. It does not tell you why the agent chose that path, what it ignored, which tools it used, where it failed, what it retried, or whether it quietly widened the risk boundary along the way.

I do not want agent work reviewed like a magic trick: stare at the final result, applaud if the tests pass, move on.

For production-shaped work, I want a review packet.

Claude Code review packet

A clean diff is not enough

A clean diff can still be the end of a messy run.

Maybe Claude Code tried three approaches before landing on the final patch. Maybe it touched a shared helper, backed out, then left one small change behind. Maybe it loaded a file it should not have needed. Maybe it ran the broad test suite when a targeted one would have shown the real issue. Maybe the only thing a reviewer sees is the polished summary at the end.

That is not a review. That is trust by omission.

For small local work, this may be fine. If the agent is renaming a private function and the blast radius is tiny, I do not need ceremony. But when Claude Code edits production-shaped code, calls MCP tools, changes configuration, updates generated files, or moves near deploy scripts, the review needs evidence.

Not a twenty-page document. A compact packet.

What belongs in the packet

The review packet should answer the questions a tired engineer would ask before pressing merge:

| Field | Why it matters |
| --- | --- |
| Objective | What was the agent actually asked to do? |
| Permission boundary | Which files, commands, MCP tools, network calls, and secrets were allowed or blocked? |
| Plan changes | Did the agent change approach after seeing the code? |
| Tool trace | What commands ran, which files were read, and what external tools were used? |
| Diff summary | What changed, and where is the risky part? |
| Tests and evals | What passed, what failed, what was skipped, and why? |
| Cost and retries | Did the run loop, burn context, or repeat the same mistake? |
| Rollback path | If this is wrong in staging, how do we unwind it? |

That list sounds boring because it is. Good review evidence is usually boring.

The value is not in producing paperwork. The value is that a human can make a real decision in a few minutes instead of reverse-engineering the whole run from a diff and a confident paragraph.
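Giving the packet a fixed shape also makes "is anything missing?" a mechanical check instead of a judgment call. A minimal sketch in Python; the field names mirror the table above but are my own, not an official format:

```python
from dataclasses import dataclass

@dataclass
class ReviewPacket:
    """Compact evidence for one agent run. Field names are illustrative."""
    objective: str            # what the agent was asked to do
    allowed: list[str]        # permission boundary: allowed paths/tools
    blocked: list[str]        # permission boundary: explicit blocks
    plan_changes: list[str]   # approach changes after seeing the code
    tool_trace: list[str]     # commands run, files read, tools called
    diff_summary: str         # what changed, and where the risky part is
    tests: dict[str, str]     # test name -> passed / failed / skipped
    cost_note: str            # retries, loops, context burn
    rollback: list[str]       # how to unwind it in staging

    def missing_fields(self) -> list[str]:
        """Fields a reviewer would have to chase down by hand."""
        return [name for name, value in vars(self).items() if not value]
```

An empty field shows up immediately, while the review context is still fresh, instead of surfacing as a question three comments into the PR.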

The permission boundary is the first thing I read

When I review agent-assisted work, I want to know the boundary before I look at the cleverness.

Read-only repo analysis is one risk category. Scoped branch edits are another. MCP calls, network access, secrets, database scripts, CI changes, and deploy commands are not the same thing with a different label. They are different operating modes.

A useful packet says something like:

Allowed:
- read repository files
- edit billing/validation/** and tests/billing/**
- run npm test -- billing-validation

Blocked:
- migrations
- deploy scripts
- payment provider adapters
- production credentials
- external MCP tools unless approved

Now the reviewer has something concrete to check. If the diff touches a migration, the run violated the boundary. If the tool trace includes an MCP call that was not approved, the issue is visible. If the agent stayed inside the envelope, that also matters. Trust should be earned by evidence, not by vibe.

This is the same production habit I wrote about in Claude Code permissions: the production mistake that bites later. Permissions are not there to make the workflow feel enterprise. They keep a useful run from spreading sideways.
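That boundary check does not have to stay manual. A sketch, assuming the packet lists allowed and blocked path globs and the run's touched files are available; the globs mirror the example above, and the blocked paths are illustrative, not a recommended set:

```python
from fnmatch import fnmatch

# Illustrative boundary, matching the example packet above.
# Note: fnmatch treats ** like *, so these patterns match nested paths too.
ALLOWED = ["billing/validation/**", "tests/billing/**"]
BLOCKED = ["migrations/**", "deploy/**", "payments/providers/**"]

def boundary_violations(touched_files: list[str]) -> list[str]:
    """Return one message per file the run touched outside the boundary."""
    violations = []
    for path in touched_files:
        if any(fnmatch(path, pattern) for pattern in BLOCKED):
            violations.append(f"blocked path touched: {path}")
        elif not any(fnmatch(path, pattern) for pattern in ALLOWED):
            violations.append(f"outside allowed set: {path}")
    return violations
```

Run it against the diff's file list before a human ever opens the PR; an empty result is the evidence that the run stayed inside the envelope.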

Keep the failed attempts

I would rather see a failed test in the packet than have it hidden behind the final green run.

The failed attempt often tells you whether the agent understood the task. Did it fix the actual bug, or did it change the test until the build passed? Did it remove coverage? Did it retry the same broken command five times? Did it load half the repository because the prompt was too vague?

That evidence is not embarrassing. It is how the team improves the loop.

A failure trail can turn into an eval. A bad command can become a guardrail. A repeated retry can expose a prompt that needs a tighter definition of done. If the packet only shows the happy ending, the team loses the part that teaches the workflow.
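Harvesting that trail can be as simple as folding each failed attempt into a candidate regression case. A sketch with a made-up trail record shape (`command`, `ok`, `output` are assumptions, not a Claude Code format):

```python
def evals_from_trail(trail: list[dict]) -> list[dict]:
    """Turn failed attempts from a run trail into candidate eval cases.

    Each trail entry is assumed to look like:
    {"command": str, "ok": bool, "output": str}
    """
    cases = []
    for step in trail:
        if not step["ok"]:
            cases.append({
                "repro_command": step["command"],
                "bad_output_snippet": step["output"][:200],
                "expectation": "future runs should not repeat this failure",
            })
    return cases
```

The cases still need a human to prune them, but the raw material comes from the packet for free instead of being reconstructed weeks later.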

Cost is a workflow signal

Cost belongs in the packet, even when nobody is trying to turn code review into a finance meeting.

Runaway token use is often a design smell. The prompt was too wide. The agent had too much context. The tool choice was sloppy. The task mixed investigation, implementation, and cleanup into one long wandering session.

If nobody sees the cost and retry count, nobody improves the agent loop. The team just notices the bill later and blames the model.

For a meaningful Claude Code run, I want a plain note:

Cost/retry note:
- 2 implementation attempts
- 1 failed targeted test before fix
- no repeated identical command failures
- no broad repository scan after plan approval
- stopped before touching shared retry utilities

That is enough. The point is not perfect accounting. The point is making waste visible while the review context is still fresh.
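The "no repeated identical command failures" line is checkable straight from the tool trace. A sketch, assuming each trace entry records the command and its exit status (the record shape is my assumption):

```python
from collections import Counter

def repeated_failures(trace: list[dict]) -> list[str]:
    """Commands that failed more than once in the run, a sign the agent looped.

    Each trace entry is assumed to be {"command": str, "exit_code": int}.
    """
    failed = Counter(
        step["command"] for step in trace if step["exit_code"] != 0
    )
    return [command for command, count in failed.items() if count > 1]
```

An empty list backs the note above with evidence; a non-empty one points at exactly which command the agent kept hammering.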

Rollback belongs in the same packet

Rollback should not be an afterthought at the end of the PR.

If Claude Code changes configuration, auth logic, database-facing code, generated clients, CI behavior, or integration boundaries, the packet should say how to back out. “Revert the PR” is sometimes enough. Often it is not.

A better rollback note says:

Rollback:
- revert this commit
- rerun billing validation tests
- check the duplicate invoice smoke case in staging
- no migration or shared provider contract changed

Now the reviewer knows the escape path. If the rollback note is complicated for a small task, that is useful too. It probably means the task was too wide and should be split before merge.

I covered this more directly in Claude Code rollback plans belong in the prompt. The short version: a run is easier to trust when the way out exists before the diff does.

Make review packets cheap

If the packet is painful to create, nobody will use it after the first enthusiastic week.

So make it part of the run, not a separate ritual. Put the expected packet fields in the prompt. Store the tool trace automatically where possible. Ask for short answers. Increase the strictness only when the risk justifies it.

For a low-risk refactor, the packet might be five lines. For a change near payment, auth, deployment, or data movement, it should be stricter. Same habit, different weight.
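Scaling the required fields with risk can itself be a small lookup instead of a per-review debate. A sketch; the tier names and field lists are my own illustration of the "same habit, different weight" idea:

```python
# Packet fields required at each risk tier. Tiers and fields are illustrative.
REQUIRED_FIELDS = {
    "low":  ["objective", "diff_summary", "tests"],
    "mid":  ["objective", "diff_summary", "tests", "tool_trace", "cost_note"],
    "high": ["objective", "permission_boundary", "plan_changes", "tool_trace",
             "diff_summary", "tests", "cost_note", "rollback"],
}

def missing_for_tier(packet: dict, tier: str) -> list[str]:
    """Fields the reviewer should reject the run for omitting at this tier."""
    return [name for name in REQUIRED_FIELDS[tier] if not packet.get(name)]
```

A low-risk refactor clears the bar with five lines; a change near payment or auth gets stopped at the door until the rollback and boundary fields exist.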

The rule I keep coming back to is simple: if the packet gets better over time, you can consider giving the agent more room. If the packet is thin, do not increase autonomy. Tighten the loop first.

That is the production layer around Claude Code. Not fear. Not blind trust. Evidence, boundaries, and a way back.

Want the lightweight version for your own workflow?
Start with the free Claude Code production checklist. It covers permissions, sandboxing, evals, observability, human review, cost limits, and rollback before you widen agent autonomy.

I am writing the deeper production layer in Claude Code: Building Production Agents That Actually Scale: review packets, permissions, MCP boundaries, evals, rollback, observability, and the operating loops that keep agents useful without pretending they are safe by default.

Prefer LeanPub? You can get the book here: Claude Code: Building Production Agents That Actually Scale.