Claude Code green tests are not a review packet
The most awkward Claude Code failure is not red CI.
Sometimes the run looks successful. The agent edits the code, changes a test, runs the suite, and hands back a calm final message. The diff looks plausible on first pass. The tests are green. Nothing is obviously on fire.
That is when I get more careful, not less.
Green tests answer one narrow question: did the checked behaviour pass under the tests that ran? They do not tell you whether Claude Code stayed inside the task boundary. They do not tell you which assumptions changed. They do not tell you whether an MCP tool was used, whether a failure was skipped, whether a test got rewritten around the bug, or whether the rollback path exists.
For production work, I want more than a result. I want a review packet.
The green suite can hide the interesting part
I like tests. I have built enough software in financial-services environments to have a healthy fear of untested confidence.
But tests are still only a sampling mechanism. They cover the behaviours someone thought to encode. A Claude Code run can satisfy that sample and still create a mess just outside the frame.
The common version looks like this:
- You ask for a small fix.
- The agent edits the obvious file.
- A test fails.
- It changes the implementation again.
- It adjusts a fixture or assertion.
- The suite passes.
Maybe that is fine. Maybe the test was wrong. Maybe the old assertion described a bug, not a contract.
Or maybe the agent quietly moved the goalposts.
The reviewer has to know which one happened. A final line saying “all tests pass” is not enough to tell the difference.
A result is not the same as evidence
There is a big difference between these two agent summaries:
Done. Tests pass.
and:
Done.
Task contract:
- Fix duplicate refund handling in billing/refunds only.
Changed:
- billing/refunds/retry_policy.py: added idempotency check before retry.
- tests/billing/test_refunds.py: added duplicate refund case.
Commands run:
- pytest tests/billing/test_refunds.py
- pytest tests/billing/test_refunds.py::test_duplicate_refund_is_ignored
Failures seen:
- First run failed because the old fixture did not include an idempotency key.
Tests changed:
- Added a new test.
- Did not weaken existing assertions.
Tools used:
- Local file reads, local test runner.
- No MCP tools.
Rollback:
- Revert this patch.
- Rerun pytest tests/billing/test_refunds.py.
The second one is still not proof, but it is much better than bare confidence.
It gives the reviewer handles. You can compare the promised scope with the actual diff. You can inspect the test change first. You can verify the command log. You can ask why the fixture needed an idempotency key. You can decide whether this is a narrow patch or the start of a deeper contract problem.
That is the job of a review packet: make the run replayable enough that a human can take responsibility for the merge.
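To make the packet concrete, here is a hedged sketch of what the change it describes might look like. Every name here (`RefundRetryPolicy`, `should_retry`, `seen_keys`) is hypothetical; the real `retry_policy.py` could differ entirely. The point is that "added idempotency check before retry" maps to a small, reviewable unit of behaviour:

```python
# Hypothetical sketch of the "idempotency check before retry" the packet
# describes. None of these names come from the actual repo.

class RefundRetryPolicy:
    """Retries a refund unless its idempotency key was already processed."""

    def __init__(self):
        self.seen_keys = set()

    def should_retry(self, request: dict) -> bool:
        key = request.get("idempotency_key")
        if key is None:
            # Without a key we cannot prove the refund is a duplicate,
            # so fall back to retrying. This is the gap the old fixture hit.
            return True
        if key in self.seen_keys:
            return False  # duplicate refund: do not retry
        self.seen_keys.add(key)
        return True
```

A reviewer who has the packet can check this unit against the promised scope in minutes; without it, the same check means reconstructing intent from the diff.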
The minimum packet I want from Claude Code
For serious repository work, my minimum review packet is short.
Review packet:
- original task contract
- allowed files, tools, and commands
- files changed, grouped by intent
- commands run, including failures
- tests added, changed, skipped, or unavailable
- MCP servers or external tools used
- assumptions made
- unresolved risks
- rollback note
This is not ceremony. It is a cheap way to avoid code-review archaeology.
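If you want to make the checklist mechanical, a few lines of script can reject a packet that skips sections before a human ever reads it. This is a minimal sketch; the section names are assumptions loosely matching the list above, not a standard Claude Code output format:

```python
# Minimal sketch: reject a review packet that is missing required sections.
# Section names below are assumptions, not an official packet schema.

REQUIRED_SECTIONS = [
    "task contract",
    "files changed",
    "commands run",
    "tests",
    "tools",
    "assumptions",
    "risks",
    "rollback",
]

def missing_sections(packet_text: str) -> list[str]:
    """Return the required sections that do not appear in the packet."""
    lowered = packet_text.lower()
    return [section for section in REQUIRED_SECTIONS if section not in lowered]
```

Run it as a pre-review gate: an empty return value means the packet is at least structurally complete, and anything else goes back to the agent before the diff gets read.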
Without it, the reviewer has to reconstruct the whole run from the diff and the chat transcript. Why did Claude Code touch this helper? Was that test already failing? Did it use a docs MCP server? Did it read a ticket, or did it infer the requirement from file names? Did it change a fixture because the fixture was wrong, or because the fixture was inconvenient?
Those questions are not nitpicks. They are the difference between “the agent produced a patch” and “the team can defend this patch”.
I wrote about the same operating habit in Claude Code review packet before approval and Claude Code needs a stop rule before more autonomy. The pattern keeps coming back because the real risk is rarely one dramatic mistake. It is the quiet loss of context between the run and the human who approves it.
Test rewrites deserve special attention
When an agent changes tests, slow down.
That does not mean test changes are bad. Sometimes the correct fix includes a new test, a better fixture, or a changed assertion because the old one encoded the wrong behaviour. I want Claude Code to improve tests when the task calls for it.
But I want the packet to say exactly what happened.
A good test section might say:
Tests:
- Added test_duplicate_refund_is_ignored.
- Updated refund_factory to include idempotency_key because production requests always include it.
- Did not remove or weaken existing assertions.
- Did not skip any tests.
A worrying one sounds like this:
Tests:
- Updated expected status from 409 to 200.
- Removed flaky assertion.
- Full suite not run due to time.
That might still be defensible, but it is not a casual merge. It needs a human decision.
This is where a lot of agent review gets fuzzy. Humans see green, skim the diff, and assume the behaviour is still protected. The packet should make the uncomfortable parts visible.
Scope drift is easier to spot when the scope is written down
Claude Code can be very good at following a trail. That is useful until the trail leaves the task.
A small frontend bug points to a shared formatter. The formatter points to a date helper. The helper points to a package update. The package update touches lockfiles. Suddenly a tiny task has changed five layers of the system.
Maybe the agent found the right root cause. Maybe it widened the blast radius because it was trying to make the local problem disappear.
The review packet should show the original boundary and the actual boundary:
Allowed scope:
- src/components/invoice-table/
- tests/components/invoice-table/
Actual changes:
- src/components/invoice-table/InvoiceDueDate.tsx
- tests/components/invoice-table/InvoiceDueDate.test.tsx
- src/lib/date-format.ts
Boundary note:
- Changed src/lib/date-format.ts because the component used the shared formatter.
- No package or lockfile changes.
Now the reviewer can make a real call. The shared helper change may be right. It may need a separate PR. It may require more tests. At least the boundary crossing is not hidden inside a confident final answer.
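The boundary comparison above is easy to automate once the scope is written down. This is a sketch, assuming changed paths come from something like `git diff --name-only` and the allowed scope is a list of path prefixes; `boundary_crossings` is an invented helper, not part of any tool:

```python
# Minimal sketch: flag changed files that fall outside the declared scope.
# boundary_crossings is a hypothetical helper; paths mirror the example above.

def boundary_crossings(changed: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Return changed paths that fall outside every allowed prefix."""
    return [
        path for path in changed
        if not any(path.startswith(prefix) for prefix in allowed_prefixes)
    ]

allowed = [
    "src/components/invoice-table/",
    "tests/components/invoice-table/",
]
changed = [
    "src/components/invoice-table/InvoiceDueDate.tsx",
    "tests/components/invoice-table/InvoiceDueDate.test.tsx",
    "src/lib/date-format.ts",
]
# boundary_crossings(changed, allowed) flags src/lib/date-format.ts,
# which is exactly the line the boundary note has to explain.
```

A non-empty result does not mean the change is wrong; it means the packet owes the reviewer a boundary note for each flagged path.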
The same idea applies to MCP. In Claude Code MCP tools need a blast radius, not a vibe check, I argued that tool access is blast radius. If the run used an MCP server, the packet should say which one, in what mode, and what it returned. A reviewer should not have to dig through the full transcript to discover that the agent queried internal docs or tried a write-capable tool.
Rollback is part of review, not an afterthought
A change is easier to approve when the way back is boring.
For agent work, I want the rollback note before approval. Not after something breaks. Before.
A useful rollback note can be plain:
Rollback:
- Revert this commit.
- No migration was added.
- No feature flag was changed.
- No external tool state was modified.
- Rerun pytest tests/billing/test_refunds.py after revert.
If rollback is not that simple, say so:
Rollback risk:
- This changes a migration.
- Existing rows may be backfilled by the deploy job.
- Revert needs a down migration and data check.
That does not mean the work cannot ship. It means the agent has moved into a higher-risk category. The reviewer should know that before the merge button becomes tempting.
What should make you stop the review
I would pause a Claude Code review when any of these show up:
- The packet says tests pass, but does not list the commands.
- Existing tests were changed without a clear behavioural reason.
- The diff touches auth, billing, permissions, data deletion, migrations, deployment config, package updates, or infrastructure outside the original task.
- The run used MCP tools but does not include a tool log.
- The agent claims something was verified, but the command log does not show it.
- The rollback note is missing or hand-wavy.
- The final summary is more confident than the evidence.
None of these automatically mean the patch is bad. They mean the human does not yet have enough evidence to own it.
That distinction matters. The point is not to slow Claude Code down for sport. The point is to avoid fake speed, where the agent saves twenty minutes and the reviewer spends an hour reconstructing the run.
A practical prompt to use tomorrow
If your team already uses Claude Code, add this to the end of the next non-trivial task:
Before final approval, produce a review packet.
Include:
- the original task contract
- all files changed and why
- commands run, with failures included
- tests added, changed, skipped, or not run
- tools or MCP servers used
- assumptions and unresolved risks
- rollback instructions
If you changed tests, explain whether you added coverage, corrected an outdated assertion, or weakened a check. Do not hide test changes behind "all tests pass".
Then review the packet before reviewing the diff.
That small ordering change helps. It tells you what story the agent believes it has produced. The diff tells you whether the story is true.
If you want a lightweight baseline for this operating loop, use the free Claude Code production checklist. It covers permissions, MCP boundaries, review packets, evals, observability, cost, and rollback.
You can also read the broader book notes on the Claude Code book page.
I am writing the full field guide in Claude Code: Building Production Agents That Actually Scale. It is for engineers who want agent autonomy with permissions, evals, observability, review packets, and rollback built in from the start.