Claude Code needs a flight recorder

A Claude Code pull request can look calm by the time it reaches review.

Small diff. Green tests. Sensible summary. Maybe even a neat note about the files it touched.

That is useful, but it is not enough. The part I want to see is the run path: what the agent tried, which tools it used, where it guessed, where it hit a boundary, and what evidence proves the final patch is safe.

In other words, Claude Code needs a flight recorder.

[Figure: Claude Code flight recorder diagram]

I do not mean a giant surveillance system or a compliance-theater dashboard. I mean a boring record that helps a reviewer answer one question:

Did this patch come from a controlled run, or did it come from a lucky wander through the repo?

The diff hides too much

Code review is already lossy. A reviewer sees the final patch, not every wrong turn that led to it.

With an AI coding agent, that gap gets wider. Claude Code may read files the reviewer never opens. It may run commands that fail before the final test passes. It may infer behaviour from logs, issue comments, or MCP tools that are not visible in the pull request.

The result can be technically correct and still hard to trust.

That matters most on production-adjacent work: billing paths, migrations, permissions, customer-impacting workflows, incident fixes, and anything that touches operational tooling. A two-line change in one of those areas deserves more than “tests pass”.

It deserves the story of the run.

What to record for every serious run

A useful flight recorder does not need to capture every token. It should capture the operational facts a human would care about later.

For a production Claude Code run, I want this record:

Task contract: what the agent was asked to do
Allowed scope: files, commands, tools, and data it could use
Actual reads: files, tickets, docs, logs, and MCP tools consulted
Actual writes: files changed and commands that modified state
Permission pressure: anything the agent asked for outside scope
Evidence: tests, checks, manual reasoning, and unresolved assumptions
Review packet: what changed, why, risks, and what to inspect first
Rollback note: how to undo or contain the change if it behaves badly
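
As a rough illustration, the eight fields above could live in one small structure that is easy to serialise and attach to a pull request. This is a minimal sketch, not a format Claude Code produces: the `RunRecord` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RunRecord:
    """One flight-recorder entry for a single agent run (hypothetical schema)."""

    task_contract: str  # what the agent was asked to do
    allowed_scope: list[str] = field(default_factory=list)  # files, commands, tools, data in scope
    actual_reads: list[str] = field(default_factory=list)  # files, tickets, docs, logs, MCP tools consulted
    actual_writes: list[str] = field(default_factory=list)  # files changed, state-modifying commands
    permission_pressure: list[str] = field(default_factory=list)  # requests outside scope
    evidence: list[str] = field(default_factory=list)  # tests, checks, reasoning, open assumptions
    review_packet: str = ""  # what changed, why, risks, what to inspect first
    rollback_note: str = ""  # how to undo or contain the change

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [name for name, value in asdict(self).items() if not value]


# Sample values are invented for illustration.
record = RunRecord(
    task_contract="Fix rounding in the invoice export",
    allowed_scope=["billing/export/*", "pytest"],
    actual_reads=["billing/export/invoice.py"],
    actual_writes=["billing/export/invoice.py"],
    evidence=["regression test fails before the patch and passes after it"],
)

print(json.dumps(asdict(record), indent=2))
print("Still empty:", record.missing_fields())
```

An empty `permission_pressure` list can be a legitimate answer, so treat `missing_fields()` as a prompt for the reviewer, not as a failure.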

That sounds heavier than a normal pull request template. It is. But only for the runs where the blast radius justifies it.

For a tiny refactor, keep it light. For a change near money, access, production data, or deployment paths, do not pretend the final diff tells the whole truth.

MCP makes this non-optional

MCP is where a coding agent stops being a clever local assistant and starts touching the real work system.

A read-only docs tool is low risk. A ticket lookup is usually fine. Logs with customer context are different. So are tools that can change issue state, inspect production-like data, create migrations, or trigger workflows.

If Claude Code uses MCP during a run, the review packet should say exactly which tools were available and which ones were actually used.
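
A minimal sketch of that comparison, assuming you can already list the tool names that were configured and the tool names that were called. The `tool_inventory` function and the tool names are hypothetical, and how you collect the two sets depends on your own setup.

```python
def tool_inventory(available: set[str], used: set[str]) -> dict[str, list[str]]:
    """Compare the tools a run could reach with the tools it actually called."""
    return {
        "available": sorted(available),
        "used": sorted(used & available),
        "unused": sorted(available - used),
        # Anything here means the declared scope and the real scope disagree.
        "used_but_not_declared": sorted(used - available),
    }


# Hypothetical tool names, for illustration only.
available = {"docs.search", "tickets.lookup", "logs.query"}
used = {"docs.search", "logs.query", "deploy.trigger"}

inventory = tool_inventory(available, used)
for key, names in inventory.items():
    print(f"{key}: {', '.join(names) or 'none'}")
```

The last bucket is the one worth reading first in review: a tool that was used but never declared is exactly the "in scope without anyone realising" case.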

The uncomfortable case is not always “the agent did something bad”. Often it is subtler:

The agent solved the ticket by relying on context from a tool the team did not realise was in scope.

That is not a model failure. It is an observability failure.

You cannot tune permissions if you cannot see the pressure those permissions are under.

A good recorder catches false confidence

The scariest agent runs are not the ones that fail loudly. They are the ones that end with a confident summary after a messy path.

A flight recorder makes false confidence easier to spot. Reviewers can see when Claude Code tried three theories before landing on the fourth. They can see when the passing test was narrow. They can see when the agent avoided a risky tool because the boundary was clear, or when it kept asking to cross that boundary because the task was bigger than expected.

That is the signal teams need.

One pattern I like is a short “evidence ladder” at the end of the run:

What is proved:
- The failing unit test now passes.
- The new regression test fails before the patch and passes after it.
- The changed code is limited to the invoice export path.

What is not proved:
- Behaviour under the full nightly batch.
- Interaction with legacy refund exports.
- Production data shape for older accounts.

That second half is where the honesty lives. It gives the reviewer something concrete to do instead of squinting at a polished summary.
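
If you want the ladder to be produced the same way every time, a small helper can render it. This is a sketch under my own assumptions: `render_evidence_ladder` is a hypothetical name, and refusing an empty "not proved" list is a design choice you may prefer to relax.

```python
def render_evidence_ladder(proved: list[str], not_proved: list[str]) -> str:
    """Render the two halves of an evidence ladder as plain text."""
    if not not_proved:
        # Assumption: a run that claims to have proved everything deserves a second look.
        raise ValueError("list at least one thing the run did not prove")
    lines = ["What is proved:"]
    lines += [f"- {item}" for item in proved]
    lines += ["", "What is not proved:"]
    lines += [f"- {item}" for item in not_proved]
    return "\n".join(lines)


ladder = render_evidence_ladder(
    proved=["The failing unit test now passes."],
    not_proved=["Behaviour under the full nightly batch."],
)
print(ladder)
```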

Use the recorder to improve the system

The flight recorder is not only for audit. It is a tuning loop.

After a few weeks, you should be able to answer:

  • Which MCP tools are used often enough to become standard?
  • Which tools create repeated permission pressure?
  • Which prompts produce clean review packets?
  • Which tasks keep escaping their original scope?
  • Which rollback notes are too vague to trust?
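
The first two questions are simple counting once the records exist. Here is a minimal sketch, assuming each run is stored as a dictionary with `tools_used` and `permission_requests` lists. Those keys, the `summarize` function, and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical records from three runs.
runs = [
    {"tools_used": ["docs.search", "tickets.lookup"], "permission_requests": ["logs.query"]},
    {"tools_used": ["docs.search"], "permission_requests": ["logs.query", "deploy.trigger"]},
    {"tools_used": ["docs.search", "logs.query"], "permission_requests": []},
]


def summarize(runs: list[dict]) -> tuple[Counter, Counter]:
    """Count how often each tool was used and how often each was requested out of scope."""
    usage = Counter(tool for run in runs for tool in run["tools_used"])
    pressure = Counter(tool for run in runs for tool in run["permission_requests"])
    return usage, pressure


usage, pressure = summarize(runs)
print("Most used tools:", usage.most_common(3))
print("Most requested outside scope:", pressure.most_common(3))
```

A tool that keeps showing up in the pressure count is a candidate for a deliberate decision: either bring it into scope with guardrails or make the boundary clearer in the task contract.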

That is how a team moves from “we allow Claude Code” to “we know how Claude Code behaves in our workflow”.

I trust the second statement much more.

Start with one required field

If your team already uses Claude Code, add this to the next production-adjacent run:

Run evidence:
What did Claude Code read, run, assume, and prove before producing this patch?

Do not make it fancy. Just make it mandatory for risky work.

Once that field starts exposing gaps, add the rest: tool inventory, permission pressure, unresolved assumptions, and rollback notes.

The free Claude Code production checklist includes this operating loop alongside permissions, MCP access, evals, review packets, and rollback.

For the full system, I wrote the Claude Code book for engineers and teams moving from impressive solo demos to production use. The landing page points to the Amazon Kindle edition as the main purchase option.
