AI code generation is easy to underestimate because the first version looks harmless.
A model suggests a function. You read it. You change a name. You reject half of it. The risk is real, but it is familiar. It feels like code review with a faster autocomplete.
That is not where the interesting production risk starts.
The risk changes when AI code generation becomes action: editing files, running commands, calling tools, opening pull requests, touching generated clients, changing CI, or using MCP servers that reach beyond the repository.
At that point the model is no longer just writing text for a human to copy. It is participating in the software delivery system.
Suggestions are not the same as actions
A code suggestion has a natural brake built in. The human has to do something with it.
That brake is imperfect. Humans skim. Review gets lazy. Bad code can still land. But the workflow is easy to understand: model proposes, human chooses.
A coding agent has a different shape. Claude Code can inspect a repo, plan a change, edit files, run tests, react to errors, and produce a diff. Other AI coding agents do the same from IDEs, sandboxes, CLIs, or agent frameworks. The interface varies. The boundary shift is the same.
When the tool acts, production questions arrive whether the team is ready or not (see the run-log sketch after this list):
- What could it read?
- What could it write?
- Which commands did it run?
- Which tool calls crossed out of the repo?
- Which tests did it actually execute?
- What did it skip?
- How do we undo the work?
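Most of those questions only have answers if the harness records actions as they happen. Here is a minimal sketch in Python, assuming a JSONL audit file; `log_event` and the event kinds are hypothetical names, not any agent's real API:

```python
import json
import time

def log_event(log_path: str, kind: str, detail: str) -> None:
    """Append one agent action to a JSONL audit file at the moment it happens."""
    event = {"ts": time.time(), "kind": kind, "detail": detail}
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

# The harness, not the model, calls this around every action it executes:
# log_event("run.jsonl", "file_read", "src/billing/retry.py")
# log_event("run.jsonl", "file_write", "src/billing/retry.py")
# log_event("run.jsonl", "command", "pytest tests/test_retry.py -x")
# log_event("run.jsonl", "tool_call", "mcp: jira.create_issue")
```

If nothing writes this file, the questions above are unanswerable after the fact.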
That is why I do not like treating production AI code generation as a prompt-quality problem. Better prompts help. They do not replace operating controls.
The productivity story is true, but incomplete
AI code generation can absolutely make teams faster.
It can remove blank-page friction. It can draft tests. It can explain unfamiliar code. It can handle boring refactors. It can turn a half-formed issue into a useful implementation plan.
I use these tools because the productivity is real.
But the production version of the story has a second half: every speed gain changes the review burden. If a coding agent can produce five small diffs in the time a human used to produce one, the team needs a better way to understand those diffs. If the agent can run tools, the team needs a better way to see which tools ran. If the agent can widen its own investigation, the team needs a sharper stop rule.
Otherwise the team gets faster at creating work it cannot properly review.
That is not leverage. That is deferred risk.
The action boundary
The line I care about is the action boundary.
On one side, the model suggests.
On the other, the agent changes state.
State can mean a file on disk, a branch, a package lock, a CI config, a generated client, a test fixture, an issue comment, an internal API call, or a database-adjacent script. Some of those actions are low risk. Some are not. The mistake is pretending they all belong in one bucket called “AI coding”.
A production workflow should classify actions before it automates them:
| Action | Default posture |
|---|---|
| Read public project files | usually safe inside a scoped workspace |
| Edit a narrow module | allowed after task and path boundaries are clear |
| Run targeted tests | usually encouraged, but recorded |
| Change dependencies | review gate |
| Edit CI, deploy, auth, billing, migrations, or generated clients | separate approval |
| Call network or MCP tools | explicit allowlist and audit trail |
| Touch secrets or production data | blocked by default |
This is not anti-agent. It is how useful agents earn more room.
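One way to make the table executable is a small classifier the harness consults before applying an edit. A sketch, assuming typical repo layout; `Posture`, `classify_edit`, and the path prefixes are illustrative names, not a standard:

```python
from enum import Enum

class Posture(Enum):
    ALLOW = "allow"                          # inside the scoped workspace
    RECORD = "record"                        # allowed, but logged for review
    REVIEW_GATE = "review_gate"              # human reviews before merge
    SEPARATE_APPROVAL = "separate_approval"  # outside the normal flow
    BLOCKED = "blocked"

# Hypothetical examples of production-shaped paths and lockfiles.
SEPARATE_APPROVAL_PREFIXES = (
    ".github/", "deploy/", "auth/", "billing/", "migrations/", "generated/",
)
LOCKFILES = {"package-lock.json", "poetry.lock", "go.sum", "Cargo.lock"}

def classify_edit(path: str) -> Posture:
    if any(path.startswith(p) for p in SEPARATE_APPROVAL_PREFIXES):
        return Posture.SEPARATE_APPROVAL
    if path.split("/")[-1] in LOCKFILES:
        return Posture.REVIEW_GATE           # dependency changes get a human gate
    return Posture.ALLOW
```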
Evidence becomes the product feature
The awkward truth: the final diff is not enough.
A diff tells you what changed. It does not tell you what the agent tried first, which files it inspected, which commands failed, which tests were skipped, or whether it used a tool that changed the risk category of the run.
For serious work, I want a small review record attached to the change:
- the original task
- the allowed paths and tools
- the plan before editing
- the files read and changed
- the commands and tests run
- failed attempts and retries
- risky assumptions
- rollback path
I wrote about this more directly in the Claude Code review packet I want before approving agent work. The short version is simple: a clean diff is useful, but it is not the whole story.
If the agent leaves no evidence, you are not reviewing the run. You are reviewing the artifact that survived the run.
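A minimal sketch of that record as a dataclass, rendered into the pull request description next to the diff; the fields mirror the list above, and `ReviewPacket` is an illustrative name, not a real tool:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewPacket:
    task: str
    allowed_paths: list[str]
    allowed_tools: list[str]
    plan: str
    files_read: list[str] = field(default_factory=list)
    files_changed: list[str] = field(default_factory=list)
    commands_run: list[str] = field(default_factory=list)
    failed_attempts: list[str] = field(default_factory=list)
    risky_assumptions: list[str] = field(default_factory=list)
    rollback_path: str = ""

    def to_markdown(self) -> str:
        """Render the packet for the pull request description."""
        rows = [
            ("Task", self.task),
            ("Plan", self.plan),
            ("Files changed", ", ".join(self.files_changed) or "none"),
            ("Commands run", ", ".join(self.commands_run) or "none"),
            ("Failed attempts", ", ".join(self.failed_attempts) or "none"),
            ("Risky assumptions", ", ".join(self.risky_assumptions) or "none"),
            ("Rollback", self.rollback_path or "MISSING"),
        ]
        return "\n".join(f"**{title}:** {body}" for title, body in rows)
```

A reviewer who sees "Rollback: MISSING" at the top of a pull request knows exactly what to ask for.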
Permissions are not a later hardening step
Teams often start permissive because it makes the demo smoother.
Full repo access. Broad shell access. Network on. A few MCP tools wired in because they are convenient. Then, once the workflow is popular, someone asks whether it is safe enough.
That order is backwards.
Permissions are not polish you add after adoption. They shape the work from the first run. A narrow workspace changes what the agent can accidentally touch. A blocked path changes whether a prompt mistake becomes a file-system mistake. A tool allowlist changes whether an agent can turn a coding task into a business-system task.
This is the point I made in Claude Code permissions: the production mistake that bites later. Mode is not blast radius. Tool scope is blast radius.
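Enforcement is the other half: the classification earlier decides the posture, but something has to check it before each action runs. A deny-by-default sketch, with hypothetical tool names and blocked paths:

```python
# Deny-by-default gate the harness runs before executing any agent action.
# Tool names and path prefixes are illustrative, not any agent's real config.
ALLOWED_TOOLS = {"read_file", "edit_file", "run_tests"}
BLOCKED_PREFIXES = ("secrets/", ".env", "deploy/", "infra/")

def authorize(tool: str, target: str) -> bool:
    if tool not in ALLOWED_TOOLS:
        return False  # network and MCP tools need an explicit allowlist entry
    if any(target.startswith(p) for p in BLOCKED_PREFIXES):
        return False  # a prompt mistake stays a prompt mistake, not a file-system one
    return True
```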
Evals should come from the runs that worried you
Generic coding benchmarks can be useful, but they do not know your repo.
They do not know that your generated clients should never be hand-edited. They do not know your slow integration test is the one everyone skips. They do not know your billing retry logic is full of weird edge cases from 2019. They do not know which deploy files should require a human.
Production evals should start with the agent runs that made you nervous.
Turn those runs into replayable cases. Did the agent touch a forbidden path? Did it weaken a test? Did it claim success without running the right command? Did it keep exploring after the stop rule? Did it finish with no rollback note?
That is why Claude Code evals should start with bad runs. The bad run is not just a failure. It is a specification for the next guardrail.
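Against a recorded transcript (the JSONL events sketched earlier), those questions become assertions. A sketch; the path prefixes, command check, and the 20-write threshold are assumptions to tune per repo:

```python
def check_run(events: list[dict]) -> list[str]:
    """Replay a recorded run and return every guardrail it violated."""
    failures = []
    writes = [e["detail"] for e in events if e["kind"] == "file_write"]
    commands = [e["detail"] for e in events if e["kind"] == "command"]
    if any(w.startswith("generated/") for w in writes):
        failures.append("edited a generated client by hand")
    if writes and not any(c.startswith("pytest") for c in commands):
        failures.append("claimed success without running the right command")
    if len(writes) > 20:
        failures.append("kept exploring after the stop rule")
    return failures
```

Each nervous-making run you convert this way becomes a regression test for the next guardrail.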
Cost is part of the same system
AI code generation in production is not only about correctness and safety. Cost tells you whether the workflow is shaped well.
A vague task makes the agent wander. Broad tool access lets it inspect too much. Missing project context makes it rediscover the same facts every day. No stop rule lets a successful run keep expanding until the diff is harder to review than the original bug.
That is why I think Claude Code agent cost loops start as workflow bugs. Spend is often the smoke. The fire is usually scope.
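A stop rule can be as blunt as a budget check between agent steps. A sketch with illustrative numbers:

```python
# Budget numbers are assumptions; tune them per repo and per task shape.
MAX_FILES_READ = 40   # wide reads usually mean a vague task
MAX_TOOL_CALLS = 60   # rediscovering context that should have been provided
MAX_DIFF_LINES = 300  # past this, the diff is harder to review than the bug

def should_stop(files_read: int, tool_calls: int, diff_lines: int) -> str | None:
    if files_read > MAX_FILES_READ:
        return "scope: narrow the task or the allowed paths"
    if tool_calls > MAX_TOOL_CALLS:
        return "cost loop: feed the missing context instead of paying for rediscovery"
    if diff_lines > MAX_DIFF_LINES:
        return "stop rule: split the change before asking for review"
    return None
```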
Rollback belongs before the merge
If an AI coding agent changes production-shaped code, the rollback plan should not arrive after the reviewer asks for it.
It should be part of the task.
A rollback note forces smaller changes. It makes risky files visible. It gives the reviewer an escape path. It also changes the agent’s behavior, because a prompt that asks for reversibility tends to discourage wide cleanup and clever detours.
I covered the prompt-level version in Claude Code rollback plans belong in the prompt. For production AI code generation, rollback is not pessimism. It is the price of moving faster without lying to yourself.
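At the prompt level, that can be a fixed clause appended to every production-shaped task. A sketch; the wording is illustrative:

```python
# A hedged example of a reversibility clause; adapt the wording to your workflow.
ROLLBACK_CLAUSE = """\
Before you finish:
1. List every file you changed and why.
2. State the exact revert path (a git command or a feature flag).
3. If the change cannot be reverted in one step, split it into smaller diffs.
Avoid cleanup and refactors that make the revert wider.
"""

def build_task(task: str) -> str:
    return task.strip() + "\n\n" + ROLLBACK_CLAUSE
```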
The bridge from AI code generation to production agents
This is the bridge article I think more teams need.
AI code generation is not one thing. It is a ladder:
autocomplete
→ suggested function
→ generated patch
→ tool-using coding agent
→ pull request automation
→ production-adjacent workflow
Each step can be useful. Each step changes the risk.
The mistake is carrying the mental model from the first step into the fifth. A suggestion can be judged mostly by code quality. An agent run has to be judged by boundaries, evidence, tool use, tests, review, cost, and rollback.
That is the operating layer around Claude Code and other AI coding agents.
If your team is still experimenting, keep the experiments small. If you are moving into production-shaped work, build the operating controls before widening autonomy.
Start with the free Claude Code production checklist. It gives you the basic review, permission, observability, eval, cost, and rollback questions to ask before an agent gets more room.
I am writing the deeper field guide in Claude Code: Building Production Agents That Actually Scale. It is for engineers who already believe the tools are useful and now need the system around them to be sane.