The AI POC is no longer the impressive part.

A small team can wire up a model, retrieval, a prompt, a tool call, and a slick internal demo in a week. Sometimes in a day. The room nods. Someone says “this could save hours.” Someone else asks when it can go live.

That is usually the moment the project becomes dangerous.

Not because the model is useless. Often it is useful. The danger is that a working demo borrows trust from the production systems around it. It looks like software, so people assume the operating model exists somewhere. Most of the time, it does not.

From AI POC to production operating model

A POC answers the wrong question

A POC asks: can this be made to work?

Production asks different questions:

  • who owns the result when it is wrong?
  • what data is it allowed to see?
  • what is the expected quality bar?
  • which failures are acceptable?
  • how will we know the model drifted?
  • what will this cost at normal usage?
  • what gets logged?
  • what must never be logged?
  • how do we roll back?

Those questions are not bureaucracy. They are the difference between experimenting with AI and operating AI.

This is why so many AI projects stall after the demo. The POC was built to prove capability. The production service needs to prove control.

The operating model is the missing artifact

For a normal software service, the operating model is often implicit. We know the shape: CI, deployment, monitoring, alerts, runbooks, ownership, incident response, access control, release notes.

Generative AI breaks that muscle memory in subtle ways.

The behavior is probabilistic. The inputs are often open-ended. The output can be plausible and wrong. Retrieval changes the answer surface. Tool calls can change state. Evaluation is not just unit tests. Logs may contain sensitive prompts or customer data. A small prompt edit can behave like a code change and a product-policy change at the same time.

So the operating model has to be explicit.

At minimum, I want to see six things before a serious AI feature goes beyond a narrow pilot.

1. An owner who can say no

Every production AI system needs a real owner. Not an innovation sponsor. Not a steering committee. An owner.

That owner must be able to say:

  • this use case is not ready
  • this model change does not ship
  • this data source is out of scope
  • this failure rate is too high
  • this tool call needs a human gate

Without that authority, the AI system becomes a shared enthusiasm object. Everyone likes the demo. Nobody owns the failure.

In financial services, this matters even more. A bad answer is rarely just a bad answer. It can become a customer outcome, a compliance issue, a trading decision, a support promise, or a control failure.

2. Evals based on actual failure modes

A generic benchmark does not know your business.

It does not know your legacy policy wording. It does not know which product names confuse customers. It does not know that one internal document is stale but still highly ranked in search. It does not know that your support agents use three different names for the same exception process.

Production evals should start with the cases that worry the team.

Build a small, versioned set of examples:

  Eval case                      Why it matters
  confusing customer question    tests whether the model asks for clarification
  stale retrieval document       tests source ranking and refusal behavior
  policy edge case               tests whether the answer invents certainty
  prompt injection attempt       tests instruction hierarchy
  tool-call request              tests permission and approval boundaries
  regulated phrase               tests wording and escalation rules

Do not wait for a perfect eval platform. A simple file of cases, expected behaviors, and reviewer notes is better than a dashboard with no teeth.
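
It can start smaller than people expect. Here is a minimal sketch in Python, assuming nothing but a versioned file of cases; every field name and case is illustrative, not a standard:

```python
# eval_cases.py -- a minimal, versioned eval set. No platform required.
# Field names and cases are illustrative; the point is cases + expected
# behavior + reviewer notes, checked into version control.

EVAL_CASES = [
    {
        "id": "confusing-question-001",
        "input": "Can I do the thing with my account like before?",
        "expected_behavior": "asks a clarifying question instead of guessing",
        "reviewer_notes": "Based on a real transcript; model previously invented a policy.",
    },
    {
        "id": "stale-doc-002",
        "input": "What is the current overdraft fee?",
        "expected_behavior": "cites the current fee schedule, not the stale PDF",
        "reviewer_notes": "Old PDF still ranks first in retrieval.",
    },
]

def run_case(answer_fn, case):
    """Run one case and record the output next to the expected behavior.

    A human (or a judge model) still decides pass/fail; the value is that
    every model, prompt, or retrieval change reruns the same cases.
    """
    output = answer_fn(case["input"])
    return {"id": case["id"], "output": output, "expected": case["expected_behavior"]}

if __name__ == "__main__":
    def fake_answer(question):
        return "stub answer"  # replace with a call into your real pipeline

    for case in EVAL_CASES:
        print(run_case(fake_answer, case))
```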

The point is not to get a magic score. The point is to make quality visible enough that a model, prompt, retrieval, or tool change can be argued about like an engineering change.

3. Observability that shows the run, not just the graph

Normal service observability tells you whether the system is slow, expensive, or broken.

AI observability has to tell you what happened inside a single answer.

For a production AI run, I want the trail:

  • user input or sanitized input
  • system and task instructions
  • retrieved sources
  • tool calls requested
  • tool calls approved or blocked
  • model version
  • prompt version
  • output
  • eval checks
  • human override, if any

OpenAI’s Agents SDK documentation treats tracing as a first-class concept for agent runs. LangSmith and other LLM observability tools take a similar view: you need to inspect the chain, not just the HTTP request.

That direction is right. If a team cannot replay the path to a bad answer, it cannot operate the system. It can only apologize for it.

4. A permission model for tools and data

The risk jumps when an AI system stops answering and starts acting.

A chatbot that summarizes a document can be wrong. An agent that updates a ticket, calls a payment workflow, changes a customer record, or triggers a downstream job can be wrong with consequences.

Tool access should be treated like production access:

  • read-only before write
  • narrow scopes before broad scopes
  • sandbox before live systems
  • human approval before irreversible actions
  • separate permission for customer data, money movement, credentials, and regulated records
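
In code, that posture can be as dull as a deny-by-default registry checked before every call. A minimal sketch; the tool names, scopes, and flags are invented for illustration:

```python
# A static tool registry checked before every call. The pattern is
# deny-by-default plus explicit gates; all entries are illustrative.
TOOL_POLICY = {
    "search_kb":     {"scope": "read",  "needs_human": False, "live": True},
    "update_ticket": {"scope": "write", "needs_human": False, "live": True},
    "move_money":    {"scope": "write", "needs_human": True,  "live": False},  # sandbox only
}

class ToolBlocked(Exception):
    pass

def gate_tool_call(tool_name: str, human_approved: bool = False) -> None:
    """Raise unless this tool call is allowed under the current policy."""
    policy = TOOL_POLICY.get(tool_name)
    if policy is None:
        raise ToolBlocked(f"{tool_name}: not in registry (deny by default)")
    if not policy["live"]:
        raise ToolBlocked(f"{tool_name}: sandbox only, not enabled in production")
    if policy["needs_human"] and not human_approved:
        raise ToolBlocked(f"{tool_name}: requires human approval before execution")

# Usage: the agent loop calls the gate before executing anything.
gate_tool_call("search_kb")        # allowed, read-only
# gate_tool_call("move_money")     # raises: sandbox only
```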

OWASP’s Top 10 for Large Language Model Applications is useful here because it names the security problems in language that engineering teams can work with: prompt injection, sensitive information disclosure, insecure output handling, excessive agency, and supply-chain risk.

The phrase I keep coming back to is excessive agency. It is the cleanest warning label for many production AI mistakes. The model did not need that much room.

5. Cost as an architectural signal

Cost is not just finance. Cost is feedback.

When an AI workflow is too expensive, the reason is often architectural:

  • the prompt is carrying context the system should already know
  • retrieval is noisy
  • the agent is allowed to wander
  • the model is too large for the task
  • retries hide quality problems
  • no one defined a stop rule

A useful production review asks where the tokens go. If half the spend is repeated boilerplate, stale context, unnecessary retrieval, or agent loops, the fix is not only a cheaper model. The fix is a tighter system.
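
That review does not need special tooling. Assuming each trace record carries per-component token counts, as in the observability sketch above (illustrative field names again), a few lines answer where the tokens go:

```python
import json
from collections import Counter

# Sum tokens by component across the trace log. Assumes each run record
# carries a "token_usage" dict like {"system_prompt": 900, "retrieval": 2400,
# "output": 300} -- illustrative fields, not a standard.
spend = Counter()
with open("ai_runs.jsonl") as f:
    for line in f:
        run = json.loads(line)
        spend.update(run.get("token_usage", {}))

total = sum(spend.values()) or 1
for component, tokens in spend.most_common():
    print(f"{component:>15}: {tokens:>10} tokens ({100 * tokens / total:.0f}%)")
```

If that printout shows half the spend in boilerplate or retrieval, the conversation about architecture starts itself.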

This is the same discipline good engineering teams already use for cloud cost. Waste usually points to design.

6. Rollout and rollback before applause

The worst AI rollout plan is “ship it and watch Slack.”

A better one is boring:

  • limited audience
  • clear success criteria
  • sampled human review
  • kill switch
  • prompt and model versioning
  • rollback path
  • post-launch eval review
  • support path for bad answers
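
Most of that fits in one boring config the service reads at request time. A minimal sketch; every name and number here is illustrative:

```python
# rollout.py -- the boring controls, readable at 3 a.m. The point is that
# shipping, sampling, and killing the feature are config changes, not deploys.
ROLLOUT = {
    "enabled": True,              # the kill switch: flip to False to stop serving
    "audience_percent": 5,        # limited audience first
    "prompt_version": "v14",      # pinned, so rollback is a one-line change
    "model_version": "example-model-2025-01",
    "human_review_sample": 0.10,  # fraction of answers routed to human review
}

def serve_ai_answer(user_id: int) -> bool:
    """Decide per request whether this user gets the AI feature."""
    if not ROLLOUT["enabled"]:
        return False  # kill switch wins over everything
    return (user_id % 100) < ROLLOUT["audience_percent"]
```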

NIST’s AI Risk Management Framework uses the language of govern, map, measure, and manage. That may sound heavy for a blog post, but the underlying point is practical: know the context, measure the risk, assign responsibility, and keep managing it after launch.

That is production work. It is not anti-innovation. It is how useful AI survives contact with customers, auditors, on-call engineers, and tired humans on a Friday afternoon.

The POC-to-production checklist I would use

Before moving a generative AI feature beyond a small pilot, I would ask for this packet:

  1. the use case and non-use cases
  2. data sources and data that must never be used
  3. model, prompt, and retrieval versions
  4. eval cases from real failure modes
  5. observability trail for a single run
  6. tool permissions and approval gates
  7. security review against OWASP LLM risks
  8. cost estimate at realistic usage
  9. named owner and support path
  10. rollout, rollback, and kill-switch plan
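
If it helps to make the ask concrete, the packet fits in one reviewable object, and a launch gate can refuse to proceed while any item is missing. A minimal sketch, with field names mirroring the checklist above:

```python
from dataclasses import dataclass, fields

# The launch packet as a single reviewable object. A CI gate can block
# launch while any field is empty. Field names mirror the checklist.
@dataclass
class LaunchPacket:
    use_case: str
    non_use_cases: str
    data_sources: str
    forbidden_data: str
    versions: str            # model, prompt, retrieval
    eval_cases_path: str
    example_trace_path: str
    tool_permissions: str
    security_review: str     # against OWASP LLM risks
    cost_estimate: str
    owner: str
    support_path: str
    rollout_plan: str        # includes rollback and kill switch

def missing_items(packet: LaunchPacket) -> list[str]:
    """Return the names of empty fields; an empty list means go."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name).strip()]
```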

If that sounds like too much, the feature is probably not production-shaped yet.

And that is fine. Keep it a POC. Learn from it. Use it internally. Narrow the use case.

Just do not confuse a demo that works with a system that can be operated.

Sources worth reading