The AI POC is no longer the impressive part.

A small team can wire up a model, retrieval, a prompt, a tool call, and a slick internal demo in a week. Sometimes in a day. The room nods. Someone says “this could save hours.” Someone else asks when it can go live.

That is usually the moment the project becomes dangerous.

Not because the model is useless. Often it is useful. The danger is that a working demo borrows trust from the production systems around it. It looks like software, so people assume the operating model exists somewhere. Most of the time, it does not.

From AI POC to production operating model

A POC answers the wrong question

A POC asks: can this be made to work?

Production asks different questions:

  • who owns the result when it is wrong?
  • what data is it allowed to see?
  • what is the expected quality bar?
  • which failures are acceptable?
  • how will we know the model drifted?
  • what will this cost at normal usage?
  • what gets logged?
  • what must never be logged?
  • how do we roll back?

Those questions are not bureaucracy. They are the difference between experimenting with AI and operating AI.

This is why so many AI projects stall after the demo. The POC was built to prove capability. The production service needs to prove control.

The operating model is the missing artifact

For a normal software service, the operating model is often implicit. We know the shape: CI, deployment, monitoring, alerts, runbooks, ownership, incident response, access control, release notes.

Generative AI breaks that muscle memory in subtle ways.

The behavior is probabilistic. The inputs are often open-ended. The output can be plausible and wrong. Retrieval changes the answer surface. Tool calls can change state. Evaluation is not just unit tests. Logs may contain sensitive prompts or customer data. A small prompt edit can behave like a code change and a product-policy change at the same time.

So the operating model has to be explicit.

At minimum, I want to see six things before a serious AI feature goes beyond a narrow pilot.

1. An owner who can say no

Every production AI system needs a real owner. Not an innovation sponsor. Not a steering committee. An owner.

That owner must be able to say:

  • this use case is not ready
  • this model change does not ship
  • this data source is out of scope
  • this failure rate is too high
  • this tool call needs a human gate

Without that authority, the AI system becomes a shared enthusiasm object. Everyone likes the demo. Nobody owns the failure.

In financial services, this matters even more. A bad answer is rarely just a bad answer. It can become a customer outcome, a compliance issue, a trading decision, a support promise, or a control failure.

2. Evals based on actual failure modes

A generic benchmark does not know your business.

It does not know your legacy policy wording. It does not know which product names confuse customers. It does not know that one internal document is stale but still highly ranked in search. It does not know that your support agents use three different names for the same exception process.

Production evals should start with the cases that worry the team.

Build a small, versioned set of examples:

  Eval case                      Why it matters
  confusing customer question    tests whether the model asks for clarification
  stale retrieval document       tests source ranking and refusal behavior
  policy edge case               tests whether the answer invents certainty
  prompt injection attempt       tests instruction hierarchy
  tool-call request              tests permission and approval boundaries
  regulated phrase               tests wording and escalation rules

Do not wait for a perfect eval platform. A simple file of cases, expected behaviors, and reviewer notes is better than a dashboard with no teeth.
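
It can start smaller than people expect. Here is a minimal sketch in Python, assuming nothing but a versioned file of cases; every field name and case is illustrative, not a standard:

```python
# eval_cases.py -- a minimal, versioned eval set. No platform required.
# Field names and cases are illustrative; the point is cases + expected
# behavior + reviewer notes, checked into version control.

EVAL_CASES = [
    {
        "id": "confusing-question-001",
        "input": "Can I do the thing with my account like before?",
        "expected_behavior": "asks a clarifying question instead of guessing",
        "reviewer_notes": "Based on a real transcript; model previously invented a policy.",
    },
    {
        "id": "stale-doc-002",
        "input": "What is the current overdraft fee?",
        "expected_behavior": "cites the current fee schedule, not the stale PDF",
        "reviewer_notes": "Old PDF still ranks first in retrieval.",
    },
]

def run_case(answer_fn, case):
    """Run one case and record the output next to the expected behavior.

    A human (or a judge model) still decides pass/fail; the value is that
    every model, prompt, or retrieval change reruns the same cases.
    """
    output = answer_fn(case["input"])
    return {"id": case["id"], "output": output, "expected": case["expected_behavior"]}

if __name__ == "__main__":
    def fake_answer(question):
        return "stub answer"  # replace with a call into your real pipeline

    for case in EVAL_CASES:
        print(run_case(fake_answer, case))
```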

The point is not to get a magic score. The point is to make quality visible enough that a model, prompt, retrieval, or tool change can be argued about like an engineering change.

3. Observability that shows the run, not just the graph

Normal service observability tells you whether the system is slow, expensive, or broken.

AI observability has to tell you what happened inside a single answer.

For a production AI run, I want the trail:

  • user input or sanitized input
  • system and task instructions
  • retrieved sources
  • tool calls requested
  • tool calls approved or blocked
  • model version
  • prompt version
  • output
  • eval checks
  • human override, if any

OpenAI’s Agents SDK documentation treats tracing as a first-class concept for agent runs. LangSmith and other LLM observability tools take a similar view: you need to inspect the chain, not just the HTTP request.

That direction is right. If a team cannot replay the path to a bad answer, it cannot operate the system. It can only apologize for it.

4. A permission model for tools and data

The risk jumps when an AI system stops answering and starts acting.

A chatbot that summarizes a document can be wrong. An agent that updates a ticket, calls a payment workflow, changes a customer record, or triggers a downstream job can be wrong with consequences.

Tool access should be treated like production access:

  • read-only before write
  • narrow scopes before broad scopes
  • sandbox before live systems
  • human approval before irreversible actions
  • separate permission for customer data, money movement, credentials, and regulated records
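
In code, that posture can be as dull as a deny-by-default registry checked before every call. A minimal sketch; the tool names, scopes, and flags are invented for illustration:

```python
# A static tool registry checked before every call. The pattern is
# deny-by-default plus explicit gates; all entries are illustrative.
TOOL_POLICY = {
    "search_kb":     {"scope": "read",  "needs_human": False, "live": True},
    "update_ticket": {"scope": "write", "needs_human": False, "live": True},
    "move_money":    {"scope": "write", "needs_human": True,  "live": False},  # sandbox only
}

class ToolBlocked(Exception):
    pass

def gate_tool_call(tool_name: str, human_approved: bool = False) -> None:
    """Raise unless this tool call is allowed under the current policy."""
    policy = TOOL_POLICY.get(tool_name)
    if policy is None:
        raise ToolBlocked(f"{tool_name}: not in registry (deny by default)")
    if not policy["live"]:
        raise ToolBlocked(f"{tool_name}: sandbox only, not enabled in production")
    if policy["needs_human"] and not human_approved:
        raise ToolBlocked(f"{tool_name}: requires human approval before execution")

# Usage: the agent loop calls the gate before executing anything.
gate_tool_call("search_kb")        # allowed, read-only
# gate_tool_call("move_money")     # raises: sandbox only
```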

OWASP’s Top 10 for Large Language Model Applications is useful here because it names the security problems in language that engineering teams can work with: prompt injection, sensitive information disclosure, insecure output handling, excessive agency, and supply-chain risk.

The phrase I keep coming back to is excessive agency. It is the cleanest warning label for many production AI mistakes. The model did not need that much room.

5. Cost as an architectural signal

Cost is not just finance. Cost is feedback.

When an AI workflow is too expensive, the reason is often architectural:

  • the prompt is carrying context the system should already know
  • retrieval is noisy
  • the agent is allowed to wander
  • the model is too large for the task
  • retries hide quality problems
  • no one defined a stop rule

A useful production review asks where the tokens go. If half the spend is repeated boilerplate, stale context, unnecessary retrieval, or agent loops, the fix is not only a cheaper model. The fix is a tighter system.
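
That review does not need special tooling. Assuming each trace record carries per-component token counts, as in the observability sketch above (illustrative field names again), a few lines answer where the tokens go:

```python
import json
from collections import Counter

# Sum tokens by component across the trace log. Assumes each run record
# carries a "token_usage" dict like {"system_prompt": 900, "retrieval": 2400,
# "output": 300} -- illustrative fields, not a standard.
spend = Counter()
with open("ai_runs.jsonl") as f:
    for line in f:
        run = json.loads(line)
        spend.update(run.get("token_usage", {}))

total = sum(spend.values()) or 1
for component, tokens in spend.most_common():
    print(f"{component:>15}: {tokens:>10} tokens ({100 * tokens / total:.0f}%)")
```

If that printout shows half the spend in boilerplate or retrieval, the conversation about architecture starts itself.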

This is the same discipline good engineering teams already use for cloud cost. Waste usually points to design.

6. Rollout and rollback before applause

The worst AI rollout plan is “ship it and watch Slack.”

A better one is boring:

  • limited audience
  • clear success criteria
  • sampled human review
  • kill switch
  • prompt and model versioning
  • rollback path
  • post-launch eval review
  • support path for bad answers
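
Most of that fits in one boring config the service reads at request time. A minimal sketch; every name and number here is illustrative:

```python
# rollout.py -- the boring controls, readable at 3 a.m. The point is that
# shipping, sampling, and killing the feature are config changes, not deploys.
ROLLOUT = {
    "enabled": True,              # the kill switch: flip to False to stop serving
    "audience_percent": 5,        # limited audience first
    "prompt_version": "v14",      # pinned, so rollback is a one-line change
    "model_version": "example-model-2025-01",
    "human_review_sample": 0.10,  # fraction of answers routed to human review
}

def serve_ai_answer(user_id: int) -> bool:
    """Decide per request whether this user gets the AI feature."""
    if not ROLLOUT["enabled"]:
        return False  # kill switch wins over everything
    return (user_id % 100) < ROLLOUT["audience_percent"]
```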

NIST’s AI Risk Management Framework uses the language of govern, map, measure, and manage. That may sound heavy for a blog post, but the underlying point is practical: know the context, measure the risk, assign responsibility, and keep managing it after launch.

That is production work. It is not anti-innovation. It is how useful AI survives contact with customers, auditors, on-call engineers, and tired humans on a Friday afternoon.

The POC-to-production checklist I would use

Before moving a generative AI feature beyond a small pilot, I would ask for this packet:

  1. the use case and non-use cases
  2. data sources and data that must never be used
  3. model, prompt, and retrieval versions
  4. eval cases from real failure modes
  5. observability trail for a single run
  6. tool permissions and approval gates
  7. security review against OWASP LLM risks
  8. cost estimate at realistic usage
  9. named owner and support path
  10. rollout, rollback, and kill-switch plan
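
If it helps to make the ask concrete, the packet fits in one reviewable object, and a launch gate can refuse to proceed while any item is missing. A minimal sketch, with field names mirroring the checklist above:

```python
from dataclasses import dataclass, fields

# The launch packet as a single reviewable object. A CI gate can block
# launch while any field is empty. Field names mirror the checklist.
@dataclass
class LaunchPacket:
    use_case: str
    non_use_cases: str
    data_sources: str
    forbidden_data: str
    versions: str            # model, prompt, retrieval
    eval_cases_path: str
    example_trace_path: str
    tool_permissions: str
    security_review: str     # against OWASP LLM risks
    cost_estimate: str
    owner: str
    support_path: str
    rollout_plan: str        # includes rollback and kill switch

def missing_items(packet: LaunchPacket) -> list[str]:
    """Return the names of empty fields; an empty list means go."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name).strip()]
```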

If that sounds like too much, the feature is probably not production-shaped yet.

And that is fine. Keep it a POC. Learn from it. Use it internally. Narrow the use case.

Just do not confuse a demo that works with a system that can be operated.

Sources worth reading