Financial services will use AI agents because the work is full of workflows agents can help with.
Research. Reconciliation. Customer operations. Internal controls. Software delivery. Document review. Case triage. Exception handling. Fraud investigation. Reporting prep. The list is long, and some of it is painfully manual.
So I am not interested in the lazy argument that agents are all hype.
The better argument is sharper: in financial services, an AI agent is only as useful as the control surface around it.
The demo will always understate the risk
A good demo chooses a clean path.
The agent reads a policy. It summarizes a case. It drafts a response. It opens a ticket. It suggests the next action. Everyone can see the efficiency.
Production is not that clean.
The customer record may be incomplete. The policy may have regional exceptions. The account may be subject to a restriction. The source document may be stale. The agent may need a tool. The tool may write state. The action may require segregation of duties. The output may become part of an audit trail.
That is where the demo story runs out.
Treat agency like access
The phrase “AI agent” makes people think about intelligence.
For production design, I would rather think about access.
What can it read? What can it write? Which system can it call? Which action needs approval? Which action is forbidden? Which records must be masked? Which run is sampled for review? Which decision must remain human?
An agent with broad tools is a privileged actor. It may not have intent, but it has reach. Reach is what controls are for.
A sensible progression looks like this:
| Stage | Agent can do | Control posture |
|---|---|---|
| assist | summarize and draft | human owns the action |
| recommend | propose next step | evidence and source links required |
| prepare | fill forms or tickets | human approval before submit |
| execute narrow action | call approved tool | audit trail and rollback path |
| execute wider workflow | coordinate multiple tools | separate risk review |
Most teams should spend longer in the middle than they want to.
That is not cowardice. It is how you find the real boundary.
Audit trail is a feature, not paperwork
Financial systems already know this. If a human changes something important, we care who did it, when, why, and under what authority.
Agents do not get a free pass because the interface feels conversational.
For every meaningful agent action, I want a record:
- original user request
- policy or document sources used
- model and prompt version
- tools available at the time
- tool calls attempted
- approval or rejection event
- final action taken
- rollback option
- human owner
If the agent only drafts text, the trail can be lighter. If it changes state, the trail needs to be serious.
This is where AI engineering meets old-fashioned operational control. Good. The old-fashioned part exists for a reason.
Evals should include policy and abuse cases
A financial-services agent needs more than accuracy tests.
It needs cases that represent pressure:
- ambiguous customer requests
- conflicting policy documents
- restricted account scenarios
- prompt injection inside retrieved text
- attempts to bypass approval
- requests for sensitive data
- tool failures
- stale or missing records
- regional policy differences
OWASP’s LLM risk categories are useful here because they make the abuse cases concrete. Prompt injection and excessive agency are not theoretical when an agent can call tools. Sensitive information disclosure is not theoretical when traces and prompts contain customer context.
The eval suite should include bad behavior the business actually fears.
Human-in-the-loop is not one thing
People use “human in the loop” as if it solves the problem by itself.
It does not.
A tired human approving a vague agent summary is not a control. A reviewer with sources, diffs, tool calls, confidence notes, and a clear approval question might be.
The design question is: what exactly is the human approving?
- the answer wording?
- the cited source?
- the tool call?
- the customer-visible action?
- the exception decision?
- the rollback plan?
If the approval screen hides the evidence, the control is theatre.
The first production use cases should be narrow
I would start with agent workflows that are useful but bounded:
- internal research over approved documents
- case summarization with source links
- draft responses that require human send
- reconciliation explanation, not automatic correction
- engineering runbooks that prepare commands but do not execute production changes
- compliance evidence gathering, not final sign-off
These workflows still need controls, but they let the organization learn without pretending the agent is ready to own a regulated process.
Autonomy can widen later. It should widen because the evidence supports it, not because the demo was exciting.
The rule I would use
An AI agent in financial services can get more autonomy when it leaves better evidence than the manual process it replaces.
That is the bar.
If the agent makes the process faster but harder to audit, be careful. If it makes the process faster and easier to inspect, now there is something worth scaling.
The future of AI agents in finance will not be decided by prompt cleverness alone. It will be decided by permission design, evals, observability, governance, and boring operational discipline.
Boring is not the enemy here.
Boring is what lets useful systems stay useful.