By David Johnsen · June 24, 2026

AI · agents · automation · operations

How to Build an AI Agent That Survives Production

The best agents are not magic chat windows. They are narrow workflow systems with clear purpose, typed tools, evals, human review, and staged autonomy.

Most agent projects start in the wrong place.

Someone sees a model do something impressive, then asks where an agent could be added. That usually creates a clever demo and a fragile system. The better starting point is quieter: pick a real workflow, understand where judgment happens, decide what should stay deterministic, and only give the agent autonomy where autonomy earns its keep.

That framing matters because useful agents are less like chatbots and more like small operating systems for work. They sense what changed, interpret context, decide what to do next, call tools, hand off to people, learn from outcomes, and keep a record of what happened.

Takeaway
The best agents match their autonomy to the workflow, the risk, and the evidence behind them.

Operating-loop layers: Purpose through Govern
Maturity stages before trust: draft to limited autonomy
Governed learning: no agents that approve their own changes

This is the overview piece in a series on building agents for company work. It maps the whole loop end to end. Later pieces go deep on the parts that earn it: choosing what to automate first, writing tool contracts, treating evals as product features, and letting agents improve without rewriting themselves.

Start With the Workflow

Before prompts, models, vector stores, or frameworks, write down the workflow the agent is supposed to improve.

Who owns the outcome? What triggers the work? What systems contain the needed data? Which decisions are routine, and which need judgment? What would count as a good result? What would count as harm?

This sounds basic, but it prevents the most common failure mode: building an agent with an impressive interface and no operational center of gravity. If the workflow is vague, the agent will be vague too.

Model-first agent

Starts with a model capability
Prompt absorbs every rule and exception
Tools appear as the build gets messy
Success is judged by demo quality
Ownership is unclear after launch

Workflow-first agent

Starts with a business outcome
Scope, non-goals, and owner are explicit
Tools are designed before implementation
Success is judged against real fixtures
Launch includes monitoring and review

For company work, I like mapping an agent through a simple loop: Purpose, Sense, Interpret, Decide, Orchestrate, Learn. Governance runs across all of it.

Purpose: what job does the agent own, what is out of scope, and who is accountable for the result?

Sense: what data, documents, app events, messages, or user context does it need, and how fresh does that information need to be?

Interpret: what policies, exceptions, prior decisions, source citations, or domain rules shape the answer?

Decide: what can the agent decide, what can it recommend, and what requires human approval?

Orchestrate: what tools, APIs, queues, retries, confirmations, and audit logs are part of the work?

Learn: how do evals, traces, human edits, incidents, and production outcomes improve the next version?

What changes

If you cannot answer these questions, the next step is discovery, not implementation.
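One way to make that check concrete is to write the loop down as a small spec and see which layers are still empty. This is a minimal sketch, not a prescribed format: `WorkflowSpec`, `unanswered`, and `next_step` are hypothetical names, and the field contents are whatever your team would actually write down.

```python
from dataclasses import dataclass, field, fields


@dataclass
class WorkflowSpec:
    """One field per layer of the loop. All names here are illustrative."""
    purpose: str = ""                                      # job owned, non-goals, accountable owner
    sense: list[str] = field(default_factory=list)         # data sources and freshness needs
    interpret: list[str] = field(default_factory=list)     # policies, exceptions, domain rules
    decide: dict[str, str] = field(default_factory=dict)   # action -> "decide" | "recommend" | "needs_approval"
    orchestrate: list[str] = field(default_factory=list)   # tools, queues, retries, audit logs
    learn: list[str] = field(default_factory=list)         # evals, traces, reviewer edits


def unanswered(spec: WorkflowSpec) -> list[str]:
    """Return the loop layers that nobody has filled in yet."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


def next_step(spec: WorkflowSpec) -> str:
    """Any empty layer means the work is still discovery."""
    return "discovery" if unanswered(spec) else "implementation"
```

A spec with only a purpose filled in still returns `"discovery"`, which is the point: the gaps are visible before anyone writes a prompt.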

Keep the Agent Smaller Than the System

A mistake I see a lot: the agent becomes responsible for everything because it can technically reason about everything. That is how you get brittle autonomy.

The better pattern is to keep the agent narrow and make the surrounding system strong. Deterministic steps should stay deterministic. Normal app logic should stay normal app logic. The agent should handle the parts that actually benefit from flexible interpretation, recovery, summarization, routing, or tool selection.

For example, an invoice agent does not need to be a free-roaming finance employee. It can extract the invoice, check it against policy, compare it to purchase data, identify exceptions, draft a recommendation, and route anything risky to a human reviewer. That is enough to remove a lot of manual work without pretending the agent is now the accounting department.
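Here is a rough sketch of the deterministic shell around that invoice agent. The model would sit in the extraction and drafting steps, which are left out; everything shown is ordinary app logic. The `Invoice` fields, the check names, and the 5000.0 approval limit are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invoice:
    vendor: str
    amount: float
    po_number: Optional[str]


def find_exceptions(invoice: Invoice, po_amounts: dict[str, float], approval_limit: float) -> list[str]:
    """Deterministic policy checks. No model is involved here."""
    issues = []
    if invoice.po_number is None:
        issues.append("missing purchase order")
    elif invoice.po_number not in po_amounts:
        issues.append("unknown purchase order")
    elif invoice.amount > po_amounts[invoice.po_number]:
        issues.append("amount exceeds purchase order")
    if invoice.amount > approval_limit:
        issues.append("amount above approval limit")
    return issues


def route(invoice: Invoice, po_amounts: dict[str, float], approval_limit: float = 5000.0) -> dict:
    """Draft a recommendation and send anything risky to a human reviewer."""
    issues = find_exceptions(invoice, po_amounts, approval_limit)
    return {
        "recommendation": "hold" if issues else "approve",
        "exceptions": issues,
        # The agent only drafts; a person signs off on anything with exceptions.
        "queue": "human_review" if issues else "approval_draft",
    }
```

Even the clean path ends in a draft, not a payment. The agent removes manual work without owning the final decision.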

Takeaway
The shape of the work earns an agent its autonomy. Model quality alone never does.

An edge twin is a narrow workflow replica at the edge of the business. It handles one real flow end-to-end, but under supervision. The inputs are real, the tools are real, and the outcomes are measurable. The blast radius is intentionally small.

Good candidates are repeatable, high-friction workflows with clear inputs and outputs: intake triage, document processing, lead research, compliance review, report drafting, status updates, reconciliation, or exception routing.

The goal is not to automate the whole company at once. The goal is to prove one loop, build trust, and then expand from evidence.

Give Tools Contracts

Agents become dangerous when their tools are broad, ambiguous, and poorly logged.

A tool should be a contract. It should say what inputs it accepts, what output it returns, what permissions it has, whether it reads or writes, how it fails, whether it is idempotent, and when human approval is required.

Tool names should describe business actions, not implementation details. "Create invoice approval draft" is better than "call endpoint." "List open compliance packets" is better than "query database." The agent should be operating in the language of the workflow.

Weak tool surface

One broad database tool
Read and write mixed together
Free-form string inputs
Errors returned as plain text
No approval boundary

Strong tool surface

Small tools mapped to business actions
Read, write, and destructive paths separated
Typed inputs and outputs
Structured failures and retry behavior
Human approval for sensitive actions

Every tool contract should answer these questions:

What is the tool allowed to do?
What exact schema does it accept and return?
Is it read-only, write-capable, or destructive?
What data scope can it access?
What should happen on timeout, partial failure, or missing data?
Can the call be retried safely?
Does it require preview, approval, or rollback?
Where is the call logged for audit and recovery?

What changes

If the agent needs broad ambient credentials to do its job, the tool layer is probably not ready yet.
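A contract like this can live in code instead of in a prompt. The sketch below assumes a very simple schema (field name to Python type) and a plain list as the audit log; `ToolContract`, `Access`, and `call_tool` are hypothetical names, not any framework's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Access(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class ToolContract:
    name: str                       # a business action, e.g. "create_invoice_approval_draft"
    access: Access
    input_schema: dict[str, type]   # typed inputs: field name -> expected type
    idempotent: bool                # can the call be retried safely?
    requires_approval: bool
    handler: Callable[..., dict]


def call_tool(tool: ToolContract, args: dict, approved: bool = False,
              audit_log: Optional[list] = None) -> dict:
    """Validate, gate, execute, and log one call. Failures come back structured."""
    def finish(result: dict) -> dict:
        if audit_log is not None:
            audit_log.append({"tool": tool.name, "args": args, "result": result})
        return result

    for key, expected in tool.input_schema.items():
        if not isinstance(args.get(key), expected):
            return finish({"ok": False, "error": "invalid_input", "field": key, "retryable": False})
    # Destructive tools always need approval, whatever the flag says.
    if (tool.requires_approval or tool.access is Access.DESTRUCTIVE) and not approved:
        return finish({"ok": False, "error": "approval_required", "retryable": False})
    try:
        return finish({"ok": True, "data": tool.handler(**args)})
    except TimeoutError:
        # Only idempotent calls are marked safe to retry.
        return finish({"ok": False, "error": "timeout", "retryable": tool.idempotent})
```

The agent never sees raw credentials here. It sees a named action, a typed schema, and a structured answer it can reason about, including when to stop and ask.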

Treat Evals Like Product Features

A demo tells you whether an agent can work once. Evals tell you whether it can keep working after the prompt changes, the model changes, the data gets weird, or the business process grows new exceptions.

Good evals are not only happy-path examples. They include ambiguous requests, missing data, bad inputs, tool failures, unsafe requests, user corrections, permission boundaries, and cases where the right answer is to ask for help.

You also need traces. For agents, the path matters. Which tool did it call? What did it decide not to do? Did it escalate at the right moment? Did it cite the right source? Did it act within the permission boundary?

Demo confidence

Looks good on three examples
Only final output is reviewed
Failures become anecdotes
Prompt changes are manual judgment
Production feedback is disconnected

Release confidence

Fixtures cover edge and failure cases
Tool calls and escalation paths are reviewed
Failures become regression cases
Changes are gated by evals
Feedback updates the next test set

Takeaway
Without evals, an agent is just a recurring surprise wearing a production badge.

This does not mean every eval has to be elaborate. Start small. Pin a handful of real examples, add synthetic edge cases, and record the expected behavior clearly. The important part is that the agent's future versions have to prove they still satisfy the workflow.
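A small version of this can be a list of fixtures and one function that checks the path as well as the outcome. The sketch assumes the agent is a callable that returns a trace shaped like `{"tool_calls": [...], "escalated": bool}`; that shape, and the names `Fixture` and `run_evals`, are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Fixture:
    name: str
    request: str
    expected_tools: list[str]                                  # the path, not just the answer
    forbidden_tools: list[str] = field(default_factory=list)   # permission boundary
    must_escalate: bool = False                                # right answer may be "ask for help"


def run_evals(agent: Callable[[str], dict], fixtures: list[Fixture]) -> list[str]:
    """Return failure descriptions. An empty list means the gate passes."""
    failures = []
    for fx in fixtures:
        trace = agent(fx.request)  # assumed shape: {"tool_calls": [...], "escalated": bool}
        calls = trace["tool_calls"]
        if calls != fx.expected_tools:
            failures.append(f"{fx.name}: expected tools {fx.expected_tools}, got {calls}")
        used_forbidden = [t for t in calls if t in fx.forbidden_tools]
        if used_forbidden:
            failures.append(f"{fx.name}: called forbidden tools {used_forbidden}")
        if trace["escalated"] != fx.must_escalate:
            failures.append(f"{fx.name}: escalation was {trace['escalated']}, expected {fx.must_escalate}")
    return failures
```

Each production failure can become one more `Fixture`, so the test set grows from real evidence and every prompt or model change has to pass the same gate.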

Stage Autonomy Like a Promotion

The autonomy question is not whether to grant it, but when, where, and with what evidence.

Most useful agents should move through maturity levels. First they are a draft. Then a production candidate. Then supervised production. Then scheduled supervised production. Only after enough history should they get limited autonomy, and even then only for narrow, reversible, low-risk actions.

How much autonomy is justified?

The right answer depends on risk, evidence, and reversibility.

At the draft stage, use realistic examples and fast iteration. Keep tools read-only or mocked. The goal is to learn whether the workflow shape is right before you wire it into production systems.

Success at this stage means the agent can explain the work, identify missing context, and produce useful drafts. It should not be making real changes yet.

This is slower than handing the model a giant toolbelt on day one. It is also how you get agents that survive contact with real operations.
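The promotion ladder can be encoded so that no stage gets skipped. In this sketch the stage names follow the levels listed above, but the evidence measure (`clean_runs`) and the threshold of 50 are invented placeholders; a real gate would use whatever history your team trusts.

```python
from enum import IntEnum


class Stage(IntEnum):
    DRAFT = 1
    PRODUCTION_CANDIDATE = 2
    SUPERVISED_PRODUCTION = 3
    SCHEDULED_SUPERVISED_PRODUCTION = 4
    LIMITED_AUTONOMY = 5


def promote(stage: Stage, clean_runs: int, required_clean_runs: int = 50) -> Stage:
    """Move up one level at most, and only with enough evidence.

    required_clean_runs is a hypothetical threshold, not a recommendation.
    """
    if stage is Stage.LIMITED_AUTONOMY or clean_runs < required_clean_runs:
        return stage
    return Stage(stage + 1)


def may_act_unsupervised(stage: Stage, reversible: bool, low_risk: bool) -> bool:
    """Even at the top stage, only narrow, reversible, low-risk actions qualify."""
    return stage is Stage.LIMITED_AUTONOMY and reversible and low_risk
```

Because `promote` only ever returns the current stage or the next one, an agent cannot jump from draft to autonomy on the strength of one good week.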

Let Agents Learn Without Self-Approval

Agents should learn from their runs. They should notice recurring failures, eval gaps, missing context, bad tool contracts, unclear prompts, and places where human reviewers keep making the same edit.

But learning should not mean the agent silently rewrites its own behavior. That is a different risk category.

A safer pattern is to separate signal from authority. The agent can emit an improvement signal. A separate review path turns signals into proposals. Evals and safety checks run against the proposal. A human or release owner approves meaningful behavior changes. The approved change ships with version notes and rollback.

Unsafe learning

Agent edits its own prompt after a bad run
Feedback is treated as trusted instruction
Tool permissions expand by convenience
No rollback note
No one can explain why behavior changed

Governed learning

Agent emits a redacted improvement signal
Proposal names evidence, risk, and affected artifacts
Eval and safety gates run before adoption
Owner approval is required for behavior changes
Release notes and rollback travel with the change

Capture: collect run evidence, reviewer edits, eval failures, incidents, and recurring friction as structured signals.

Propose: turn similar signals into a concrete change request with scope, risk, affected artifacts, and expected impact.

Evaluate: run regression cases, safety checks, permission-boundary checks, and human review where judgment matters.

Release: update prompts, tools, evals, docs, and rollback notes together. Do not let behavior drift invisibly.

What changes

Agents can suggest improvements. They should not approve their own expanded authority.
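The separation of signal from authority can be enforced with a release check. This is a sketch under simple assumptions: `Signal`, `Proposal`, and `can_release` are hypothetical names, and approval is modeled as a single named owner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    source: str    # "agent", "reviewer", "eval", or "incident"
    summary: str   # redacted description of the recurring friction


@dataclass
class Proposal:
    change: str
    risk: str
    affected_artifacts: list[str]   # prompts, tools, evals, docs
    evidence: list[Signal]
    proposed_by: str
    evals_passed: bool = False
    approved_by: Optional[str] = None
    rollback_note: str = ""


def can_release(p: Proposal) -> bool:
    """Evidence, passing gates, an independent approver, and a rollback note."""
    return (
        bool(p.evidence)
        and p.evals_passed
        and p.approved_by is not None
        and p.approved_by != p.proposed_by   # no self-approval
        and bool(p.rollback_note)
    )
```

The agent can fill in `proposed_by` as often as it likes. It just can never be the value in `approved_by` for its own change.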

What I Would Build First

If you are starting from zero, do not begin with a general agent platform. Pick one workflow that already hurts.

Map the current workflow and name the owner.
Choose one narrow edge-twin candidate with clear inputs and outputs.
Write the tool contracts before implementation.
Create 10 to 20 eval fixtures from real and synthetic cases.
Run supervised first, with human approval on anything sensitive.
Turn every failure into either a better tool, a better eval, or a clearer boundary.

That is enough to learn quickly without creating a system nobody trusts.

The compounding advantage is not one brilliant prompt. It is a workflow that gets clearer, safer, and easier to run every time it sees reality.

The Bar

Useful agents are built from the same ingredients as useful software: clear ownership, narrow interfaces, good tests, observability, staged rollout, and a way to learn without losing control.

The model matters. The framework matters. But the bigger difference is whether the system around the model understands the work.

Start there.

Keep Reading

The companion pieces in this series go deeper on the parts that decide whether an agent earns its place:

AI Workflow Ranking: what to automate first
A repeatable system audit framework for production software

David Johnsen

Founder, CloudBuddy Solutions
