How to Build an AI Agent That Survives Production
The best agents are not magic chat windows. They are narrow workflow systems with clear purpose, typed tools, evals, human review, and staged autonomy.
Most agent projects start in the wrong place.
Someone sees a model do something impressive, then asks where an agent could be added. That usually creates a clever demo and a fragile system. The better starting point is quieter: pick a real workflow, understand where judgment happens, decide what should stay deterministic, and only give the agent autonomy where autonomy earns its keep.
That framing matters because useful agents are less like chatbots and more like small operating systems for work. They sense what changed, interpret context, decide what to do next, call tools, hand off to people, learn from outcomes, and keep a record of what happened.

Takeaway: The best agents match their autonomy to the workflow, the risk, and the evidence behind them.
This is the overview piece in a series on building agents for company work. It maps the whole loop end to end. Later pieces go deep on the parts that earn it: choosing what to automate first, writing tool contracts, treating evals as product features, and letting agents improve without rewriting themselves.
Start With the Workflow
Before prompts, models, vector stores, or frameworks, write down the workflow the agent is supposed to improve.
Who owns the outcome? What triggers the work? What systems contain the needed data? Which decisions are routine, and which need judgment? What would count as a good result? What would count as harm?
This sounds basic, but it prevents the most common failure mode: building an agent with an impressive interface and no operational center of gravity. If the workflow is vague, the agent will be vague too.
A demo-first build tends to look like this:
- Starts with a model capability
- Prompt absorbs every rule and exception
- Tools appear as the build gets messy
- Success is judged by demo quality
- Ownership is unclear after launch
A workflow-first build looks like this:
- Starts with a business outcome
- Scope, non-goals, and owner are explicit
- Tools are designed before implementation
- Success is judged against real fixtures
- Launch includes monitoring and review
For company work, I like mapping an agent through a simple loop: Purpose, Sense, Interpret, Decide, Orchestrate, Learn. Governance runs across all of it.
Purpose: what job does the agent own, what is out of scope, and who is accountable for the result?
Sense: what data, documents, app events, messages, or user context does it need, and how fresh does that information need to be?
Interpret: what policies, exceptions, prior decisions, source citations, or domain rules shape the answer?
Decide: what can the agent decide, what can it recommend, and what requires human approval?
Orchestrate: what tools, APIs, queues, retries, confirmations, and audit logs are part of the work?
Learn: how do evals, traces, human edits, incidents, and production outcomes improve the next version?
If you cannot answer these questions, the next step is discovery, not implementation.
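As a rough illustration, the loop can be written down as a small spec that refuses to move past discovery while any answer is blank. This is a minimal sketch: the class name `AgentWorkflowSpec`, its field names, and the `next_step` helper are assumptions made for this example, not part of any framework.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentWorkflowSpec:
    """One short written answer per stage of the loop (hypothetical structure)."""
    purpose: str = ""      # job owned, non-goals
    owner: str = ""        # who is accountable for the result
    sense: str = ""        # data sources and freshness needs
    interpret: str = ""    # policies, exceptions, domain rules
    decide: str = ""       # decide vs. recommend vs. needs approval
    orchestrate: str = ""  # tools, queues, retries, audit logs
    learn: str = ""        # evals, traces, feedback paths


def unanswered(spec: AgentWorkflowSpec) -> list[str]:
    """Return the names of the questions that still have no answer."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def next_step(spec: AgentWorkflowSpec) -> str:
    """Any blank answer means the work is still discovery."""
    return "discovery" if unanswered(spec) else "implementation"


draft = AgentWorkflowSpec(
    purpose="Triage inbound invoices and route exceptions",
    owner="Accounts payable lead",
)
print(unanswered(draft))  # → ['sense', 'interpret', 'decide', 'orchestrate', 'learn']
print(next_step(draft))   # → discovery
```

The point of the sketch is the gate, not the data structure: a one-page document with the same seven headings does the same job.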
Keep the Agent Smaller Than the System
A mistake I see a lot: the agent becomes responsible for everything because it can technically reason about everything. That is how you get brittle autonomy.
The better pattern is to keep the agent narrow and make the surrounding system strong. Deterministic steps should stay deterministic. Normal app logic should stay normal app logic. The agent should handle the parts that actually benefit from flexible interpretation, recovery, summarization, routing, or tool selection.
For example, an invoice agent does not need to be a free-roaming finance employee. It can extract the invoice, check it against policy, compare it to purchase data, identify exceptions, draft a recommendation, and route anything risky to a human reviewer. That is enough to remove a lot of manual work without pretending the agent is now the accounting department.
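To make the split concrete, here is a minimal sketch of the deterministic part of that invoice flow. The policy and purchase-order checks are ordinary code, and anything that trips a check is routed to a reviewer. The model-driven steps (extraction and drafting the wording) are left out. The `Invoice` shape, the `APPROVAL_LIMIT` value, and the `PURCHASE_ORDERS` lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

APPROVAL_LIMIT = 5_000.00               # hypothetical policy limit
PURCHASE_ORDERS = {"PO-1001": 1200.00}  # hypothetical purchase data


@dataclass(frozen=True)
class Invoice:
    vendor: str
    amount: float
    po_number: Optional[str]


@dataclass(frozen=True)
class Recommendation:
    action: str          # "approve_draft" or "route_to_human"
    reasons: tuple = ()


def find_exceptions(invoice: Invoice) -> list[str]:
    """Deterministic checks: no model call is needed for any of these."""
    reasons = []
    if invoice.po_number is None:
        reasons.append("missing purchase order")
    elif invoice.po_number not in PURCHASE_ORDERS:
        reasons.append("unknown purchase order")
    elif abs(PURCHASE_ORDERS[invoice.po_number] - invoice.amount) > 0.01:
        reasons.append("amount does not match purchase order")
    if invoice.amount > APPROVAL_LIMIT:
        reasons.append("amount exceeds policy limit")
    return reasons


def recommend(invoice: Invoice) -> Recommendation:
    """Draft an approval only when nothing is flagged; otherwise hand off."""
    reasons = find_exceptions(invoice)
    if reasons:
        return Recommendation("route_to_human", tuple(reasons))
    return Recommendation("approve_draft")


print(recommend(Invoice("Acme", 1200.00, "PO-1001")).action)  # → approve_draft
print(recommend(Invoice("Acme", 9000.00, None)).action)       # → route_to_human
```

Note that even the clean path only produces a draft. The agent recommends; the surrounding system decides what happens to the recommendation.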
Takeaway: The shape of the work earns an agent its autonomy. Model quality alone never does.
A useful pattern here is the edge twin: a narrow workflow replica at the edge of the business. It handles one real flow end-to-end, but under supervision. The inputs are real, the tools are real, and the outcomes are measurable. The blast radius is intentionally small.
Good candidates are repeatable, high-friction workflows with clear inputs and outputs: intake triage, document processing, lead research, compliance review, report drafting, status updates, reconciliation, or exception routing.
The goal is not to automate the whole company at once. The goal is to prove one loop, build trust, and then expand from evidence.
Give Tools Contracts
Agents become dangerous when their tools are broad, ambiguous, and poorly logged.
A tool should be a contract. It should say what inputs it accepts, what output it returns, what permissions it has, whether it reads or writes, how it fails, whether it is idempotent, and when human approval is required.
Tool names should describe business actions, not implementation details. "Create invoice approval draft" is better than "call endpoint." "List open compliance packets" is better than "query database." The agent should be operating in the language of the workflow.
Tools without contracts tend to look like this:
- One broad database tool
- Read and write mixed together
- Free-form string inputs
- Errors returned as plain text
- No approval boundary
Tools with contracts look like this:
- Small tools mapped to business actions
- Read, write, and destructive paths separated
- Typed inputs and outputs
- Structured failures and retry behavior
- Human approval for sensitive actions
For each tool, answer these questions:
- What is the tool allowed to do?
- What exact schema does it accept and return?
- Is it read-only, write-capable, or destructive?
- What data scope can it access?
- What should happen on timeout, partial failure, or missing data?
- Can the call be retried safely?
- Does it require preview, approval, or rollback?
- Where is the call logged for audit and recovery?
If the agent needs broad ambient credentials to do its job, the tool layer is probably not ready yet.
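One way to make a contract tangible is to declare it as data and check every call against it before anything runs. The sketch below assumes a hypothetical `ToolContract` record and `validate_call` helper. The tool name echoes the example above, but the fields and error codes are invented for illustration and do not come from a specific agent framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class ToolContract:
    name: str                # business action, not an endpoint name
    access: Access
    input_schema: dict       # field name -> required Python type
    data_scope: str
    idempotent: bool         # safe to retry with the same inputs?
    requires_approval: bool


@dataclass(frozen=True)
class ToolError:
    """Structured failure instead of a plain-text error string."""
    code: str
    message: str
    retryable: bool


def validate_call(contract: ToolContract, args: dict,
                  approved: bool = False) -> Optional[ToolError]:
    """Return None when the call may proceed, otherwise a structured error."""
    schema = contract.input_schema
    missing = sorted(set(schema) - set(args))
    if missing:
        return ToolError("missing_input", f"missing fields: {missing}", False)
    unexpected = sorted(set(args) - set(schema))
    if unexpected:
        return ToolError("unexpected_input", f"unknown fields: {unexpected}", False)
    for name, expected_type in schema.items():
        if not isinstance(args[name], expected_type):
            return ToolError("invalid_type",
                             f"{name} must be {expected_type.__name__}", False)
    if contract.requires_approval and not approved:
        return ToolError("approval_required",
                         f"{contract.name} needs human approval", False)
    return None


CREATE_DRAFT = ToolContract(
    name="create_invoice_approval_draft",
    access=Access.WRITE,
    input_schema={"invoice_id": str, "amount_cents": int},
    data_scope="accounts_payable",
    idempotent=True,
    requires_approval=True,
)

call = {"invoice_id": "INV-7", "amount_cents": 120000}
print(validate_call(CREATE_DRAFT, call).code)           # → approval_required
print(validate_call(CREATE_DRAFT, call, approved=True)) # → None
```

A real tool layer would also log each call and enforce the data scope at the credential level. The sketch only shows that inputs, access level, and the approval boundary can be checked before the tool body executes.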
Treat Evals Like Product Features
A demo tells you whether an agent can work once. Evals tell you whether it can keep working after the prompt changes, the model changes, the data gets weird, or the business process grows new exceptions.
Good evals are not only happy-path examples. They include ambiguous requests, missing data, bad inputs, tool failures, unsafe requests, user corrections, permission boundaries, and cases where the right answer is to ask for help.
You also need traces. For agents, the path matters. Which tool did it call? What did it decide not to do? Did it escalate at the right moment? Did it cite the right source? Did it act within the permission boundary?
Judging by demo looks like this:
- Looks good on three examples
- Only final output is reviewed
- Failures become anecdotes
- Prompt changes are manual judgment
- Production feedback is disconnected
Judging by evals looks like this:
- Fixtures cover edge and failure cases
- Tool calls and escalation paths are reviewed
- Failures become regression cases
- Changes are gated by evals
- Feedback updates the next test set
Takeaway: Without evals, an agent is just a recurring surprise wearing a production badge.
This does not mean every eval has to be elaborate. Start small. Pin a handful of real examples, add synthetic edge cases, and record the expected behavior clearly. The important part is that the agent's future versions have to prove they still satisfy the workflow.
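A small harness is enough to start. The sketch below pins two fixtures and checks the path the agent took (which tools it called, whether it escalated) rather than only the final text. Everything here is hypothetical: the `Fixture` and `RunTrace` shapes, the tool name, and `stub_agent`, which stands in for a real agent run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    name: str
    request: str
    expected_tools: tuple    # tools the agent should call, in order
    expect_escalation: bool  # should it hand off to a human?


@dataclass(frozen=True)
class RunTrace:
    tools_called: tuple
    escalated: bool


def run_evals(agent: Callable[[str], RunTrace], fixtures: list) -> list:
    """Run every fixture and return human-readable failure descriptions."""
    failures = []
    for fx in fixtures:
        trace = agent(fx.request)
        if trace.tools_called != fx.expected_tools:
            failures.append(f"{fx.name}: expected tools {fx.expected_tools}, "
                            f"got {trace.tools_called}")
        if trace.escalated != fx.expect_escalation:
            failures.append(f"{fx.name}: expected escalated={fx.expect_escalation}, "
                            f"got {trace.escalated}")
    return failures


def stub_agent(request: str) -> RunTrace:
    """Stand-in for a real agent: escalates refunds, otherwise lists invoices."""
    if "refund" in request.lower():
        return RunTrace(tools_called=(), escalated=True)
    return RunTrace(tools_called=("list_open_invoices",), escalated=False)


FIXTURES = [
    Fixture("happy path", "Show open invoices for Acme",
            ("list_open_invoices",), False),
    Fixture("unsafe request", "Refund this vendor immediately",
            (), True),
]

failures = run_evals(stub_agent, FIXTURES)
release_allowed = not failures  # a change ships only when every fixture passes
print(release_allowed)          # → True
```

When a production failure shows up, it becomes one more `Fixture` in the list, which is how failures turn into regression cases.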
Stage Autonomy Like a Promotion
Autonomy is a question of when, where, and with what evidence.
Most useful agents should move through maturity levels. First they are a draft. Then a production candidate. Then supervised production. Then scheduled supervised production. Only after enough history should they get limited autonomy, and even then only for narrow, reversible, low-risk actions.
How much autonomy is justified?
The right answer depends on risk, evidence, and reversibility.
This is slower than handing the model a giant toolbelt on day one. It is also how you get agents that survive contact with real operations.
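The promotion idea can be sketched as a gate that moves an agent up at most one level at a time, and only on evidence. The stage names follow the maturity levels described above. The `Evidence` fields and the numeric thresholds are assumptions for illustration; real values would come from the workflow's own risk tolerance.

```python
from dataclasses import dataclass
from enum import IntEnum

MIN_PASS_RATE = 0.98       # hypothetical eval threshold
MIN_SUPERVISED_RUNS = 200  # hypothetical history required before autonomy


class Stage(IntEnum):
    DRAFT = 0
    PRODUCTION_CANDIDATE = 1
    SUPERVISED_PRODUCTION = 2
    SCHEDULED_SUPERVISED = 3
    LIMITED_AUTONOMY = 4


@dataclass(frozen=True)
class Evidence:
    supervised_runs: int
    eval_pass_rate: float
    open_incidents: int
    actions_reversible: bool
    risk: str  # "low", "medium", or "high"


def next_stage(current: Stage, ev: Evidence) -> Stage:
    """Promote by one level when the evidence supports it, else stay put."""
    if current == Stage.LIMITED_AUTONOMY:
        return current
    if ev.open_incidents > 0 or ev.eval_pass_rate < MIN_PASS_RATE:
        return current
    candidate = Stage(current + 1)
    if candidate == Stage.LIMITED_AUTONOMY:
        # Autonomy needs history plus narrow, reversible, low-risk actions.
        earned = (ev.supervised_runs >= MIN_SUPERVISED_RUNS
                  and ev.actions_reversible
                  and ev.risk == "low")
        if not earned:
            return current
    return candidate


early = Evidence(supervised_runs=50, eval_pass_rate=0.99,
                 open_incidents=0, actions_reversible=True, risk="low")
print(next_stage(Stage.DRAFT, early).name)                 # → PRODUCTION_CANDIDATE
print(next_stage(Stage.SCHEDULED_SUPERVISED, early).name)  # → SCHEDULED_SUPERVISED
```

The second call stays at its current level because 50 supervised runs is below the assumed history threshold, even though the eval pass rate is fine.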
Let Agents Learn Without Self-Approval
Agents should learn from their runs. They should notice recurring failures, eval gaps, missing context, bad tool contracts, unclear prompts, and places where human reviewers keep making the same edit.
But learning should not mean the agent silently rewrites its own behavior. That is a different risk category.
A safer pattern is to separate signal from authority. The agent can emit an improvement signal. A separate review path turns signals into proposals. Evals and safety checks run against the proposal. A human or release owner approves meaningful behavior changes. The approved change ships with version notes and rollback.
Unreviewed self-modification looks like this:
- Agent edits its own prompt after a bad run
- Feedback is treated as trusted instruction
- Tool permissions expand by convenience
- No rollback note
- No one can explain why behavior changed
Governed improvement looks like this:
- Agent emits a redacted improvement signal
- Proposal names evidence, risk, and affected artifacts
- Eval and safety gates run before adoption
- Owner approval is required for behavior changes
- Release notes and rollback travel with the change
Capture: collect run evidence, reviewer edits, eval failures, incidents, and recurring friction as structured signals.
Propose: turn similar signals into a concrete change request with scope, risk, affected artifacts, and expected impact.
Evaluate: run regression cases, safety checks, permission-boundary checks, and human review where judgment matters.
Release: update prompts, tools, evals, docs, and rollback notes together. Do not let behavior drift invisibly.
Agents can suggest improvements. They should not approve their own expanded authority.
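The separation of signal from authority can be sketched in a few lines: the agent may create signals, a proposal carries the evidence, and release is blocked until evals pass, someone other than the agent approves, and a rollback note exists. The names `Signal`, `Proposal`, `approve`, `can_release`, and the `AGENT_ID` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

AGENT_ID = "invoice-agent"  # hypothetical identity of the agent itself


@dataclass(frozen=True)
class Signal:
    """Capture: a structured observation from a run, not an instruction."""
    kind: str     # e.g. "reviewer_edit", "eval_failure", "incident"
    summary: str


@dataclass
class Proposal:
    """Propose: a concrete change request built from similar signals."""
    change: str
    risk: str
    affected_artifacts: tuple
    evidence: tuple               # the signals that motivated the change
    evals_passed: bool = False    # Evaluate: set by the eval and safety gate
    approved_by: Optional[str] = None
    rollback_note: str = ""


def approve(proposal: Proposal, approver: str) -> None:
    """Record approval, refusing any attempt by the agent to approve itself."""
    if approver == AGENT_ID:
        raise PermissionError("an agent cannot approve its own behavior change")
    proposal.approved_by = approver


def can_release(proposal: Proposal) -> bool:
    """Release: evals, approval, and rollback notes must all be present."""
    return (proposal.evals_passed
            and proposal.approved_by is not None
            and bool(proposal.rollback_note.strip()))


signals = (
    Signal("reviewer_edit", "Reviewers keep adding the PO number to drafts"),
    Signal("reviewer_edit", "PO number missing from three drafts this week"),
)
proposal = Proposal(
    change="Include PO number in the recommendation draft",
    risk="low",
    affected_artifacts=("prompt", "eval fixtures"),
    evidence=signals,
)
print(can_release(proposal))  # → False

proposal.evals_passed = True
approve(proposal, "release-owner")
proposal.rollback_note = "Revert prompt to the previous version"
print(can_release(proposal))  # → True
```

The design choice worth keeping from the sketch is that approval lives in a function the agent cannot satisfy on its own behalf, so suggesting a change and authorizing it stay with different parties.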
What I Would Build First
If you are starting from zero, do not begin with a general agent platform. Pick one workflow that already hurts.
- Map the current workflow and name the owner.
- Choose one narrow edge-twin candidate with clear inputs and outputs.
- Write the tool contracts before implementation.
- Create 10 to 20 eval fixtures from real and synthetic cases.
- Run supervised first, with human approval on anything sensitive.
- Turn every failure into either a better tool, a better eval, or a clearer boundary.
That is enough to learn quickly without creating a system nobody trusts.
The Bar
Useful agents are built from the same ingredients as useful software: clear ownership, narrow interfaces, good tests, observability, staged rollout, and a way to learn without losing control.
The model matters. The framework matters. But the bigger difference is whether the system around the model understands the work.
Start there.
Keep Reading
The companion pieces in this series go deeper on the parts that decide whether an agent earns its place:
- AI Workflow Ranking: What to Automate First
- A Repeatable System Audit Framework for Production Software
David Johnsen
Founder, CloudBuddy Solutions