Building AI agents for production: the four walls
I build AI agents that run in production: a server diagnostics agent with SSH access, a conversation QA evaluator, and a signup verification system that collects threat intelligence. They handle real data every day.
The longer I do this, the more I notice a pattern. The problems have nothing to do with model capability. The models are great and getting better every quarter. My problems are all in the space between the model and the real world. And I do not think they go away with GPT-6 or Gemini 4 or whatever comes next.
Confident and wrong at the same time
My diagnostics agent can look at server logs, check PHP processes, correlate metrics, and produce a root cause analysis that reads like it was written by a senior engineer. Most of the time it is right. The problem is the times it is wrong, because it is wrong with the same confidence.
When a human engineer is unsure, you can see it. They hedge. They say “I think it might be” or “let me check one more thing.” The agent does not do this naturally. It produces a clean, structured diagnosis with a recommended action. If that action is “restart MySQL” and the actual problem is a network partition, you now have a database restart on top of a network issue.
I built self-check mechanisms and confidence scoring to catch this. They help. But here is what really bothers me. When the agent writes its own verification (is my diagnosis correct?), it uses the same reasoning process that produced the diagnosis. It checks its own work using the same assumptions it started with. The verification inherits the blind spots of the original analysis.
I do not have a clean solution for this. What seems to work best is defining hard boundaries up front. Not “check if you are right” but “here are the things that must be true before you take action.” The server must be reachable. The metric must exceed this threshold. The log must contain this specific pattern. Concrete invariants that the agent cannot reason its way around. It is less elegant than having the agent self-evaluate, but it catches the cases where self-evaluation fails.
Consistency dies fast at scale
I have five agents that touch the same support system. A knowledge agent, a diagnostics agent, an ops agent, a billing agent, and an escalation agent. Each one works well in isolation. The challenge is keeping them coherent as a system.
The diagnostics agent develops a way of structuring its findings. The escalation agent develops its own format for handoff notes. The knowledge agent cites articles one way, the ops agent logs actions a different way. Each approach makes sense for that agent. None of them were designed to work together. After a few iterations of improving each agent independently, you end up with a system where the pieces fit but the seams are visible.
With human teams this happens too, but slowly enough that someone usually notices. With agents you can iterate fast and the inconsistency accumulates just as fast. I have seen it in my own system. I improve the diagnostics agent’s output format, then realize the escalation agent is still expecting the old format, and the customer gets a garbled handoff.
The thing that actually helps is having a single source of truth for how agents communicate. Pydantic schemas, shared output formats, explicit contracts between agents. Not guidelines. Contracts. The agent does not get to invent its own format. It fills in the schema or it fails. This is less flexible but it keeps the system from drifting apart as you iterate.
Knowing why things are the way they are (institutional memory)
I have been in hosting support for seven years. I know why our SSH tool uses a jailed user instead of root access. I know why the escalation path goes to live chat tool and not email. I know why certain server operations require confirmation even when the diagnosis is high-confidence. Every one of those decisions has a story behind it. Usually involving something that went wrong.
An agent does not know any of this. It sees the current code and the current configuration. It does not know that we switched from Jira to Slack for escalations because response times were 4x faster. It does not know that we added the confirmation step for cache clearing after an incident where aggressive cache purging caused a traffic spike that took down a customer’s website.
When an agent suggests “we could skip the confirmation for cache clearing since it is a low-risk operation,” it is making a reasonable suggestion based on incomplete information. The operation is low-risk in isolation. It is not low-risk for the customer who had their server go down last time we cleared their cache without warning.
The only fix I have found is making the context explicit and available. Decision logs. Incident summaries indexed and searchable. Not just documentation that sits in a wiki, but context that is actively loaded into the agent’s working memory when it makes decisions in the relevant area. It is a lot of work to set up and maintain. But without it, agents keep rediscovering lessons the team already learned the hard way.
Someone still has to decide what matters
Agents are good at the “how.” They are getting better at it every month. The “what” and the “why” are a different story.
I can ask an agent to optimize a diagnostic workflow. It will do a thorough job. But should we optimize the diagnostic workflow or invest that effort in better knowledge base coverage? Should we prioritize speed or accuracy? If we can only fix one bottleneck this sprint, which one matters more to the business?
These are not technical questions. They depend on what you are trying to build, who your customers are, what you are willing to trade off, and what keeps you competitive. The agent can lay out options. It can estimate effort and impact. But picking the direction requires caring about the outcome in a way that agents do not.
I think the human role eventually comes down to three things. Deciding what to build. Deciding what tradeoffs to accept. And having enough taste to know whether the result is good enough. Everything else is execution, and agents are getting very good at execution. But those three decisions are what make the product yours.
Where this leaves us
None of these problems get solved by better models. A smarter model does not prevent architectural drift across a multi-agent system. A bigger context window does not recover institutional knowledge that was never documented. A more capable agent does not develop opinions about what matters.
These are engineering problems. Systems problems. They need structure, practices, and intentional design. The companies that get agents right will probably not be the ones with the most capable models. They will be the ones that figured out how to keep the agents consistent, grounded in context, and pointed in a direction that someone with skin in the game actually chose.
That is less exciting than “agents will build everything.” But it is closer to how it actually works when you try to ship.