We shipped our first agent in March 2022. Since then, we've deployed over 4,200 agents across 340+ customers. Some have processed millions of transactions. Others failed spectacularly.
Here's what we've learned.
Start with the unhappy path
Every team starts by building for the success case. The agent works great when everything goes right. Then production happens.
The API times out. The customer provides malformed input. The third-party service returns unexpected data. The LLM hallucinates a field that doesn't exist.
We now start every project by mapping failure modes. What can go wrong at each step? How do we detect it? How do we recover? What does the human escalation path look like?
This feels slow at the start. It's still much faster than debugging production issues at 2 AM.
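To make that concrete, here's a minimal sketch of a single step once its failure modes have been mapped. Everything here is illustrative rather than our actual code: `fetch_invoice`, the retry count, and the specific checks stand in for whatever the step really does.

```python
import logging
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional

class Outcome(Enum):
    SUCCESS = auto()
    ESCALATE = auto()

@dataclass
class StepResult:
    outcome: Outcome
    data: Optional[dict] = None
    reason: Optional[str] = None

def run_step(fetch_invoice: Callable[..., dict], invoice_id: str, max_retries: int = 2) -> StepResult:
    """Wrap one agent step with the failure modes mapped up front."""
    for attempt in range(max_retries + 1):
        try:
            payload = fetch_invoice(invoice_id, timeout=10)  # the API can time out
        except TimeoutError:
            logging.warning("timeout on attempt %d for invoice %s", attempt, invoice_id)
            continue  # transient failure: retry
        # The third-party service can return unexpected data: validate before trusting it.
        if not isinstance(payload, dict) or "amount" not in payload:
            return StepResult(Outcome.ESCALATE, reason="malformed response")
        return StepResult(Outcome.SUCCESS, data=payload)
    # Retries exhausted: hand off to a human rather than guessing.
    return StepResult(Outcome.ESCALATE, reason="timed out after retries")
```

The point isn't the specific checks. It's that detection, recovery, and the escalation path are decided before the agent ships, not discovered in production.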
Confidence scores change everything
Early on, we built agents that just did things. Now, every action includes a confidence score. When confidence is high, the agent proceeds autonomously. When it's low, the agent flags for human review.
The threshold varies by use case. For a customer support agent, 85% might be fine. For a financial processing agent, we might require 95%+.
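Here's roughly what that gate looks like in code. The threshold numbers mirror the examples above, but `dispatch`, `flag_for_human_review`, and the use-case names are illustrative stand-ins, not a real API.

```python
from typing import Callable

# Illustrative thresholds only; the real numbers come from each use case's risk tolerance.
THRESHOLDS = {
    "customer_support": 0.85,
    "financial_processing": 0.95,
}

def flag_for_human_review(description: str, confidence: float) -> str:
    # In practice this would enqueue a review task; here it just reports.
    return f"needs review ({confidence:.0%}): {description}"

def dispatch(action: Callable[[], str], description: str, confidence: float, use_case: str) -> str:
    """Run the action autonomously only if confidence clears the use-case threshold."""
    if confidence >= THRESHOLDS[use_case]:
        return action()
    return flag_for_human_review(description, confidence)

# Same confidence, different outcome, because the stakes differ.
print(dispatch(lambda: "refund issued", "refund $40", 0.90, "customer_support"))       # proceeds
print(dispatch(lambda: "wire sent", "wire transfer", 0.90, "financial_processing"))    # flagged
```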
This simple pattern has reduced our error rates by 73% while maintaining high automation rates. The key insight: it's okay for agents to not know everything, as long as they know what they don't know.
The 'last mile' is 80% of the work
Getting an agent to work in a demo is maybe 20% of the effort. Getting it to work reliably in production with all the edge cases, integrations, and monitoring is the other 80%.
We've learned to set expectations accordingly. When a customer asks how long an agent will take, we think about the complexity and then double it. We're usually still optimistic.
Monitoring is not optional
You can't improve what you can't measure. Every agent we deploy includes comprehensive monitoring: latency, accuracy, confidence distributions, error rates, cost per action.
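As a rough sketch of the minimum we track per action (the class and field names are illustrative; a production agent would export these to a metrics backend rather than hold them in memory):

```python
import statistics
from collections import defaultdict

class AgentMetrics:
    """Minimal in-memory metrics for a single agent."""
    def __init__(self):
        self.records = defaultdict(list)

    def log_action(self, latency_ms: float, correct: bool, confidence: float, cost_usd: float):
        self.records["latency_ms"].append(latency_ms)
        self.records["correct"].append(correct)
        self.records["confidence"].append(confidence)
        self.records["cost_usd"].append(cost_usd)

    def weekly_summary(self) -> dict:
        n = len(self.records["correct"])
        return {
            "actions": n,
            "accuracy": sum(self.records["correct"]) / n,
            "error_rate": 1 - sum(self.records["correct"]) / n,
            "p50_latency_ms": statistics.median(self.records["latency_ms"]),
            "mean_confidence": statistics.fmean(self.records["confidence"]),
            "cost_per_action_usd": statistics.fmean(self.records["cost_usd"]),
        }

metrics = AgentMetrics()
metrics.log_action(latency_ms=420, correct=True, confidence=0.93, cost_usd=0.004)
metrics.log_action(latency_ms=1310, correct=False, confidence=0.62, cost_usd=0.006)
print(metrics.weekly_summary())
```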
We review these metrics weekly with customers. Patterns emerge. Maybe accuracy drops on Mondays (people submit messier data after weekends). Maybe certain input types consistently cause problems.
This isn't glamorous work, but it's what separates agents that improve over time from agents that slowly degrade.
The model is the easy part
Everyone obsesses over model selection. GPT-4 vs Claude vs Gemini. Fine-tuning vs prompting. Context windows and token limits.
In our experience, the model is maybe 10% of what makes an agent work. The other 90% is: data quality, integration reliability, error handling, monitoring, and iteration speed.
We've seen simple prompts with good data dramatically outperform sophisticated approaches with messy inputs. Get the foundations right first.
Build for change
Models improve. APIs change. Business requirements evolve. The agents we build today will need to be different in six months.
We design for this explicitly. Prompts are versioned and easily swappable. Integrations are abstracted behind clean interfaces. Monitoring tracks performance across versions.
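A simplified sketch of what that separation can look like; the prompt registry and the `TicketSource` protocol are illustrative, not a description of our actual stack:

```python
from typing import Protocol

# Prompts are versioned artifacts, not strings buried in code (names are illustrative).
PROMPTS = {
    "classify_ticket": {
        "v1": "Classify this support ticket into one of: billing, bug, feature request.",
        "v2": "Classify the ticket below. Respond with exactly one label: billing, bug, feature_request.",
    },
}
ACTIVE_VERSION = {"classify_ticket": "v2"}  # swap versions without touching call sites

def get_prompt(name: str) -> str:
    return PROMPTS[name][ACTIVE_VERSION[name]]

class TicketSource(Protocol):
    """Integrations hide behind an interface so a vendor swap doesn't ripple through the agent."""
    def fetch_open_tickets(self) -> list[dict]: ...
```

Changing `ACTIVE_VERSION` swaps behavior in one place, and monitoring can tag each action with the prompt version that produced it, which is what lets us compare performance across versions.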
The best agents aren't the most sophisticated—they're the most adaptable.