The AI agent hype cycle peaked somewhere around late 2024. Since then, the organisations that got past the demo stage have learned a hard lesson: an agent that works in a proof-of-concept environment almost never works in production without a significant redesign.
The failure patterns are consistent enough that we’ve built our entire agent development process around avoiding them. Here’s the framework.
What actually goes wrong with AI agents
Before designing a solution, it helps to understand why agent deployments fail.
Agents hallucinate on edge cases. Language models are trained to produce plausible, fluent output. When an agent encounters an input it hasn’t been designed for, it will often generate a confident, coherent, but wrong response rather than escalating or asking for clarification. In a research task, that might be a minor annoyance. In a workflow that triggers financial transactions or customer communications, it’s a serious problem.
The environment is messier than the demo. Demos use clean, well-formatted inputs. Production environments have duplicated records, inconsistent naming conventions, missing fields, and edge cases the demo never considered. Agents built around clean inputs break on messy real-world data.
Latency compounds. An agent that chains multiple LLM calls in sequence inherits the latency of every step. A task that runs three sequential Claude API calls might take 15 to 30 seconds under normal conditions. Add retry logic for rate limits and the occasional slow response, and users experience multi-minute delays that erode trust quickly.
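To make the compounding concrete, here’s a minimal TypeScript sketch (the `callClaude` helper is a hypothetical stand-in for a single API request): chained calls inherit the sum of their latencies, while steps that don’t depend on each other can run concurrently.

```typescript
// Hypothetical stand-in for one Claude API request (~7s simulated).
async function callClaude(prompt: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 7_000));
  return `response to: ${prompt}`;
}

// Three chained calls: each step waits on the last, so ~21s total.
async function sequentialPipeline(): Promise<string> {
  const outline = await callClaude("outline the task");
  const draft = await callClaude(`draft from: ${outline}`);
  return callClaude(`verify: ${draft}`);
}

// Independent steps can run concurrently instead: ~7s total.
// Only valid when no call needs another call's output.
async function concurrentSteps(): Promise<string[]> {
  return Promise.all(["source A", "source B", "source C"].map(callClaude));
}
```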
No visibility into what the agent is doing. When an agent runs silently and produces a result, nobody knows why it made the decisions it made. When something goes wrong, and eventually something will, there’s no audit trail to diagnose the failure.
The design principles we use
Every agent we build follows the same set of principles, regardless of the use case.
Narrow the scope aggressively. The most reliable agents do one thing well. A research agent that searches a defined set of sources, extracts specific types of information, and returns a structured result is more reliable than a general-purpose research agent that tries to figure out what you want. Narrow scope means predictable behavior.
Treat the LLM as a reasoning layer, not a routing layer. We use Claude (or another model where appropriate) for tasks that require judgment: extracting meaning from unstructured text, categorising content, generating summaries, evaluating whether a piece of information meets criteria. We use deterministic code — in n8n or elsewhere — for routing, sequencing, data transformation, and everything else that doesn’t require reasoning.
This distinction matters enormously. Routing logic that runs through a language model is non-deterministic and slow. Routing logic in code is fast, testable, and consistent.
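As a sketch of that split (names and rules are illustrative): deterministic code handles the routing, and the model is invoked only for the one step that genuinely needs judgment.

```typescript
type Ticket = { body: string };

// Routing lives in deterministic code: fast, testable, consistent.
function route(ticket: Ticket): "billing" | "support" | "needs-judgment" {
  if (/invoice|refund|charge/i.test(ticket.body)) return "billing";
  if (/password|login|error/i.test(ticket.body)) return "support";
  return "needs-judgment"; // only ambiguous cases reach the model
}

// The reasoning layer: a hypothetical wrapper around a Claude call,
// used only where the rules cannot decide.
async function classifyWithClaude(body: string): Promise<"billing" | "support"> {
  // e.g. client.messages.create({ ... }) with a classification prompt
  return "support";
}

async function handle(ticket: Ticket) {
  const destination = route(ticket);
  return destination === "needs-judgment"
    ? classifyWithClaude(ticket.body)
    : destination;
}
```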
Build for failure. Every step that can fail should have explicit handling for what happens when it does. Rate limits, API timeouts, unexpected response formats, missing fields in source data — each one needs a defined path, whether that’s a retry, a fallback, or an escalation to a human.
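Here’s a sketch of what that looks like in practice, with hypothetical `isRateLimit` and `notifyHuman` helpers: retries with backoff for transient errors, then a fallback, then escalation to a human.

```typescript
async function withFailurePaths<T>(
  step: () => Promise<T>,
  fallback: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await step();
    } catch (err) {
      if (isRateLimit(err) && attempt < maxRetries) {
        // Exponential backoff: 1s, 2s, 4s, ...
        await new Promise((r) => setTimeout(r, 1_000 * 2 ** attempt));
        continue;
      }
      break; // non-retryable error, or retries exhausted
    }
  }
  try {
    return await fallback(); // the defined fallback path
  } catch {
    await notifyHuman("primary step and fallback both failed"); // escalation
    throw new Error("escalated to a human operator");
  }
}

// Hypothetical helpers; wire these to your actual error types and channels.
function isRateLimit(err: unknown): boolean {
  return err instanceof Error && /429|rate.?limit/i.test(err.message);
}

async function notifyHuman(reason: string): Promise<void> {
  console.error(`ESCALATION: ${reason}`); // e.g. post to Slack instead
}
```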
Log everything. Every agent run should produce a structured log that includes what inputs were received, what decisions were made, what external services were called, what outputs were generated, and how long each step took. Without this, debugging becomes guesswork.
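One way to structure that record, as a sketch (field names and example values are illustrative, not a fixed schema):

```typescript
interface AgentRunLog {
  runId: string;
  startedAt: string; // ISO 8601 timestamp
  inputs: Record<string, unknown>;
  decisions: { step: string; choice: string; reason: string }[];
  externalCalls: { service: string; durationMs: number; ok: boolean }[];
  outputs: Record<string, unknown>;
  totalDurationMs: number;
}

const runLog: AgentRunLog = {
  runId: crypto.randomUUID(),
  startedAt: new Date().toISOString(),
  inputs: { source: "careers-page", competitor: "example-co" },
  decisions: [{ step: "classify", choice: "engineering", reason: "title keywords" }],
  externalCalls: [{ service: "claude", durationMs: 2140, ok: true }],
  outputs: { newPostings: 3 },
  totalDurationMs: 9850,
};

// One JSON line per run; ship it to whatever log store you already use.
console.log(JSON.stringify(runLog));
```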
Put a human in the loop for high-stakes decisions. An agent can do 90% of the work on a high-stakes task and still require human sign-off before the final action. This is not a failure of automation — it’s appropriate design. A contract review agent that extracts key terms, flags risks, and drafts a summary for a lawyer to approve is genuinely useful even though a human makes the final call.
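A minimal sketch of that gate, with `requestApproval` standing in for whatever approval channel you use (a Slack message, an email, an n8n Wait node):

```typescript
interface Draft {
  summary: string;
  flaggedRisks: string[];
  proposedAction: string;
}

// Stub: replace with a real approval channel that blocks until a decision.
async function requestApproval(draft: Draft): Promise<boolean> {
  console.log(`Review requested:\n${draft.summary}`);
  return false; // default to "not approved" until a human says otherwise
}

async function executeAction(action: string): Promise<void> {
  console.log(`Executing: ${action}`);
}

// The agent does the bulk of the work; the final action waits for sign-off.
async function finalize(draft: Draft): Promise<void> {
  if (await requestApproval(draft)) {
    await executeAction(draft.proposedAction);
  } else {
    console.log("Not approved; routing back for revision.");
  }
}
```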
A real example: competitive intelligence monitoring
One of our clients wanted to track competitor pricing, product announcements, and hiring patterns across a set of 15 competitors. Manually, this took a team member roughly 6 hours per week: searching company websites, monitoring LinkedIn, checking review platforms, and compiling a summary report.
We built an n8n-orchestrated agent system that runs on a Monday morning schedule. The system:
- Fetches the careers pages of all 15 competitors using structured scraping
- Checks for new job postings and categorises them by department
- Fetches and stores recent social media content from specified accounts
- Runs the collected content through Claude with a targeted extraction prompt to identify mentions of pricing changes, new features, or significant announcements (sketched in code after this list)
- Formats the results into a structured briefing document
- Delivers the briefing via email and posts a summary to the client’s Slack channel
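Here’s a sketch of that extraction step using the official TypeScript SDK (@anthropic-ai/sdk); the prompt wording and model ID are illustrative, not the client’s actual configuration.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function extractSignals(collectedContent: string) {
  const msg = await client.messages.create({
    model: "claude-3-5-haiku-latest", // illustrative model ID
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content:
          "From the competitor content below, extract any pricing changes, " +
          "new features, or significant announcements. Respond with JSON: " +
          '{"signals":[{"type":"...","detail":"..."}]}. ' +
          'If there are none, respond with {"signals":[]}.\n\n' +
          collectedContent,
      },
    ],
  });

  // The response content is a list of blocks; the first is text here.
  const block = msg.content[0];
  return block.type === "text" ? JSON.parse(block.text) : { signals: [] };
}
```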
The entire process runs in about 4 minutes. The team member who previously spent 6 hours on this task now spends 15 minutes reviewing the briefing and adding context. Annual time savings: roughly 280 hours (about 5.75 hours a week over a working year). Build cost: $2,200.
The model selection question
One of the most common mistakes we see is using the most capable — and most expensive — model for every task in an agent workflow.
Claude Opus is exceptional at nuanced reasoning, complex analysis, and creative tasks. It costs roughly $15 per million input tokens. Claude Haiku handles structured extraction, classification, and straightforward summarisation at $0.25 per million input tokens — 60x cheaper for tasks where the difference in output quality is negligible.
A well-designed agent workflow routes tasks to the appropriate model based on complexity. Simple extraction tasks use Haiku. Tasks that require judgment, synthesis, or nuanced interpretation use Sonnet or Opus. This model routing approach typically reduces LLM API costs by 70 to 85% compared to using a single high-capability model throughout.
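In code, that routing can be as simple as a lookup from task type to model tier (the task names and model IDs here are illustrative):

```typescript
type TaskKind = "extract" | "classify" | "summarise" | "analyse" | "synthesise";

function pickModel(kind: TaskKind): string {
  switch (kind) {
    case "extract":
    case "classify":
      return "claude-3-5-haiku-latest"; // cheap, fast, structured work
    case "summarise":
      return "claude-3-5-sonnet-latest"; // mid-tier judgment
    case "analyse":
    case "synthesise":
      return "claude-3-opus-latest"; // nuanced reasoning only where it pays
  }
}
```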
We track per-task token consumption for every agent we build so clients can see exactly what each agent run costs and where the budget goes.
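A sketch of that tracking: every Messages API response carries `usage.input_tokens` and `usage.output_tokens`, so each call can be logged with its cost. The prices below are the per-million-input-token figures quoted above; output tokens are priced separately and omitted here for brevity.

```typescript
// Per-million-input-token prices by tier, per the figures quoted above.
// Check current pricing for the exact model versions you run.
const PRICE_PER_M_INPUT: Record<string, number> = {
  opus: 15.0,
  haiku: 0.25,
};

interface Usage {
  input_tokens: number;
  output_tokens: number;
}

function recordTaskCost(task: string, tier: "opus" | "haiku", usage: Usage): void {
  const inputCostUsd = (usage.input_tokens / 1_000_000) * PRICE_PER_M_INPUT[tier];
  console.log(JSON.stringify({ task, tier, ...usage, inputCostUsd }));
}

// e.g. after a call: recordTaskCost("extract-pricing", "haiku", msg.usage);
```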
What production-ready actually means
An agent is production-ready when it can run unsupervised through the full range of inputs it will realistically encounter and produce outputs your team can trust. That bar is higher than most demos suggest.
Getting there typically takes two to three weeks of testing against real data: stress-testing with edge cases, measuring failure rates, tuning prompts for consistency, and validating that the escalation logic works correctly. We don’t hand off an agent to a client until we’ve run it against at least 200 real examples and the failure rate is below a defined threshold.
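A minimal harness in that spirit (the `runAgent` stub and the pass/fail check stand in for the real agent and a task-appropriate comparison; the 2% default threshold is illustrative):

```typescript
interface Example {
  input: string;
  expected: string;
}

// Placeholder: invoke the real agent here.
async function runAgent(input: string): Promise<string> {
  return input;
}

async function evaluate(examples: Example[], threshold = 0.02): Promise<boolean> {
  let failures = 0;
  for (const ex of examples) {
    const output = await runAgent(ex.input);
    // Swap in a task-appropriate check: exact match, schema validation,
    // a rubric, or a human-labelled comparison.
    if (output.trim() !== ex.expected.trim()) failures++;
  }
  const rate = failures / examples.length;
  console.log(`${failures}/${examples.length} failed (${(rate * 100).toFixed(1)}%)`);
  return rate <= threshold; // ship only below the defined threshold
}
```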
The agents that earn trust are the ones that know what they don’t know. A well-designed agent that escalates uncertain cases to a human is worth far more than one that confidently handles everything — including the cases it should have flagged.
Building something with AI agents? Learn about our autonomous agents service or book an Automation Discovery Call to discuss what it would take to build something that actually works.