AI Automation
AI automations that actually ship — not demo-ware
Quick answer: the gap between an AI demo and a production AI system is mostly engineering — structured outputs instead of freeform text, retrieval grounding instead of model memory, confidence thresholds with human fallback, and an evaluation suite that catches regressions on every change. Without these, the demo works and production breaks. With them, the AI runs unattended.
Anyone can build an AI demo in 30 minutes. Building an AI automation that survives real customer data, edge cases, hallucinations, rate limits, and the moment someone asks the AI a question it has never seen — that takes engineering.
We’ve now shipped several of these systems using the OpenAI, Anthropic Claude, and Google Gemini APIs. Some patterns travel well across projects.
The four things that separate demos from production
1. Structured outputs over freeform text. If you can pin the output to a schema (JSON, function calling, structured generation) — do. Freeform text is how hallucinations sneak into production.
2. Retrieval grounding for factual claims. For anything where being wrong matters, ground the AI in real data via RAG (retrieval-augmented generation). Don’t ask the model to know — ask it to read.
3. Confidence thresholds with human fallback. Real automations need a path for low-confidence answers. Either escalate to a human, or fall back to a deterministic default. Never let the AI guess silently.
4. Evaluations, not vibes. Vibes-based testing (“seems good, ship it”) is how production AI breaks. Build an eval suite from real examples. Run it on every prompt change. Track regressions. Anthropic’s guide on evaluations is a good starting point if you’re new to this.
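The first, third, and fourth patterns above can be sketched together in a few lines. This is a minimal illustration, not a production implementation: the model call is stubbed out (a real system would make an LLM API request constrained to a JSON schema), and the category names, threshold, and eval cases are all hypothetical.

```python
import json

CONFIDENCE_THRESHOLD = 0.8  # assumption: tune this against your own eval data


def classify_ticket(text: str) -> dict:
    """Stubbed model call. In production this would be an LLM API request
    with a JSON-schema / function-calling constraint on the output."""
    if not text.strip():
        raw = '{"category": "unknown", "confidence": 0.1}'
    else:
        raw = '{"category": "billing", "confidence": 0.92}'
    return json.loads(raw)


def validate(output: dict) -> bool:
    """Pattern 1: pin the output to a schema instead of trusting freeform text."""
    return (
        output.get("category") in {"billing", "technical", "sales"}
        and isinstance(output.get("confidence"), (int, float))
        and 0.0 <= output["confidence"] <= 1.0
    )


def route(text: str) -> str:
    """Pattern 3: invalid or low-confidence output escalates to a human.
    The AI never guesses silently."""
    try:
        output = classify_ticket(text)
    except json.JSONDecodeError:
        return "human_review"
    if not validate(output) or output["confidence"] < CONFIDENCE_THRESHOLD:
        return "human_review"
    return output["category"]


# Pattern 4: an eval suite built from real examples, run on every prompt change.
EVAL_CASES = [
    ("My invoice is wrong", "billing"),
    ("", "human_review"),  # edge case: empty input must escalate
]


def run_evals() -> float:
    passed = sum(1 for text, expected in EVAL_CASES if route(text) == expected)
    return passed / len(EVAL_CASES)
```

The structure is the point, not the stub: schema validation and the confidence gate sit between the model and the business logic, and the eval loop turns “seems good” into a number you can track across prompt changes.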
What we never use AI for
We’ve turned down AI work where the risk profile doesn’t justify it:
- Medical or legal advice generation. Liability is real.
- Final approvals on financial transactions. Always keep a human in the loop.
- Anywhere the cost of being wrong is severe and the upside is marginal. Just because you can automate something doesn’t mean the math favours it.
For lower-risk, higher-value AI work — customer support routing, document classification, content drafting, lead qualification — we build the kind of production AI that runs unattended and shows up in your error budgets, not your incident reports.
The cost question
A common surprise: the inference cost of running an AI workflow well in production is usually not the constraint. The engineering cost — building evaluations, observability, guardrails, fallbacks — is what makes or breaks the unit economics.
Budget accordingly. And if you’re running AI on top of an existing platform like WordPress, WooCommerce, or GoHighLevel, the integration plumbing costs more than the AI itself most of the time.
If you’d like a realistic read on whether an AI automation is worth building for your specific use case — start a project. We’ll tell you honestly whether it’ll work, what it’ll cost to run, and what could go wrong.
Common questions
What’s the difference between an AI demo and an AI in production? A demo handles the curated test inputs you’ve prepared. Production handles the long tail of real inputs — edge cases, malformed data, prompt-injection attempts, ambiguity, and inputs the model has never seen. The engineering between the two — evaluations, structured outputs, guardrails, fallbacks, observability — is most of the work. We’ve covered where AI breaks in detail.
How much does production AI cost? The model API bill for most business automations sits between $50 and $2,000 per month. The full system around it — evaluations, observability, guardrails, fallbacks, and the engineering to build them — typically costs around 10× the inference bill. We’ve broken down the full AI cost picture elsewhere.
Should I use AI for customer-facing decisions? Depends on the cost of being wrong. For routing, classification, drafting, summarising — yes, with confidence thresholds and a path to human escalation. For final decisions on refunds, contracts, medical or legal questions, hiring — no; restrict AI to drafting, with human approval. The pattern that survives is keeping a human in the loop.
Which AI model should I use? Different models suit different tasks. Claude tends to lead on instruction-following and analytical work; GPT on tool use and ecosystem maturity; Gemini on cost at scale and multimodal. The right answer is the one whose strengths line up with your task — verified by running your evaluation suite against each. Compared in detail: Claude vs GPT vs Gemini.
How long does an AI build take? For a focused production-grade pilot covering one specific workflow: typically 4–10 weeks. Faster builds usually skip the eval suite, observability, or fallback path — and tend to work in demos but break in production. Building the engineering layer properly is what separates the systems that ship from the ones that get rebuilt six months later.
Is RAG always required for AI projects? Only when the AI needs to answer using your specific data — knowledge base, product catalogue, policy documents. For tasks like classification, drafting, extraction, summarisation that don’t require factual recall, RAG isn’t needed. We’ve explained RAG in plain language for owners deciding whether their use case needs it.
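The retrieval half of RAG can be sketched very simply. This toy version scores documents by keyword overlap with the question, then grounds the prompt in the top match; a real system would use embedding similarity and a vector store, and all document contents and names here are illustrative.

```python
# Illustrative knowledge base — in practice this would be your own
# policy documents, product catalogue, or help-centre articles.
DOCUMENTS = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping: orders ship within 2 business days from our warehouse.",
    "Support hours: live chat is available 9am to 5pm Monday to Friday.",
]


def retrieve(question: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Naive retrieval: rank documents by shared words with the question.
    Production systems use embeddings, but the shape is the same."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


def build_grounded_prompt(question: str) -> str:
    """Ask the model to read, not to know: answer only from retrieved context."""
    context = "\n".join(retrieve(question, DOCUMENTS))
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\n"
        f"Question: {question}"
    )
```

The instruction to answer only from the supplied context, and to admit when the answer isn’t there, is what turns retrieval into grounding — without it, the model falls back on its own memory and the hallucination risk returns.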
More reading
What AI actually costs to run in production
AI demos are cheap. Production is not. Where the money actually goes when you ship an AI feature, and how to size the engineering investment around the model.
Integrations
Why integrations break in production (and what to design for)
Every integration that "just calls an API" eventually breaks. The five places they fail first, and the design patterns that keep them running unattended.
Strategy
The hidden costs of SaaS once your business is established
The per-seat licence is the visible cost. Integration tax, lock-in, configuration drift, and the seat tax at scale are the SaaS costs no one quotes up front.