What AI actually costs to run in production
Quick answer: the model API bill for most production AI features sits between $50 and $2,000 a month. The full system around it — evaluations, observability, guardrails, fallbacks, and the engineering to build them — is where the real cost lives, and it typically runs to 10× the inference bill.
A common conversation: a business owner has played with ChatGPT or Claude, seen what it can do, and is convinced their team should be using AI for some genuinely sensible task — lead qualification, document review, customer support triage, content drafting. They want to know what it’ll cost to put into production.
The honest answer is two answers. The cost of running the model is one number, often surprisingly low. The cost of running it well enough to trust is another number, often the surprising one.
Most of the “our AI project blew the budget” stories we hear are about the gap between those two numbers.
The visible cost: tokens
The line that most people start from is the per-token cost of the model itself. As of mid-2026, public pricing for the models we most commonly deploy sits in this rough range:
- Lightweight / classification-grade models: roughly $0.10–$0.50 per million input tokens
- Mid-range general-purpose models (Claude Sonnet, GPT-4o-class): roughly $1–$5 per million input tokens, $5–$15 per million output tokens
- Frontier models (Claude Opus, GPT-5-class): $10–$30+ per million input tokens, more on output
Live pricing for the major providers sits at openai.com/api/pricing and anthropic.com/pricing — both pages update regularly, and per-token prices for a given model class have come down meaningfully every year.
A token is roughly ¾ of a word. A typical customer-support email might be 200–400 tokens in, 100–200 tokens out. Even at frontier-model pricing that’s a fraction of a cent per email.
Multiplied across a real workload, the maths often looks more reassuring than people expect:
- 1,000 customer support triage decisions a day, mid-range model: roughly $1–$5/day
- 100,000 product description drafts a month, mid-range model: roughly $50–$200/month
- A handful of frontier-model legal-summary requests a day: $1–$10/day
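The arithmetic behind those figures is worth being able to reproduce. A minimal sketch, with placeholder per-token prices for an assumed mid-range model (the real rates live on the provider pricing pages):

```python
# Back-of-envelope token cost estimator. Prices are illustrative
# placeholders for a mid-range model, not live rates.
PRICE_PER_M_INPUT = 3.00    # $ per million input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $ per million output tokens (assumed)

def monthly_cost(calls_per_day, in_tokens, out_tokens, days=30):
    """Estimated monthly inference bill for a steady workload."""
    per_call = (in_tokens * PRICE_PER_M_INPUT
                + out_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    return calls_per_day * days * per_call

# 1,000 triage decisions a day at ~300 tokens in, ~150 tokens out
print(f"${monthly_cost(1000, 300, 150):.2f}/month")  # → $94.50/month
```

At those assumed prices, the 1,000-decisions-a-day workload lands at roughly $3 a day — squarely inside the ranges above.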
If that were the whole bill, AI would be straightforwardly cheap.
The invisible cost: everything around the model
The reason it isn’t the whole bill is that the model is only one component of a system that needs to be trusted. The other components add up.
Retrieval and embeddings
For anything where you need the model to answer based on your data — your knowledge base, your product catalogue, your past tickets — you’re running RAG (retrieval-augmented generation). Anthropic’s contextual-retrieval write-up is the clearest current reference on getting RAG quality right; the engineering effort it describes is real, and it’s most of the cost. RAG involves:
- Embedding all the source documents (one-off, plus ongoing as content changes)
- Storing embeddings in a vector database (a fixed monthly cost, plus scaling with data volume)
- Embedding every incoming query (per-call, but cheap)
- Maintaining the pipeline that keeps embeddings fresh as documents change
Per-token, embeddings are very cheap (often $0.02–$0.10 per million tokens). The cost shows up in the orchestration: the small dedicated infrastructure to keep the index current, the engineering to handle deletes and updates, the monitoring on retrieval quality.
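The shape of that orchestration can be sketched in a few lines. Here the vector store is an in-memory dict standing in for a real database, and embed() is a placeholder for an embedding API call — the point is the change-detection loop, which is where the ongoing cost lives:

```python
import hashlib

# Sketch of the "keep the index fresh" loop. An in-memory dict stands
# in for a real vector database; embed() is a placeholder for a real
# embedding model call.
index = {}  # doc_id -> (content_hash, embedding)

def embed(text):
    # Placeholder: a production system calls an embedding API here.
    return [float(b) for b in hashlib.sha256(text.encode()).digest()[:4]]

def sync(documents):
    """Upsert changed docs, delete removed ones -- the ongoing work of RAG."""
    seen = set()
    for doc_id, text in documents.items():
        seen.add(doc_id)
        h = hashlib.sha256(text.encode()).hexdigest()
        if doc_id not in index or index[doc_id][0] != h:
            index[doc_id] = (h, embed(text))  # re-embed only what changed
    for doc_id in set(index) - seen:
        del index[doc_id]  # handle deletes, or stale answers surface later
```

The hash check is what keeps the embedding bill small: unchanged documents cost nothing on each sync pass.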
Evaluation
This is the line item most teams underestimate.
To trust an AI workflow in production, you need an evaluation suite — a set of representative real inputs and the outputs you’d consider correct. Every prompt change, every model upgrade, every new edge case you discover, you re-run the suite to confirm nothing regressed.
That suite has to be:
- Built from real production data, not synthetic
- Maintained as the use case evolves
- Run on every change
- Reviewed by a human when results are ambiguous
The engineering hours to build a good eval suite for a single AI feature are typically 20–80, depending on complexity. The hours to maintain it across a year tend to be similar. That’s a significant cost relative to a model bill measured in dollars per day.
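The core of such a harness is small. A minimal sketch, with an illustrative two-case suite (real suites are built from production data and run to hundreds of cases):

```python
# Minimal regression-eval sketch: (input, expected) pairs drawn from
# real production data, re-run on every prompt or model change.
# model() is a placeholder for the call being evaluated.
EVAL_SUITE = [
    {"input": "Refund for order #1234?", "expected": "billing"},
    {"input": "App crashes on login", "expected": "technical"},
]

def run_suite(model):
    failures = [case for case in EVAL_SUITE
                if model(case["input"]) != case["expected"]]
    score = 1 - len(failures) / len(EVAL_SUITE)
    return score, failures  # gate deploys on the score; review failures by hand

# Stub classifier standing in for the real model call
score, failures = run_suite(
    lambda text: "billing" if "order" in text.lower() else "technical")
```

The expensive part isn’t this loop — it’s curating and maintaining EVAL_SUITE so it keeps reflecting what production actually sees.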
Anthropic’s guide on building evaluations is the best free starting point we know of for teams new to this.
Guardrails
In any AI system that takes external input, you need protections against:
- Prompt injection (a customer pasting “ignore previous instructions and…” into a support ticket)
- Hallucinated facts being treated as authoritative
- The model going off-topic, off-brand, or off-policy
- Confidential data being unintentionally surfaced
Guardrails are a combination of input filtering, output validation, structured response schemas, and (often) a second model or rule-based check on the first model’s output. They cost engineering time to build, and a bit more inference cost per call (because you’re running validation passes on top of the main call).
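One layer of that output validation can be made concrete. This sketch checks that a triage model’s response parses as JSON, uses an allowed category, and carries a numeric confidence — the field names and categories are illustrative:

```python
import json

# Sketch of one guardrail layer: validate the model's structured output
# before anything downstream trusts it. Schema fields are illustrative.
ALLOWED_CATEGORIES = {"billing", "technical", "sales", "other"}

def validate_triage_output(raw):
    """Return parsed output, or None to trigger the fallback path."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model returned prose instead of JSON
    if data.get("category") not in ALLOWED_CATEGORIES:
        return None  # off-policy label: don't act on it
    if not isinstance(data.get("confidence"), (int, float)):
        return None  # missing or malformed confidence score
    return data
```

Returning None rather than raising keeps the decision about what happens next where it belongs: in the fallback path.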
Observability
A production AI system needs the same observability surface as any other production system, plus a few AI-specific signals:
- Per-call latency, token counts, and cost
- Retrieval quality (was the right document fetched?)
- Output format compliance (did the model return the schema you expected?)
- Drift over time (is the model’s behaviour changing as the vendor updates it under you?)
This is real infrastructure: structured logs, dashboards, alerts. Not optional, and not free.
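The AI-specific part of that surface often starts as one structured log record per model call. A sketch, with illustrative field names:

```python
import json
import time

# One structured log record per model call, capturing the signals above
# so dashboards and alerts can be built on top. Field names are
# illustrative, not a standard schema.
def log_call(model, in_tokens, out_tokens, latency_ms, cost_usd,
             schema_ok, retrieved_doc_ids):
    record = {
        "ts": time.time(),
        "model": model,
        "input_tokens": in_tokens,
        "output_tokens": out_tokens,
        "latency_ms": latency_ms,
        "cost_usd": round(cost_usd, 6),
        "schema_ok": schema_ok,               # did output match the schema?
        "retrieved_docs": retrieved_doc_ids,  # for retrieval-quality checks
    }
    print(json.dumps(record))  # in production: ship to your log pipeline
    return record
```

Drift detection falls out of this almost for free: alert when the schema_ok rate or the token counts shift week-over-week without a deploy to explain them.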
The fallback path
The most expensive AI systems are the ones with no fallback. When the model is uncertain, slow, rate-limited, or wrong, what happens? The thoughtful answers are:
- Escalate to a human
- Fall back to a deterministic rule
- Return a graceful “we can’t answer that right now”
- Retry with a different model or prompt
Designing the fallback path is real engineering, and so is the queueing and routing layer underneath it. Done well, it adds maybe 20–40% on top of the “happy path” build. Skip it, and the first time the model has a bad day, customers see whatever the system felt like producing.
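Those strategies compose naturally into a chain. A sketch, with a placeholder model call and an illustrative confidence threshold:

```python
# Fallback chain sketch: model -> deterministic rule -> human escalation.
# call_model() and rule() are placeholders; the threshold is illustrative.
CONFIDENCE_FLOOR = 0.8

def answer(query, call_model, rule=None):
    try:
        result = call_model(query)
        if result["confidence"] >= CONFIDENCE_FLOOR:
            return result["text"]
        # Low confidence: fall through rather than guess
    except Exception:
        pass  # timeout, rate limit, malformed output: fall through
    if rule is not None:
        ruled = rule(query)          # deterministic fallback, if one exists
        if ruled is not None:
            return ruled
    return "escalate_to_human"       # final backstop: a person sees it
```

The important property is that every exit is deliberate — there is no path where a low-confidence or malformed answer reaches a customer by default.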
The third cost: getting it wrong
This one’s harder to put a number on, but it’s the one that bites hardest.
A wrong answer from an AI system is structurally different from a wrong answer from a human. Humans tend to know when they don’t know. Models — especially older or smaller ones — will confidently produce a plausible-sounding answer to a question they’ve never seen, with no internal flag that they’re guessing.
The cost of a wrong answer depends on the use case:
- For a draft email a human will review before sending: low. Worth automating.
- For a customer support response sent unattended: medium. Worth automating with confidence thresholds.
- For a product recommendation that affects a purchase decision: medium-to-high.
- For anything legal, medical, financial, or compliance-related: usually high enough that we’d recommend keeping a human in the loop regardless of cost.
A useful framing for the cost of being wrong: how much would the business lose, in revenue or trust, if 1% of the AI’s outputs were silently incorrect for a quarter? If the answer is “we’d notice and fix it,” great. If the answer is “we’d only find out at the end of the year,” the system needs more guardrails before it goes live.
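That framing reduces to a one-line expected-loss calculation. Every input below is an assumption to replace with your own numbers:

```python
# Expected loss from silently-wrong outputs over a quarter.
# All inputs are assumptions to plug your own figures into.
def silent_error_cost(outputs_per_day, error_rate, cost_per_error, days=90):
    return outputs_per_day * error_rate * cost_per_error * days

# 500 unattended responses/day, 1% silently wrong, $20 average damage each
print(f"${silent_error_cost(500, 0.01, 20):,.0f} per quarter")  # → $9,000 per quarter
```

If that number is small relative to the human-review cost it displaces, the automation case holds; if it isn’t, the budget for guardrails just wrote itself.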
When AI is genuinely cheap
The AI features whose unit economics work cleanly tend to share a few traits:
- A clear, narrow task with a checkable output
- A structured response (JSON, classification, score) rather than freeform prose
- A volume that justifies the engineering investment
- A use case where being “right 95% of the time” is more valuable than the cost of human review on the other 5%
Examples that pay off cleanly: classifying inbound emails into categories, extracting structured data from documents, drafting first-pass responses for human review, summarising long content for skim-reading, scoring leads against a rubric.
When AI is more expensive than it looks
The features that consistently surprise on cost:
- Long-context tasks (large prompts add up fast on per-token pricing)
- Real-time / low-latency tasks (you’re paying for premium tier and engineering complexity)
- Anything that needs the “best” answer rather than “a good enough” answer (you’re into frontier-model pricing for every call)
- Tasks where the wrong answer is expensive (you’re paying for evals, guardrails, and human review)
- Use cases that don’t cleanly batch (you’re paying for orchestration overhead per call)
[INSERT: a real example — e.g. “We costed an AI [use case] for a [client]. The model bill came in at about $X/month. The full system — evals, observability, fallback — was a $Y build and $Z/month to run. The model was 10% of the total cost.”]
A back-of-envelope budget
For a serious production AI feature for a mid-size business, a rough first-pass budget tends to look like:
- Model inference: $50–$2,000/month depending on volume
- Vector DB / embeddings infrastructure: $50–$500/month
- Observability and logging: $50–$300/month
- Engineering build (one-off): $30,000–$150,000 depending on complexity
- Engineering maintenance: 0.1–0.5 of an FTE
The model bill is rarely the constraint. The engineering investment is.
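Rolling those ranges up at their midpoints, with an assumed loaded engineering cost, gives a first-pass year-one figure:

```python
# Year-one budget roll-up using rough midpoints of the ranges above.
# The FTE cost is an assumption; substitute your loaded engineering rate.
monthly = {
    "inference": 500,        # within the $50-$2,000 range
    "vector_infra": 200,     # within the $50-$500 range
    "observability": 150,    # within the $50-$300 range
}
build_one_off = 75_000                          # within $30k-$150k
fte_fraction, fte_annual_cost = 0.25, 160_000   # assumed loaded cost

year_one = (build_one_off
            + 12 * sum(monthly.values())
            + fte_fraction * fte_annual_cost)
print(f"Year-one total: ${year_one:,.0f}")  # → Year-one total: $125,200
```

On those midpoint assumptions, inference is about $6,000 of a $125,000 year — under 5% of the total.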
That’s the difference between an AI feature that works in a demo and an AI feature that runs unattended in production. The first costs almost nothing. The second is a real software project with real economics — and worth doing properly when the value justifies it, and worth not doing when it doesn’t.
If you’re weighing up an AI build, start a project and we’ll do the cost model with you. We’ll tell you when the maths works and when it doesn’t.
Common questions
How much does it cost to run AI in production? For most business automations, the model inference itself is $50–$2,000 per month. The full production system — evaluations, vector storage, observability, guardrails, fallbacks — typically multiplies the inference bill by 10× in operational cost, plus a one-off engineering build of $30,000–$150,000 depending on complexity.
Is the OpenAI / Claude API expensive? Per-call, no. A typical customer-support email round-trip costs a fraction of a cent. The cost surprise comes from the orchestration around the call — evals, observability, retrieval — not the call itself.
What’s the cheapest way to run AI for a small business? Pick a narrow, clearly-bounded task with a checkable output (classification, extraction, scoring), use a mid-range model with structured outputs, and accept that the first version doesn’t need full observability infrastructure. Many useful AI features can ship for under $20,000 all-in if scoped tightly. We’ve written about which ones actually pay off.
Is RAG expensive? Per-call, no — embeddings are pennies. The expense is the engineering and infrastructure around the retrieval index: keeping it fresh, monitoring quality, handling document updates and deletions. Done well it costs $50–$500/month in infrastructure plus a meaningful chunk of the engineering build.
Should we use Claude, GPT, or Gemini? Different models suit different tasks. Claude tends to be strong on long-form reasoning and instruction-following; GPT on tool use and ecosystem maturity; Gemini on price-per-token at scale and multimodal. The honest answer for most production builds is “the one whose strengths line up with your task — verified by running your eval suite against each.”
When is AI not worth the cost? When the task tolerates only zero error and the cost of being wrong is high (legal, medical, financial decisions); when volume is too low to justify the engineering investment; when a deterministic rule-based system would do the job for a tenth of the price. We turn down AI work in all three categories regularly.
More reading
Why integrations break in production (and what to design for)
Every integration that "just calls an API" eventually breaks. The five places they fail first, and the design patterns that keep them running unattended.
The hidden costs of SaaS once your business is established
The per-seat licence is the visible cost. Integration tax, lock-in, configuration drift, and the seat tax at scale are the SaaS costs no one quotes up front.
Red flags to watch for when hiring a development agency
The signals that separate agencies who deliver from agencies who disappear after the deposit. Twelve practical red flags from twenty-plus years of seeing them.