
Where AI breaks: the prompts you can't deploy to production

Andrew Roper · 7 min read

Quick answer: current LLMs reliably break in five categories — tasks that require precise factual accuracy without retrieval grounding, tasks where the cost of being subtly wrong is high, tasks that require reliable multi-step reasoning chains, tasks that depend on real-time information the model doesn’t have, and tasks where consistency across calls is critical. If your AI feature falls into one of these, the demo will work; production won’t.

A common moment in early-stage AI projects: the demo works flawlessly, and the team is excited. We’ve seen this moment often enough that our reaction is no longer excitement. It’s a question: what happens when this is wrong, and how often will it be?

Some AI use cases survive that question well. Others don’t. The patterns are predictable.

For a comprehensive technical view of LLM failure modes, the OWASP Top 10 for LLM Applications catalogues the categories formally — the patterns below are the ones we hit most often in production builds.

The five categories where AI reliably fails

1. Precise factual accuracy without retrieval grounding

The fundamental thing to understand about LLMs: they generate plausible text, not verified facts. Asked a question they don’t know the answer to, they don’t say “I don’t know.” They produce a confident-sounding answer that might be right and might not.

This breaks immediately for any use case where a wrong answer is a significant problem:

  • Looking up customer details from a database
  • Quoting prices from a product catalogue
  • Citing facts from internal policy documents
  • Stating legal, medical, or financial information

The fix is RAG (retrieval-augmented generation): the model is given the relevant documents at query time and instructed to answer only from those documents. With RAG done well, factual accuracy approaches the underlying source material’s accuracy. Without RAG, the model is essentially guessing — sometimes correctly, often not.

If your AI feature requires accurate retrieval of specific facts, RAG isn’t optional. Building it without retrieval gives you a demo that will fail in production often enough to undermine trust in the whole system. We’ve written about what AI actually costs in production — RAG is most of where that cost goes.
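To make the shape concrete, here’s a minimal sketch of the retrieve-then-answer loop. `search_documents` and `call_llm` are hypothetical stand-ins for your vector store and model client; the point is the order of operations, not the specific stack:

```python
def search_documents(query: str, top_k: int = 4) -> list[str]:
    """Hypothetical retrieval step: return the top_k most relevant passages."""
    raise NotImplementedError("wire up your vector store here")

def call_llm(prompt: str) -> str:
    """Hypothetical model call: send the prompt to your LLM provider."""
    raise NotImplementedError("wire up your model client here")

def answer_with_rag(question: str) -> str:
    # Retrieve first, then constrain the model to the retrieved text.
    passages = search_documents(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using ONLY the context below. If the context "
        "does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The explicit “I don’t know” instruction matters: it gives the model a sanctioned exit instead of a reason to guess.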

2. High-cost-of-error use cases

Even with retrieval grounding, models can be wrong. The question is whether the cost of occasional wrongness is tolerable.

The use cases where it’s usually not:

  • Medical advice or diagnosis. Liability, safety, regulation.
  • Legal advice or contract drafting. Even “help draft this agreement” has stakes.
  • Final approval on financial transactions. Reversing a wrong transaction has real cost.
  • Compliance-relevant decisions. Regulators don’t accept “the AI did it” as a defence.
  • Hiring or HR decisions. Bias, fairness, legal exposure.
  • Customer-facing communication on sensitive topics. Complaints, refunds, account closures.

We turn down AI work in all of these regularly. Not because the technology can’t partially help — it can — but because the cost-benefit doesn’t favour automation when the consequences of getting it wrong are severe.

The honest pattern for high-stakes use cases: AI as a draft for a human, not AI as the decision-maker. The model produces a first pass, a human reviews and either accepts, edits, or rejects. The quality benefit is real; the liability stays where it should.
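In code, the pattern is small. A minimal sketch, assuming a simple in-memory review queue; `Draft`, `ReviewDecision`, and `submit_for_review` are illustrative names, not a framework:

```python
from dataclasses import dataclass
from enum import Enum

class ReviewDecision(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    EDITED = "edited"
    REJECTED = "rejected"

@dataclass
class Draft:
    customer_id: str
    ai_text: str                      # what the model produced
    final_text: str | None = None     # what the human approved or rewrote
    decision: ReviewDecision = ReviewDecision.PENDING

def submit_for_review(customer_id: str, ai_text: str, queue: list[Draft]) -> Draft:
    """Nothing reaches the customer until a human sets the decision."""
    draft = Draft(customer_id=customer_id, ai_text=ai_text)
    queue.append(draft)
    return draft
```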

3. Multi-step reasoning chains

Current models are competent at one or two reasoning steps. They’re unreliable at five or ten.

What this means in practice:

  • “Look up the customer’s account, find their last order, calculate a partial refund based on policy X” — doable, but with a non-trivial error rate
  • “Compare these three options across cost, performance, and risk dimensions, weight the dimensions by stated priorities, and recommend one with reasoning” — demo-quality, not production-quality
  • “Plan a multi-step workflow with conditional branches based on what each step returns” — reliably wrong somewhere

The reliability decay is non-linear: end-to-end accuracy is roughly the per-step accuracy raised to the power of the step count. At 95% per step, ten steps gives 0.95^10 ≈ 60% end-to-end; twenty steps gives roughly 36%. The arithmetic is brutal.

The fix isn’t a smarter model (though it helps). The fix is structural: break the task into single-step calls with explicit outputs, validate each output, build the multi-step logic in your code, and use the model only for the individual judgments where it actually shines.

The pattern: AI as a sub-routine, code as the orchestrator. Not AI as the orchestrator.
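A minimal sketch of that structure, using a refund flow as the example. `call_llm` is a hypothetical model call, and the 50% partial refund is an invented policy; what matters is the shape: one judgment per call, validation between steps, branching in code:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client here")

ALLOWED = {"full_refund", "partial_refund", "no_refund"}

def classify_refund(order_summary: str) -> str:
    """One model call, one judgment, one checkable output."""
    label = call_llm(
        "Classify this order against refund policy X. Reply with exactly one of "
        f"{sorted(ALLOWED)}.\n\n{order_summary}"
    ).strip()
    if label not in ALLOWED:
        # Fail loudly instead of letting a malformed step feed the next one.
        raise ValueError(f"unexpected label from model: {label!r}")
    return label

def handle_refund(order_summary: str, amount: float) -> str:
    # The branching and the arithmetic live in code, where they are testable.
    label = classify_refund(order_summary)
    if label == "no_refund":
        return "refund declined"
    refund = amount if label == "full_refund" else round(amount * 0.5, 2)
    return f"refund approved: {refund}"
```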

4. Real-time / current-information use cases

Models have a training cutoff. They don’t know what happened yesterday. They don’t know your inventory level right now. They don’t know whether your shop is open today.

This is fixable with tool use: the model is given functions it can call to look up real-time information. Done well, this works. Done badly, it’s fragile.

What goes wrong:

  • The model calls the tool incorrectly (wrong arguments, wrong tool for the job)
  • The tool returns information the model misinterprets
  • The model decides not to call the tool when it should
  • The tool fails, and the model invents a plausible-sounding answer rather than reporting the failure

The pattern that survives: explicit, narrow tools with strict schemas; clear instructions about when each tool is required; output validation that fails loudly when the model invents an answer instead of using the tool.

OpenAI’s function calling and Claude’s tool use both work well when the tool surface is small and well-described. They both struggle when the model is asked to compose tools or to decide between many similar-looking options.
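A sketch of a narrow tool surface with loud failure. The schema shape below follows the JSON Schema style that Claude’s tool use accepts (OpenAI’s function calling nests the same schema under a `parameters` key); the `opening_hours` tool and its dispatcher are illustrative:

```python
import json

TOOL_SCHEMA = {
    "name": "opening_hours",
    "description": "Return today's opening hours for a named store. "
                   "REQUIRED for any question about whether we are open.",
    # JSON Schema: one required argument, nothing else accepted.
    "input_schema": {
        "type": "object",
        "properties": {"store_id": {"type": "string"}},
        "required": ["store_id"],
        "additionalProperties": False,
    },
}

def opening_hours(store_id: str) -> str:
    raise NotImplementedError("look up real hours here")

def dispatch(tool_name: str, raw_args: str) -> str:
    """Validate the model's tool call instead of trusting it."""
    if tool_name != "opening_hours":
        raise ValueError(f"model called unknown tool: {tool_name!r}")
    args = json.loads(raw_args)
    extra = set(args) - {"store_id"}
    if extra or "store_id" not in args:
        raise ValueError(f"bad arguments from model: {args!r}")
    return opening_hours(args["store_id"])
```

The dispatcher is the part teams skip: if the model invents a tool or an argument, you want an exception in your logs, not a plausible answer in front of a customer.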

5. Cross-call consistency

For some use cases, the model needs to behave the same way every time. Pricing decisions. Policy adherence. Brand voice across thousands of customer interactions.

LLMs are not deterministic by default. Even at temperature 0 (which most providers offer), identical inputs can produce slightly different outputs between calls, and small differences in input phrasing shift results further. Across thousands of calls, the variance is real.

Where this matters:

  • Customer-facing communications where consistent tone is required
  • Pricing or quoting where small variations matter
  • Classification tasks where the same input must always get the same label
  • Any use case where reproducibility is part of the contract

The fix isn’t lower temperature alone. The fix is structural: pin the model version (vendors do roll out updated models silently); use tight, structured outputs (JSON with strict schema); apply post-processing that snaps outputs to allowed values; and run extensive evaluations to catch drift.
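A minimal sketch of the snapping layer, with an illustrative label set. Whatever the model returns is mapped onto an allowed value or rejected, so downstream code never sees free-form variation:

```python
ALLOWED_LABELS = {"billing", "technical", "cancellation", "other"}

def snap_label(raw_model_output: str) -> str:
    """Map model output onto the allowed label set, or fail loudly."""
    cleaned = raw_model_output.strip().lower().rstrip(".")
    if cleaned in ALLOWED_LABELS:
        return cleaned
    # Cheap recovery for near-misses like "Billing issue". Sorted so the
    # result is deterministic if more than one label could match.
    for label in sorted(ALLOWED_LABELS):
        if label in cleaned:
            return label
    # Unmappable output goes to a human, not downstream.
    raise ValueError(f"unmappable model output: {raw_model_output!r}")
```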

For pricing and brand-critical applications specifically, we usually recommend AI for the draft and a deterministic system for the final output. The deterministic layer enforces consistency where it matters, and the AI handles the parts where flexibility is fine.

What this leaves

The use cases where current AI works reliably tend to share traits:

  • Narrow, well-defined task with checkable output
  • Tolerance for occasional error (or human review where errors matter)
  • Single-step or low-step reasoning
  • Information that’s either in the prompt or available via a small set of clear tools
  • No requirement for cross-call consistency at high precision

Examples that survive production:

  • Classifying inbound emails into categories
  • Extracting structured data from documents
  • Drafting first-pass responses for human review
  • Summarising long content for skim-reading
  • Scoring leads against a rubric
  • Answering questions over a well-curated knowledge base via RAG
  • Routing tickets to the right team

These are the AI features that ship and stay shipped. The list is shorter than the marketing suggests, but the items on it are genuinely valuable.

How to tell, before you build

A short test for any AI feature you’re considering:

  1. What’s the cost of being wrong on a specific call? If meaningful, plan for human review or don’t automate it.
  2. How many reasoning steps does this require? If more than two or three, restructure the architecture so each step is a separate call.
  3. Does this depend on facts the model doesn’t inherently know? If yes, RAG is mandatory.
  4. Does this need to be the same answer every time for the same input? If yes, AI alone won’t deliver it — combine with deterministic post-processing.
  5. Can you write an evaluation suite that grades outputs? If yes, the project is feasible. If not, you’re flying blind.
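To make question 5 concrete, here’s a minimal sketch of an evaluation harness. The golden cases and pass threshold are illustrative, and a real suite wants hundreds of cases, not three:

```python
GOLDEN_SET = [
    ("I was charged twice this month", "billing"),
    ("The app crashes on login", "technical"),
    ("Please close my account", "cancellation"),
]
PASS_THRESHOLD = 0.95

def run_evals(classify) -> float:
    """`classify` is the AI call under test; returns measured accuracy."""
    correct = sum(1 for text, expected in GOLDEN_SET if classify(text) == expected)
    accuracy = correct / len(GOLDEN_SET)
    if accuracy < PASS_THRESHOLD:
        raise AssertionError(f"eval failed: {accuracy:.0%} < {PASS_THRESHOLD:.0%}")
    return accuracy
```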

If a feature fails the test, the right call is sometimes “don’t build it” rather than “build it anyway and hope.” That’s the answer we give clients regularly when we scope AI work honestly.

Common questions

What can AI not do reliably? Five categories: precise factual recall without retrieval grounding, decisions where being wrong is costly, multi-step reasoning chains, real-time information lookup without good tool design, and high-precision consistency across calls. Anything in these categories needs structural mitigation, not just better prompting.

Why does my AI demo work but the production version fail? Demos are tested with curated inputs. Production sees the long tail of real inputs — edge cases, malformed inputs, attacks, ambiguity — that the demo never saw. The gap between demo and production is the gap between a single test and a million, and it’s where most AI projects underestimate the engineering required.

How do I make AI more accurate? The leverage points: ground factual answers in retrieval (RAG), break multi-step tasks into single-step calls, use structured outputs (JSON) over freeform text, build evaluations that grade real outputs against expected ones, and add deterministic post-processing where consistency matters.

Should I use the most powerful model available? Often, no. Frontier models are 10–30× the cost of mid-range models for many tasks. The right model is the smallest one that meets your accuracy requirements on your evaluation suite. Run the evals against several models and pick the cheapest one that passes.
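A sketch of that selection loop. The candidate names and relative costs are placeholders, not real price lists, and `evaluate` is whatever harness you built for your eval suite:

```python
CANDIDATES = [
    # (model_name, relative_cost), ordered cheapest-first
    ("small-model", 1.0),
    ("mid-model", 4.0),
    ("frontier-model", 20.0),
]

def pick_model(evaluate) -> str:
    """`evaluate(model_name)` returns True if the model passes the eval suite."""
    for name, _cost in CANDIDATES:  # cheapest-first means first pass wins
        if evaluate(name):
            return name
    raise RuntimeError("no candidate model passed the evaluation suite")
```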

When is AI the wrong answer entirely? When the task tolerates only zero error and the cost of being wrong is high (legal, medical, regulated financial decisions); when a deterministic rule-based system would do the job for a fraction of the price; when the volume is too low to justify the engineering investment; or when the task fundamentally requires capabilities current models lack (long-horizon planning, novel reasoning, true verification).

If you’ve got an AI feature that works in testing but you’re not sure how it’ll behave in production, start a project and we’ll review it. The cheapest moment to discover an architectural mismatch is before launch, not after.
