What It Actually Costs to Build an AI System (And Why Estimates Are Usually Wrong)

You ask a vendor what it costs to build an AI system. They give you a range. You pick a number in the middle and build it into your budget. Six months later, you've spent 40% more than that number and you're still not fully in production.

This isn't unusual. It's actually the norm, and it's not always vendor dishonesty. A lot of it comes from a structural problem: the things that make AI projects expensive are not the parts that show up in a proposal. They're buried in data work, integration complexity, and infrastructure decisions that seem minor until they aren't.

This post breaks down the real cost drivers. Not to give you a price list - every project is different - but to give you the framework to ask better questions before you commit, and to recognise when an estimate is suspiciously optimistic.

The Part Everyone Underestimates: Data Preparation

Ask any experienced AI engineer what kills timelines and budgets, and they'll tell you the same thing: data.

Here's the thing most companies don't know going in. They think they have data. They do - in a spreadsheet here, a CRM there, an accounting system, some PDFs on a shared drive, transaction logs that haven't been touched in two years. The problem isn't volume. It's that this data was never collected with an AI system in mind. It has inconsistencies, gaps, duplicate records, and formatting that changes halfway through because someone switched systems in 2019.

Before a model can learn from your data, it has to be cleaned, standardised, and structured. Before that can happen, someone has to understand the data well enough to know what "clean" means in your context - which fields matter, what the outliers represent, what the missing values actually indicate.

In our experience, data preparation consumes anywhere from 20% to 50% of project time on ML and fine-tuning work. It sits at the lower end when a company has mature data infrastructure and dedicated data ownership, and at the higher end when data lives across five systems and nobody has a complete picture of it.

The vendors who don't surface this early either haven't dug into your data yet, or they're hiding the cost in later project phases. Ask any vendor, before you sign: what does your data assessment process look like, and what happens if we find the data isn't in the shape we expected?
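To make the idea of a data assessment concrete, here is a minimal sketch of a first-pass profile: it counts missing values per field, duplicate keys, and mixed date formats. The field names, the list of date formats, and the function names are all assumptions for illustration; a real assessment would go much further and would be shaped by what "clean" means for your data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical date formats that might coexist after a system switch.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%y")


def detect_date_format(value):
    """Return the first known format that parses the value, or None."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def profile_records(records, key_field, date_field):
    """Summarise missing values, duplicate keys, and date formats in use."""
    fields = sorted({name for record in records for name in record})
    # A value counts as missing if the field is absent, None, or empty.
    missing = {
        name: sum(1 for record in records if record.get(name) in (None, ""))
        for name in fields
    }
    key_counts = Counter(
        record[key_field] for record in records if record.get(key_field)
    )
    # Every occurrence of a key beyond the first is a duplicate.
    duplicate_keys = sum(count - 1 for count in key_counts.values())
    date_formats = Counter(
        detect_date_format(record[date_field])
        for record in records
        if record.get(date_field)
    )
    return {
        "rows": len(records),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
        "date_formats": dict(date_formats),
    }
```

A report showing two or more date formats, or a high share of missing values in a field the model depends on, is the kind of finding that should change the estimate before the project starts rather than halfway through it.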

Model Choice: The Difference Between a Line Item and a Monthly Bill

Not all AI systems are equally expensive to run. The choice of which model sits at the core of your system determines your per-call cost, your latency, and your dependency on third-party providers - and these decisions compound over time.

There are roughly three tiers of model choice, each with different economics.

Frontier API models - GPT-4, Claude, Gemini - are the easiest to build with. No infrastructure to manage, strong baseline performance, and new capabilities available without rebuilding. The cost is per token: every input and output costs money. For a low-volume internal tool, this is usually fine. For a high-volume customer-facing system processing thousands of calls or documents a day, those costs add up fast and need to be modelled carefully upfront.

Open-source and open-weight models running on your own infrastructure - Llama, Mistral, Phi - eliminate the per-token fee and keep your data off third-party servers. The trade-off is that you're paying for compute instead of API calls: cloud GPU instances, deployment infrastructure, maintenance. This becomes cost-effective at volume and gives you more control, but it requires more engineering to set up and run reliably.

Fine-tuned models sit between the two. You take an open-source base model, train it on your proprietary data, and deploy it yourself. Higher upfront build cost, lower ongoing cost, and often better performance on the specific task you trained it for than a general frontier model. For high-volume, task-specific systems - a customer service voice agent, a document classifier, an extraction pipeline - this is frequently the right call economically.

What to ask any vendor: what model are you building on, and have you modelled the monthly infrastructure or API costs at our expected volume? If they haven't, the estimate is incomplete.
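Modelling those costs does not need to be elaborate. The sketch below compares a per-token API bill with an always-on self-hosted deployment and finds the daily call volume at which the two cost the same. Every price, token count, and overhead figure you pass in is a placeholder you would replace with your provider's current rates and your own measurements, and the sketch assumes the self-hosted setup can absorb the volume without adding instances.

```python
def cost_per_call(input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Cost of one API call, given prices per million tokens."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


def api_monthly_cost(calls_per_day, input_tokens, output_tokens,
                     price_in_per_million, price_out_per_million, days=30):
    """Monthly spend on a per-token API at a given call volume."""
    per_call = cost_per_call(input_tokens, output_tokens,
                             price_in_per_million, price_out_per_million)
    return calls_per_day * days * per_call


def self_hosted_monthly_cost(gpu_hourly_rate, gpu_count, ops_overhead,
                             hours=730):
    """Monthly spend on always-on GPU instances plus a flat ops overhead."""
    return gpu_hourly_rate * gpu_count * hours + ops_overhead


def break_even_calls_per_day(self_hosted_cost, input_tokens, output_tokens,
                             price_in_per_million, price_out_per_million,
                             days=30):
    """Daily call volume at which the API bill equals the self-hosted bill."""
    per_call = cost_per_call(input_tokens, output_tokens,
                             price_in_per_million, price_out_per_million)
    return self_hosted_cost / (per_call * days)
```

With purely illustrative inputs (2,000 input and 500 output tokens per call, $3 and $15 per million tokens, one GPU at $2 an hour plus $1,500 a month of operations overhead), 200 calls a day costs about $81 a month on the API, and self-hosting only breaks even at roughly 7,300 calls a day. The numbers are invented; the shape of the comparison is the point.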

Integration Work: The Hidden Cost Inside Every Project

Standalone AI systems are rare. Almost every real deployment needs to connect to your existing stack - your CRM, your ERP, your database, your internal tools. That integration work is where estimates most reliably go wrong.

The pattern is predictable. The initial proposal accounts for the API call that connects the AI system to your CRM. It doesn't fully account for the fact that your CRM data is inconsistently formatted, that the API has rate limits you'll hit in production, that three fields the system depends on are populated differently across your regional offices, or that authentication requires a flow your CRM's documentation describes incorrectly.

Integration bugs are legitimate and common. They're not a sign of incompetence. But they have real time costs, and those costs land in the project whether or not the estimate saw them coming.
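Rate limits are a good example of work that rarely appears in a proposal but always appears in the code. A minimal sketch of the usual mitigation, retrying with exponential backoff and jitter, is below. The `RateLimitError` class and the `request` callable are hypothetical stand-ins for whatever your CRM client raises and exposes; the retry counts and delays are arbitrary defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the remote API rejects a call as rate-limited."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call request(), retrying on RateLimitError with exponential backoff.

    The delay doubles on each attempt, is capped at max_delay, and gets
    random jitter added so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The wrapper itself is small. The cost comes from discovering where it is needed, deciding what the system should do while it waits, and testing the behaviour against the real API's limits.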

The honest proxy question: how many systems does this project need to connect to, and how well-documented are those systems' APIs? Every external dependency is a risk surface. A workflow automation connecting two well-documented SaaS tools can take days to integrate. A custom ERP with 15-year-old documentation can take weeks.

Infrastructure: Cheap to Start, Expensive at Scale

Development environments are cheap. Production environments are not.

The AI system your vendor demos runs on minimal compute. Your production system - handling real users, real data volumes, needing uptime guarantees, requiring logging and monitoring and error handling - costs more. How much more depends on what you're building.

A lightweight RAG-powered chatbot on a managed cloud service might cost a few hundred dollars a month to run. An always-on AI voice system handling hundreds of concurrent calls needs real-time audio processing infrastructure, failover capacity, and active monitoring - and its monthly bill will be substantially higher.

This matters because vendors often scope the build cost but leave ongoing infrastructure costs implicit. You should ask, before you sign: what will this cost to run monthly at our expected volume in year one? And what does that number look like if usage doubles?
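A rough way to answer the "what if usage doubles" question is to model capacity in whole instances rather than as a smooth line. The sketch below assumes a simple setup: each instance handles a fixed number of concurrent sessions, you keep some spare instances for failover, and monitoring and logging are a flat monthly fee. All parameter names and figures are hypothetical.

```python
import math


def monthly_infra_cost(peak_concurrent_sessions, sessions_per_instance,
                       instance_monthly_cost, spare_instances=1,
                       fixed_monthly_cost=0.0):
    """Estimate monthly infrastructure cost from peak concurrency.

    Instances are bought in whole units, so cost rises in steps:
    enough instances to cover the peak, plus spares for failover,
    plus a flat fee for monitoring, logging, and similar services.
    """
    needed = math.ceil(peak_concurrent_sessions / sessions_per_instance)
    instances = needed + spare_instances
    return instances * instance_monthly_cost + fixed_monthly_cost


# Illustrative comparison: expected volume versus doubled volume.
expected = monthly_infra_cost(120, 50, 400.0, fixed_monthly_cost=300.0)
doubled = monthly_infra_cost(240, 50, 400.0, fixed_monthly_cost=300.0)
```

With these made-up inputs the estimate goes from 1,900 to 2,700 a month when usage doubles: the bill grows, but not by a factor of two, because the spare instance and fixed fees do not scale with traffic. Your own curve may be steeper or flatter, which is why it is worth asking for both numbers.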

For our own projects, we include an infrastructure cost estimate as part of the proposal. It shouldn't be an afterthought.

Human-in-the-Loop: Where Smart Systems Get More Expensive

Fully autonomous AI systems look great in demos. In production, almost every meaningful application has a category of output that needs human review before the system acts on it.

Designing that review layer has a cost. It means building a reviewer interface, creating a queue management system, defining escalation thresholds, and deciding how the system handles cases waiting for human approval. For systems like automated purchase orders, loan pre-screening, or customer service escalations, this is not optional - it's what makes the system trustworthy enough to deploy at all.
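The core of such a review layer can be sketched in a few lines: route each proposed action either to automatic execution or to a queue, based on escalation thresholds. The `Decision` and `ReviewRouter` names, the confidence threshold, and the amount limit below are assumptions for illustration; the reviewer interface, persistence, and audit trail around them are where most of the real effort goes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Decision:
    """A proposed action from the AI system (hypothetical shape)."""
    item_id: str
    action: str
    confidence: float
    amount: float = 0.0


@dataclass
class ReviewRouter:
    """Sends low-confidence or high-value decisions to a human queue."""
    min_confidence: float = 0.9      # below this, a human must review
    max_auto_amount: float = 1000.0  # above this, a human must review
    queue: deque = field(default_factory=deque)
    executed: list = field(default_factory=list)

    def submit(self, decision):
        """Execute the decision automatically or hold it for review."""
        if (decision.confidence >= self.min_confidence
                and decision.amount <= self.max_auto_amount):
            self.executed.append(decision)
            return "auto"
        self.queue.append(decision)
        return "review"

    def resolve_next(self, approved):
        """Apply a reviewer's verdict to the oldest queued decision."""
        decision = self.queue.popleft()
        if approved:
            self.executed.append(decision)
        return decision
```

Even in this toy form, the design questions are visible: who sets the thresholds, what happens to items that sit in the queue too long, and what the system tells the customer while a decision is pending.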

The trap is treating the human-in-the-loop layer as a late-stage refinement. It should be in the architecture from day one, because retrofitting it into a system that wasn't designed for it is significantly more expensive than building it in initially.

Ask your vendor: where does this system require human review, and is the interface for that review included in this proposal?

Ongoing Maintenance: The Cost That Outlives the Build

Most proposals are scoped around delivery. The ongoing cost is undersold, sometimes deliberately, because it's a separate conversation that happens later.

AI systems require maintenance in ways traditional software doesn't. Models drift as the real-world inputs they receive diverge from training data. Third-party APIs update. The underlying LLM you built on releases a new version that behaves slightly differently. The business process the system was built around changes.
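Drift is something you can measure rather than guess at. One common check is the population stability index (PSI), which compares the distribution of a feature in production against the distribution seen at training time. The sketch below is a plain-Python version under simple assumptions: equal-width bins taken from the training range, with out-of-range values clamped into the edge bins. It is one possible monitoring signal, not a complete evaluation process.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; higher means more drift.

    expected: values seen at training time
    actual:   values seen in production
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Clamp values outside the training range into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(count / len(values), eps) for count in counts]

    expected_share = proportions(expected)
    actual_share = proportions(actual)
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_share, actual_share))
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift worth investigating, though the right thresholds depend on the feature and the system. Running a check like this on a schedule is part of what a maintenance retainer pays for.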

For a well-built system with a clean architecture, maintenance is manageable: the honest picture is a monthly retainer covering monitoring, issue resolution, and periodic model evaluation. For a system built quickly with technical debt baked in, maintenance costs grow with the system's complexity.

The benchmark question: what does your standard post-launch support arrangement look like, and who is responsible for migrating the system if the model it depends on is deprecated?

What Different Project Types Actually Look Like

Rather than a price list - which would be misleading without knowing your systems and data - here's a rough sense of what different categories of work involve.

Workflow automation connecting several SaaS tools with standard APIs is the fastest and most predictable category. Data prep is minimal. Integration risk is low with well-documented platforms. These projects are scoped in weeks, not months, and ongoing maintenance is light.

AI voice systems involve more moving parts - telephony infrastructure, voice synthesis, speech recognition tuning, CRM integration, escalation design. They take longer and have more production variables than automation work, but the scope is well-understood and the cost drivers are visible upfront.

Custom ML systems - demand forecasting, classification pipelines, recommendation engines - are where data preparation costs dominate. If your data is clean and accessible, the model work is often the faster part. If it isn't, data pipeline work will be the bulk of the project.

LLM and RAG systems for internal knowledge bases or document intelligence are usually mid-range in build cost but carry ongoing inference costs that scale with usage. Fine-tuning work adds upfront cost and reduces long-term per-call cost.

Why Estimates Are Wrong

Estimates are usually wrong for one of three reasons.

The first is genuine uncertainty. Until someone has looked at your data and your systems, any number is a guess. Good vendors price in discovery time to validate assumptions before committing to a full scope.

The second is optimism bias. Vendors want the work. Estimates that surface every potential complication don't win proposals. The safest way to protect yourself is to ask specifically about data quality risk, integration complexity, and what triggers a scope change conversation.

The third is deliberate low-balling, with the expectation of change orders. This is the least common but the most costly. The signal is an estimate that doesn't ask hard questions about your data or existing systems before arriving at a number. Proposals that appear without a discovery process attached to them should be scrutinised carefully.

A trustworthy vendor will tell you what they don't know yet, and what it would take to find out. That's a feature, not a hedge.

If you want to understand what a system like yours would realistically cost - what the data situation looks like, where the integration risks are, and what you should budget for ongoing maintenance - get in touch. We scope projects honestly, which sometimes means telling you the number is higher than you hoped. That's a better conversation to have before you start than after.