Escaping the AI cost trap
It’s been a while since I last wrote something. I’m taking two weeks off in the south of France with family and friends, and our dinner table debates have circled back again and again to AI — its complexity, its potential and the massive issues with how it’s funded.
AI is cheap for end-users right now, but only because venture capital is picking up the tab. The reality is that training and inference are still expensive, and someone has to pay. For now, that “someone” is investors. But prices won’t stay low forever — the economics simply don’t stack up.
Subsidies aren’t sustainable
Billions raised by OpenAI, Anthropic, Mistral and others are spent on compute, staff and Nvidia GPUs. Consumers see subsidised pricing, but the underlying costs are enormous. The cash flow looks something like this:
- VCs / Private investors pump billions into model providers.
- Model providers (OpenAI, Anthropic, Mistral) then spend heavily on cloud credits.
- Cloud providers (Microsoft, Amazon, Google) funnel that spend into their own infrastructure.
- That infra ultimately boils down to GPUs, data centres, power and staff.
- Nvidia books record quarters as the real winner.
It’s a perfect little trickle-down chain, except instead of economics lifting all boats, it’s lifting Nvidia’s share price. The resemblance to early ride-hailing (e.g. Uber) is uncanny: growth fuelled by cheap capital, not by healthy unit economics[1][2].
Different strategies: Howie vs Jamie
Some competitors to my product Jamie, like Howie, lean on human-in-the-loop review. According to a recent podcast, they have 60+ staff whose job is to manually check simulation outputs. It’s one way to guarantee quality, but it raises questions about scalability.
We’ve taken a different approach with Jamie. Today, we rely on Claude Sonnet 4 and GPT-5 in production. Sonnet is our primary model (though it’s frustratingly unreliable at times), and GPT-5 is our fallback. Both deliver strong results.
But we don’t outsource trust to humans. Instead, we run multiple models in parallel to validate outputs, rerunning until the response is reliable. When something still slips through, we trust our customers to correct it rather than maintaining a large manual review team.
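To make that concrete, here’s a minimal sketch of the pattern, assuming the official Anthropic and OpenAI Python SDKs. The model IDs, the agreement check and `validated_response` itself are illustrative, not our production code:

```python
import asyncio

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
openai_client = AsyncOpenAI()        # reads OPENAI_API_KEY from the environment

MAX_ATTEMPTS = 3  # rerun until the outputs converge, or give up

async def ask_sonnet(prompt: str) -> str:
    resp = await anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

async def ask_gpt(prompt: str) -> str:
    resp = await openai_client.chat.completions.create(
        model="gpt-5",  # illustrative model ID
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def outputs_agree(a: str, b: str) -> bool:
    # Toy check: a real validator would compare meaning, not normalised strings.
    return a.strip().lower() == b.strip().lower()

async def validated_response(prompt: str) -> str:
    answer = ""
    for _ in range(MAX_ATTEMPTS):
        # Query both models in parallel rather than serially.
        answer, second_opinion = await asyncio.gather(
            ask_sonnet(prompt), ask_gpt(prompt)
        )
        if outputs_agree(answer, second_opinion):
            return answer  # the models converged, so treat the response as reliable
    # Never converged: surface the primary answer and let the customer correct it.
    return answer
```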
We also capture metadata about every exchange. These fragments are later resurfaced and stitched into context to enrich responses. This means the assistant doesn’t just remember facts, but can reuse nuance — like timezone hints or offhand details — to build the best possible answers.
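A toy illustration of that idea; the storage and retrieval here are deliberately simplified, and in production this would be a per-customer database with relevance ranking, not a dict:

```python
import datetime

# Hypothetical in-memory store of metadata fragments per customer.
fragments: dict[str, list[str]] = {}

def capture(customer_id: str, note: str) -> None:
    """Record an offhand detail (timezone hint, preference, ...) for later reuse."""
    stamped = f"[{datetime.date.today().isoformat()}] {note}"
    fragments.setdefault(customer_id, []).append(stamped)

def stitch_context(customer_id: str, prompt: str, limit: int = 5) -> str:
    """Prepend the most recent fragments so the model can reuse nuance."""
    recent = fragments.get(customer_id, [])[-limit:]
    notes = "\n".join(f"- {f}" for f in recent)
    return f"Known details about this customer:\n{notes}\n\nRequest:\n{prompt}"

# capture("cust-42", "prefers meetings after 10:00 CET")
# stitch_context("cust-42", user_prompt) then goes to the model as the full prompt.
```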
Who’s really getting paid?
Whether you’re Howie, Jamie, or a model provider, the money flows the same way:
VC dollars → model providers → cloud providers → GPUs, data centres, and talent.
Nvidia dominates here, controlling up to 95% of the AI GPU market[3]. Every startup dollar eventually lands in their quarterly earnings.
Owning your baseline compute
Here’s the strategic play: own your baseline, burst to the cloud when you must.
- Baseline compute: Long-lived hardware in colocation or even offices. Amortised over 3–5 years, it provides a stable, predictable cost base. AI workloads often don’t need the fastest SSDs; GPU cycles are the limiting factor.
- Burstable compute: Cloud for spikes, experimentation, and elasticity.
This hybrid approach insulates you from rising cloud inference prices while keeping flexibility where it matters.
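In code, the routing policy is almost trivial. A sketch, where the capacity figure and backend labels are assumptions rather than measurements:

```python
BASELINE_CAPACITY = 40  # concurrent requests the owned fleet can absorb (illustrative)

def pick_backend(in_flight: int) -> str:
    """Send steady-state load to owned hardware; burst to cloud on spikes."""
    if in_flight < BASELINE_CAPACITY:
        return "local"  # amortised colocation/office boxes: flat, predictable cost
    return "cloud"      # elastic pay-per-token capacity, used only for the overflow
```

The interesting work is in sizing `BASELINE_CAPACITY` so the owned hardware runs near saturation; idle GPUs are the hybrid model’s main source of waste.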
Why we’re exploring hardware again
Open source has changed the equation. Meta’s LLaMA 3 benchmarks close to GPT-5, and we’ve tested LLaMA and Mistral directly against our Claude and GPT-5 outputs: they perform just as well[4][5]. That’s a clear way out.
On a Mac Studio (M2 Ultra, 192 GB RAM, 1 TB SSD) we can run LLaMA inference at practical speeds. Stack 20 of these in an office and you’ve got serious, sustained throughput at a fraction of the cost of cloud inference.
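One way to serve a quantised LLaMA build on that hardware is llama.cpp via its Python bindings, which are Metal-accelerated on Apple silicon. The model file and settings below are illustrative:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./llama-3-70b-instruct.Q4_K_M.gguf",  # any quantised GGUF build
    n_gpu_layers=-1,  # offload every layer to the M2 Ultra's GPU
    n_ctx=8192,       # context window; tune to the workload
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this thread in two lines."}],
)
print(out["choices"][0]["message"]["content"])
```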
Quick cost comparison: Mac Studio vs OpenAI
Mac Studio (M2 Ultra, 192 GB RAM, 1 TB SSD)
- Cost per unit (2025): ≈$7,000 USD (≈£5,400)
- 20 units: ≈$140,000 USD (≈£109,000)
- Lifespan: ~2–3 years (conservative for ML workloads)[6]
Equivalent OpenAI/Anthropic spend
- GPT-4 Turbo pricing (Sept 2025): ~$10 per million input tokens, ~$30 per million output tokens[7]
- Claude Sonnet 4 pricing: comparable to GPT-4 Turbo per token[8]
- A mid-scale product running ~500M input tokens and ~500M output tokens per month (very modest for assistants at scale) ≈ $20,000/month
- 2 years = $480,000
That’s roughly 3x the cost of buying and running your own hardware. And the gap widens with scale.
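The back-of-the-envelope maths, spelled out (hardware purchase price only; power and colocation would narrow the gap slightly):

```python
mac_fleet = 20 * 7_000                   # $140,000 up front

input_tokens = output_tokens = 500e6     # per month, as assumed above
api_monthly = (input_tokens / 1e6) * 10 + (output_tokens / 1e6) * 30  # $20,000
api_two_years = api_monthly * 24         # $480,000

print(api_two_years / mac_fleet)         # ~3.4x before running costs
```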
The probable outcome for us
Unless model efficiency improves radically, prices will rise once VC subsidies dry up. But we’ve already proven open source can match frontier outputs.
That’s why the probable outcome for us is simple:
- Move away from dependency on Claude and GPT-5
- Adopt LLaMA and self-trained/fine-tuned open models
- Anchor costs with baseline hardware, burst into cloud only when needed
The future of AI won’t just belong to those with the most capital. It’ll belong to those who build smart, efficient stacks.
1. Sequoia Capital, _Generative AI: A Creative New World_, 2022.
2. Financial Times, _Generative AI firms struggle with unsustainable costs_, 2024.
3. Reuters, _Nvidia’s GPU dominance under scrutiny as demand soars_, 2024.
4. Meta AI, _LLaMA 3 research publications_, 2024.
5. Hugging Face, _Open LLM Leaderboard_, 2025.
6. Apple, _Mac Studio (M2 Ultra) Pricing and Configurations_, 2023–2025.
7. OpenAI, _GPT-4 Turbo pricing_, 2025.
8. Anthropic, _Claude 3.5 and Claude Sonnet pricing_, 2025.