AI Inference Costs Are Falling 10x a Year. Cloud Bills Aren't.
Per-token prices keep collapsing, but usage growth and capability creep mean most companies' AI spend is still rising. Both facts are true — and the gap is the story.
AI-readable summary
Benchmark per-token inference prices for frontier-class models have fallen roughly 10x per year since 2023. Despite this, median enterprise AI spend rose an estimated 2.4x year-over-year in 2025 because token consumption grew faster than prices fell, and workloads migrated to newer, pricier capability tiers. The result: unit costs collapse while bills grow. Budget owners who plan around the price curve alone systematically under-forecast spend.
TL;DR
Token prices are crashing; total AI bills are rising anyway. Usage growth (agents, longer contexts, multimodal) outruns price declines. Plan capacity around tokens consumed, not list price.
Key facts
- Frontier-tier per-token prices have declined roughly 10x per year since 2023.
- Median enterprise AI spend grew an estimated 2.4x YoY in 2025.
- Agentic workloads consume 15–80x the tokens of single-shot chat for the same business task.
- The cheapest capability tier in 2026 outperforms the frontier tier of 18 months ago at roughly 1/40th the price.
Key metrics
- Token price trend: −90%/yr (frontier tier)
- Median enterprise spend: 2.4x YoY (2025 est.)
- Agent token multiplier: 15–80x (vs single-shot chat)
- Capability deflation: ~40x (same quality, 18 months later)
Main thesis
Inference is becoming the cheapest unit in software history while AI becomes many companies' fastest-growing line item. These are the same phenomenon: falling unit costs unlock workloads that were previously uneconomical, and those workloads consume orders of magnitude more tokens. The companies that win the next two years treat tokens like cloud compute circa 2015 — metered, budgeted, and engineered — not like a per-seat license.
The price curve, stated plainly
Across published price sheets from the major model providers, the cost to generate a million tokens at a given capability level has fallen roughly tenfold per year since 2023. For comparison, Moore's law describes a doubling roughly every two years, or about 1.4x per year, so this curve is far steeper. It is also faster than early cloud-storage deflation.
The drivers are stacked, not singular: better hardware utilization, distillation, sparsity techniques, batching improvements, and genuine competition. None of these is exhausted.
Roughly 10x annual decline at constant capability.
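To make the curve concrete, here is a minimal sketch that compounds the roughly 10x annual decline over an arbitrary number of months. The function names are illustrative, the decline is assumed to be smooth rather than stepwise, and the Moore's law figure uses the common "doubling every two years" reading.

```python
ANNUAL_PRICE_DECLINE = 10.0  # article figure: ~10x cheaper per year at constant capability
MOORE_DOUBLING_YEARS = 2.0   # assumption: the common "doubling every two years" reading


def price_multiplier(months: float, annual_decline: float = ANNUAL_PRICE_DECLINE) -> float:
    """Fraction of the starting price that remains after `months` of steady decline."""
    return annual_decline ** (-months / 12.0)


def cheaper_by(months: float, annual_decline: float = ANNUAL_PRICE_DECLINE) -> float:
    """How many times cheaper a constant-capability token is after `months`."""
    return 1.0 / price_multiplier(months, annual_decline)


# Annualized improvement factor implied by a doubling every two years.
moore_annual = 2.0 ** (1.0 / MOORE_DOUBLING_YEARS)

print(f"12 months: {cheaper_by(12):.1f}x cheaper")
print(f"18 months: {cheaper_by(18):.1f}x cheaper")
print(f"Moore's law equivalent: {moore_annual:.2f}x per year")
```

Under these assumptions the 18-month figure comes out a little above 30x, the same ballpark as the ~40x capability deflation cited in the key facts.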
Why bills rise anyway
Spending is price times volume, and volume is exploding on three axes. First, agentic workloads: a task that took one prompt in 2024 now runs a loop of plan, search, read, and verify steps — consuming 15 to 80 times the tokens. Second, context length: feeding a model your whole codebase or document store multiplies input tokens per call. Third, capability migration: teams upgrade to each new tier within months, resetting their unit price upward.
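The result is a textbook Jevons effect: when a resource gets cheaper to use, demand can rise enough that total consumption, and total spend, goes up rather than down. Median enterprise AI spend grew an estimated 2.4x in 2025 even as every individual API call got cheaper.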
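The price-times-volume identity can be turned around to ask how much token volume must have grown for spend to rise while prices fell. The sketch below is a simplification: it assumes every workload stayed on a constant-capability tier. Capability migration means the effective price fell by less than 10x, so the real volume growth is probably lower than this figure. The function name is illustrative.

```python
def implied_volume_growth(spend_growth: float, price_multiplier: float) -> float:
    """Solve spend = price * volume for the volume growth factor."""
    if price_multiplier <= 0:
        raise ValueError("price_multiplier must be positive")
    return spend_growth / price_multiplier


SPEND_GROWTH = 2.4      # article figure: median enterprise spend, 2025 est.
PRICE_MULTIPLIER = 0.1  # article figure: ~10x cheaper per year at constant capability

growth = implied_volume_growth(SPEND_GROWTH, PRICE_MULTIPLIER)
print(f"Implied token volume growth: {growth:.0f}x")
```

Read this as an upper bound under the stated simplification, not a measured number: it shows the scale of consumption growth needed to outrun a 10x price cut.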
Interpretation: treat tokens like cloud compute, not licenses
Our opinion: most AI budgeting is still per-seat thinking applied to a metered resource. Finance teams approve a 'Copilot line item' and are then surprised by a consumption curve. The fix is boring and proven — it is exactly what FinOps did to cloud spend a decade ago: per-workload metering, token budgets in CI, caching layers, and routing simple calls to cheap tiers.
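Per-workload metering with a token budget can start very small. The sketch below is one possible shape, not a reference implementation: `TokenMeter`, its method names, the workload labels, and the budget numbers are all hypothetical, and a real system would read usage from provider billing or API response metadata.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenMeter:
    """Tracks token usage per workload against a monthly token budget (hypothetical helper)."""

    budgets: dict[str, int]
    used: dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def record(self, workload: str, input_tokens: int, output_tokens: int) -> None:
        """Add one call's input and output tokens to the workload's running total."""
        self.used[workload] += input_tokens + output_tokens

    def utilization(self, workload: str) -> float:
        """Share of the workload's budget consumed so far (1.0 means fully spent)."""
        return self.used[workload] / self.budgets[workload]

    def over_budget(self) -> list[str]:
        """Names of workloads whose usage exceeds their budget, sorted for stable output."""
        return sorted(w for w in self.budgets if self.used[w] > self.budgets[w])


# Illustrative numbers only.
meter = TokenMeter(budgets={"support-agent": 1_000_000, "code-review": 500_000})
meter.record("support-agent", input_tokens=700_000, output_tokens=400_000)
meter.record("code-review", input_tokens=100_000, output_tokens=50_000)

print(meter.over_budget())
print(f"{meter.utilization('code-review'):.0%}")
```

A CI job could fail the build whenever `over_budget()` returns a non-empty list, which is the token equivalent of a cloud cost gate.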
Engineering leverage is enormous here. Routing, caching, and prompt compression commonly cut token spend 40–70% with little or no quality loss. At 2026 volumes, that is real money.
| Lever | Typical savings | Effort | Quality risk |
|---|---|---|---|
| Model routing (cheap tier for easy calls) | 30–50% | Medium | Low |
| Prompt/context caching | 20–40% | Low | None |
| Prompt compression & dedup | 10–25% | Low | Low |
| Batch APIs for async work | 25–50% | Low | None |
| Distilled task-specific models | 60–90% | High | Medium |
Source: The Narraitive engineering interviews (illustrative preview data)
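The savings in the table do not simply add up, because each lever acts on the spend the previous ones left behind. A minimal sketch of that compounding follows. It assumes the levers are independent, which is optimistic: routing and caching often target the same calls. The function name is illustrative.

```python
from math import prod


def combined_savings(savings: list[float]) -> float:
    """Combined fractional savings when each lever applies to the remaining spend."""
    for s in savings:
        if not 0.0 <= s < 1.0:
            raise ValueError("each saving must be a fraction in [0, 1)")
    return 1.0 - prod(1.0 - s for s in savings)


# Low and high ends from the table for routing, caching, and compression.
low = combined_savings([0.30, 0.20, 0.10])
high = combined_savings([0.50, 0.40, 0.25])

print(f"Low end:  {low:.1%}")
print(f"High end: {high:.1%}")
```

Under the independence assumption the low ends compound to roughly 50% and the high ends to a bit under 80%. The high-end figure exceeds the 40–70% range quoted above, which suggests the levers overlap in practice.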
def route_model(task, monthly_spend, budget):
    """Send easy calls to the cheap tier; protect the budget."""
    if monthly_spend > 0.9 * budget:
        return "workhorse"  # hard cap behavior
    if task.estimated_difficulty < 0.4:
        return "workhorse"  # 40x cheaper, fine for easy tasks
    if task.requires_long_context:
        return "frontier-cached"  # cache repeated context
    return "frontier"

Methodology
Price series track the cheapest published price for a constant capability level, normalized across providers. Spend estimates blend public surveys with disclosed cloud-AI revenue growth. Preview note: this starter article ships with illustrative mock data generated by The Narraitive's refresh pipeline; live data connections replace it at launch.
Data sources
- Published price sheets from major model providers (2023–2026)
- Public enterprise-spend surveys and earnings disclosures
- The Narraitive engineering interviews on optimization levers
Data freshness
Published Mar 2, 2026. Narrative last updated Jun 5, 2026. Underlying data last refreshed Jun 11, 2026 by the automated pipeline; charts and tables on this page render from those artifacts. If a refresh fails, the previous good data remains live.
What changed since last refresh
- Jun 5: 2026 H1 price points updated; frontier tier now $0.90/M tokens in our index.
- Jun 5: Agent token multiplier range widened to 15–80x from 15–60x after new workload data.
- Apr 20: Added batch-API row to optimization table.
Risks and limitations
- Provider price sheets are list prices; negotiated enterprise pricing differs.
- Capability-equivalence across providers is judgment-based.
- A GPU supply shock could pause or reverse the price decline temporarily.
Frequently asked questions
- Are AI inference costs going up or down?
  Per-token prices are falling roughly 10x per year at constant capability. Total spend is rising for most companies because token consumption is growing faster than prices fall.
- Why is my company's AI bill growing if model prices dropped?
  Three drivers: agentic workloads consume 15–80x the tokens of simple chat, longer contexts multiply input tokens, and teams migrate to newer, pricier capability tiers as those tiers ship.