The Narraitive

AI Inference Costs Are Falling 10x a Year. Cloud Bills Aren't.

Per-token prices keep collapsing, but usage growth and capability creep mean most companies' AI spend is still rising. Both facts are true — and the gap is the story.

Published Mar 2, 2026 · Updated Jun 5, 2026 · Data refreshed Jun 11, 2026 · 3 min read
Tags: inference, LLM pricing, cloud spend, unit economics

AI-readable summary

Benchmark per-token inference prices for frontier-class models have fallen roughly 10x per year since 2023. Despite this, median enterprise AI spend rose an estimated 2.4x year-over-year in 2025 because token consumption grew faster than prices fell, and workloads migrated to newer, pricier capability tiers. The result: unit costs collapse while bills grow. Budget owners who plan around the price curve alone systematically under-forecast spend.

TL;DR

Token prices are crashing; total AI bills are rising anyway. Usage growth (agents, longer contexts, multimodal) outruns price declines. Plan capacity around tokens consumed, not list price.

Key facts

  • Frontier-tier per-token prices have declined roughly 10x per year since 2023.
  • Median enterprise AI spend grew an estimated 2.4x YoY in 2025.
  • Agentic workloads consume 15–80x the tokens of single-shot chat for the same business task.
  • The cheapest capability tier in 2026 outperforms the frontier tier of 18 months ago at roughly 1/40th the price.

Key metrics

  • Token price trend: −90%/yr (frontier tier)
  • Median enterprise spend: 2.4x YoY (2025 est.)
  • Agent token multiplier: 15–80x (vs single-shot chat)
  • Capability deflation: ~40x (same quality, 18 months later)

Main thesis

Inference is becoming the cheapest unit in software history while AI becomes many companies' fastest-growing line item. These are the same phenomenon: falling unit costs unlock workloads that were previously uneconomical, and those workloads consume orders of magnitude more tokens. The companies that win the next two years treat tokens like cloud compute circa 2015 — metered, budgeted, and engineered — not like a per-seat license.

The price curve, stated plainly

Across published price sheets from the major model providers, the cost to generate a million tokens at a given capability level has fallen roughly tenfold per year since 2023. That is far faster than Moore's law, which implies roughly a doubling every two years, and faster than early cloud-storage deflation.

The drivers are stacked, not singular: better hardware utilization, distillation, sparsity techniques, batching improvements, and genuine competition. None of these is exhausted.

[Chart: Price per million tokens, frontier-equivalent capability. Y-axis: $ / 1M tokens (approximately log scale). Series: Frontier tier, Workhorse tier.]
Source: The Narraitive compilation of published provider price sheets (illustrative preview data)

Roughly 10x annual decline at constant capability.
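To put the different rates on a common footing, here is a minimal sketch that annualizes a price decline. The function names are ours, and Moore's law is taken as a doubling every 24 months, which is one common statement of it.

```python
def annualized_factor(total_factor: float, months: float) -> float:
    """Convert a decline of `total_factor`x over `months` into a per-year factor."""
    return total_factor ** (12.0 / months)


def monthly_decline(annual_factor: float) -> float:
    """Fraction by which the price drops each month, given a per-year factor."""
    return 1.0 - annual_factor ** (-1.0 / 12.0)


# The article's capability-deflation figure: ~40x cheaper over 18 months.
capability_deflation = annualized_factor(40, 18)

# Assumption: Moore's law as a doubling every 24 months.
moores_law = annualized_factor(2, 24)

print(f"40x over 18 months ~ {capability_deflation:.1f}x per year")
print(f"10x per year       ~ {monthly_decline(10):.1%} cheaper each month")
print(f"Moore's law        ~ {moores_law:.2f}x per year")
```

Under these assumptions the 40x-in-18-months figure works out to about 11.7x per year, in line with the roughly 10x headline rate, while Moore's law corresponds to about 1.41x per year.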

Why bills rise anyway

Spending is price times volume, and volume is exploding on three axes. First, agentic workloads: a task that took one prompt in 2024 now runs a loop of plan, search, read, and verify steps — consuming 15 to 80 times the tokens. Second, context length: feeding a model your whole codebase or document store multiplies input tokens per call. Third, capability migration: teams upgrade to each new tier within months, resetting their unit price upward.

The result is a textbook Jevons effect. Median enterprise AI spend grew an estimated 2.4x in 2025 even as every individual API call got cheaper.
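Because spend is price times volume, the two headline figures imply a rough volume growth number. The sketch below works through that arithmetic. The 3x "effective" decline in the second call is purely an illustrative assumption, standing in for teams that migrate to pricier tiers and so do not capture the full 10x.

```python
def implied_volume_growth(spend_growth: float, price_decline: float) -> float:
    """Spend = price x volume, so volume growth = spend growth / price growth.

    `price_decline` is the factor by which unit price fell (10 means 10x cheaper).
    """
    price_growth = 1.0 / price_decline
    return spend_growth / price_growth


SPEND_GROWTH = 2.4  # the article's estimate for median enterprise spend, 2025

# Upper bound: the effective price paid fell the full 10x (no tier upgrades).
full = implied_volume_growth(SPEND_GROWTH, 10)

# Illustrative assumption: tier upgrades reduce the effective decline to 3x.
partial = implied_volume_growth(SPEND_GROWTH, 3)

print(f"Full 10x price decline: volume up ~{full:.1f}x")
print(f"Effective 3x decline:   volume up ~{partial:.1f}x")
```

Even in the conservative case, token volume has to grow several-fold in a single year for the bill to rise 2.4x while prices fall.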

[Chart: Indexed token price vs median enterprise AI spend (2024 = 100). Y-axis: index. Series: Price per token, Median enterprise spend.]
Source: The Narraitive estimates from survey and disclosure data (illustrative preview data)

Interpretation: treat tokens like cloud compute, not licenses

Our opinion: most AI budgeting is still per-seat thinking applied to a metered resource. Finance teams approve a 'Copilot line item' and are then surprised by a consumption curve. The fix is boring and proven — it is exactly what FinOps did to cloud spend a decade ago: per-workload metering, token budgets in CI, caching layers, and routing simple calls to cheap tiers.

Engineering leverage is enormous here. Routing, caching, and prompt compression commonly cut token spend 40–70% with little or no quality loss. At 2026 volumes, that is real money.
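Per-workload metering can start very small. The following is a minimal sketch, not a production design: `TokenMeter`, the workload names, the budgets, and the workhorse price are all hypothetical, and only the $0.90/M frontier figure comes from our index.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# $ per 1M tokens. The workhorse price is a placeholder; use your provider's rates.
PRICE_PER_M = {"frontier": 0.90, "workhorse": 0.05}


@dataclass
class TokenMeter:
    """Accumulates token usage and cost per workload."""

    budgets: dict  # workload name -> monthly budget in $
    spend: dict = field(default_factory=lambda: defaultdict(float))
    tokens: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, workload: str, tier: str, n_tokens: int) -> None:
        """Add one call's tokens and cost to the workload's running totals."""
        self.tokens[workload] += n_tokens
        self.spend[workload] += n_tokens / 1_000_000 * PRICE_PER_M[tier]

    def over_budget(self) -> list:
        """Workloads whose spend exceeds their budget (unbudgeted ones never trip)."""
        return [
            w for w, s in self.spend.items()
            if s > self.budgets.get(w, float("inf"))
        ]


meter = TokenMeter(budgets={"support-chat": 50.0, "code-agent": 100.0})
meter.record("support-chat", "workhorse", 20_000_000)
meter.record("code-agent", "frontier", 150_000_000)
print(meter.over_budget())
```

The same check could run in CI or a nightly job and fail loudly when a workload crosses its line, which is the cloud-FinOps pattern applied to tokens.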

Token-spend optimization levers, ranked by typical savings

Lever                                       Typical savings   Effort   Quality risk
Model routing (cheap tier for easy calls)   30–50%            Medium   Low
Prompt/context caching                      20–40%            Low      None
Prompt compression & dedup                  10–25%            Low      Low
Batch APIs for async work                   25–50%            Low      None
Distilled task-specific models              60–90%            High     Medium

Source: The Narraitive engineering interviews (illustrative preview data)
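The levers in the table do not simply add up. As a rough sketch, if each lever removes a fraction of whatever spend remains after the previous one, the savings stack multiplicatively. Treating the levers as independent is an assumption: in practice they overlap (a cached prompt has less left to compress), so treat the result as an upper bound.

```python
def combined_savings(savings: list) -> float:
    """Total fraction saved when each lever cuts the spend left by the previous one."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining


# Low and high ends of the table's ranges for routing, caching, and compression.
low = combined_savings([0.30, 0.20, 0.10])
high = combined_savings([0.50, 0.40, 0.25])

print(f"Stacked savings: {low:.1%} to {high:.1%}")
```

Under the independence assumption the three levers stack to roughly 49.6% at the low end and 77.5% at the high end. The 40–70% range reported in interviews sits somewhat below that, which is what overlapping levers would produce.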

Token-budget guardrail: route by task difficulty (Python)
def route_model(task, monthly_spend, budget):
    """Send easy calls to the cheap tier; protect the budget."""
    if monthly_spend > 0.9 * budget:
        return "workhorse"          # past 90% of budget: force the cheap tier
    if task.estimated_difficulty < 0.4:
        return "workhorse"          # 40x cheaper, fine for easy tasks
    if task.requires_long_context:
        return "frontier-cached"    # cache repeated context
    return "frontier"
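A short usage sketch for the guardrail above. The `Task` dataclass is hypothetical: it simply carries the two attributes the function reads. The function is repeated so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor with the fields route_model reads."""

    estimated_difficulty: float          # 0.0 (trivial) to 1.0 (hard)
    requires_long_context: bool = False


def route_model(task, monthly_spend, budget):
    """Send easy calls to the cheap tier; protect the budget."""
    if monthly_spend > 0.9 * budget:
        return "workhorse"
    if task.estimated_difficulty < 0.4:
        return "workhorse"
    if task.requires_long_context:
        return "frontier-cached"
    return "frontier"


budget = 10_000
print(route_model(Task(0.2), 1_000, budget))        # easy task
print(route_model(Task(0.8), 1_000, budget))        # hard task
print(route_model(Task(0.8, True), 1_000, budget))  # hard task, long context
print(route_model(Task(0.8, True), 9_500, budget))  # same task, budget nearly spent
```

Note the ordering of the checks: the budget test runs first, so once spend passes 90% of budget even hard, long-context tasks fall back to the cheap tier. Whether that is the right trade-off depends on the workload.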

Methodology

Price series track the cheapest published price for a constant capability level, normalized across providers. Spend estimates blend public surveys with disclosed cloud-AI revenue growth. Preview note: this starter article ships with illustrative mock data generated by The Narraitive's refresh pipeline; live data connections replace it at launch.
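As a rough sketch of the approach described here, the snippet below takes the cheapest quote per period at a fixed capability tier and rebases the series to an index. It is not the refresh pipeline itself. The provider names, tier label, and prices are made-up placeholders, and it assumes quotes have already been judged capability-equivalent.

```python
# Made-up placeholder quotes: (period, provider, capability tier, $ per 1M tokens).
QUOTES = [
    ("2025-H1", "provider_a", "tier_x", 8.00),
    ("2025-H1", "provider_b", "tier_x", 6.00),
    ("2025-H2", "provider_a", "tier_x", 2.50),
    ("2025-H2", "provider_b", "tier_x", 3.00),
]


def cheapest_price_series(quotes, tier):
    """Cheapest published price per period for one capability tier."""
    series = {}
    for period, _provider, quote_tier, price in quotes:
        if quote_tier != tier:
            continue
        series[period] = min(price, series.get(period, float("inf")))
    return dict(sorted(series.items()))


def to_index(series, base_period):
    """Rebase a price series so that the base period equals 100."""
    base = series[base_period]
    return {period: 100.0 * value / base for period, value in series.items()}


series = cheapest_price_series(QUOTES, "tier_x")
print(series)
print(to_index(series, "2025-H1"))
```

The hard part in practice is the equivalence judgment that decides which quotes share a tier, which is why it is listed under risks and limitations.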

Data sources

  • Published price sheets from major model providers (2023–2026)
  • Public enterprise-spend surveys and earnings disclosures
  • The Narraitive engineering interviews on optimization levers

Data freshness

Published Mar 2, 2026. Narrative last updated Jun 5, 2026. Underlying data last refreshed Jun 11, 2026 by the automated pipeline; charts and tables on this page render from those artifacts. If a refresh fails, the previous good data remains live.

What changed since last refresh

  • Jun 5: 2026 H1 price points updated; frontier tier now $0.90/M tokens in our index.
  • Jun 5: Agent token multiplier range widened to 15–80x from 15–60x after new workload data.
  • Apr 20: Added batch-API row to optimization table.

Risks and limitations

  • Provider price sheets are list prices; negotiated enterprise pricing differs.
  • Capability-equivalence across providers is judgment-based.
  • A GPU supply shock could pause or reverse the price decline temporarily.

Frequently asked questions

Are AI inference costs going up or down?
Per-token prices are falling roughly 10x per year at constant capability. Total spend is rising for most companies because token consumption is growing faster than prices fall.

Why is my company's AI bill growing if model prices dropped?
Three drivers: agentic workloads consume 15–80x the tokens of simple chat, longer contexts multiply input tokens, and teams migrate to newer, pricier capability tiers as they ship.
