AI Inference Costs Are Falling 10x a Year. Cloud Bills Aren't.

Per-token prices keep collapsing, but usage growth and capability creep mean most companies' AI spend is still rising. Both facts are true — and the gap is the story.

Published Mar 2, 2026 · Updated Jun 5, 2026 · Data refreshed Jun 11, 2026 · 3 min read

inference · LLM pricing · cloud spend · unit economics

AI Pulse · Pro · updated Jun 11, 2026 · Cautious

Latest: 2026 H1 price points updated; frontier tier now $0.90/M tokens in our index.

Market read: Inference is becoming the cheapest unit in software history while AI becomes many companies' fastest-growing line item.

Token price trend: −90%/yr (frontier tier)
Median enterprise spend: 2.4x YoY (2025 est.)
Agent token multiplier: 15–80x (vs single-shot chat)
Capability deflation: ~40x (same quality, 18mo later)

Watch: Provider price sheets are list prices; negotiated enterprise pricing differs. · Capability-equivalence across providers is judgment-based.

AI-synthesized from The Narraitive's data pipeline (illustrative preview data). Analysis, never investment advice.

AI-readable summary

Benchmark per-token inference prices for frontier-class models have fallen roughly 10x per year since 2023. Despite this, median enterprise AI spend rose an estimated 2.4x year-over-year in 2025 because token consumption grew faster than prices fell, and workloads migrated to newer, pricier capability tiers. The result: unit costs collapse while bills grow. Budget owners who plan around the price curve alone systematically under-forecast spend.

TL;DR

Token prices are crashing; total AI bills are rising anyway. Usage growth (agents, longer contexts, multimodal) outruns price declines. Plan capacity around tokens consumed, not list price.

Key facts

Frontier-tier per-token prices have declined roughly 10x per year since 2023.
Median enterprise AI spend grew an estimated 2.4x YoY in 2025.
Agentic workloads consume 15–80x the tokens of single-shot chat for the same business task.
The cheapest capability tier in 2026 matches or outperforms the frontier tier of 18 months ago at roughly 1/40th the price.

Main thesis

Inference is becoming the cheapest unit in software history while AI becomes many companies' fastest-growing line item. These are the same phenomenon: falling unit costs unlock workloads that were previously uneconomical, and those workloads consume orders of magnitude more tokens. The companies that win the next two years treat tokens like cloud compute circa 2015 — metered, budgeted, and engineered — not like a per-seat license.

The price curve, stated plainly

Across published price sheets from the major model providers, the cost to generate a million tokens at a given capability level has fallen roughly tenfold per year since 2023. For comparison, Moore's law describes a doubling roughly every two years, so this decline is far steeper, and it also outpaces early cloud-storage deflation.

The drivers are stacked, not singular: better hardware utilization, distillation, sparsity techniques, batching improvements, and genuine competition. None of these is exhausted.

Figure: Price per million tokens, frontier-equivalent capability. Y-axis: $ / 1M tokens (approximately log scale). Series: Frontier tier, Workhorse tier. Roughly 10x annual decline at constant capability.

Source: The Narraitive compilation of published provider price sheets (illustrative preview data)
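To make the curve concrete, here is a minimal sketch that converts the headline "10x per year" rate into other time horizons. It assumes a smooth exponential decline, which is a simplification (real price sheets move in steps), and the function names are illustrative only.

```python
import math

ANNUAL_FACTOR = 10.0  # from the article: ~10x cheaper per year at constant capability


def price_factor(months: float, annual_factor: float = ANNUAL_FACTOR) -> float:
    """How many times cheaper a constant-capability token is after `months`."""
    return annual_factor ** (months / 12.0)


def halving_time_months(annual_factor: float = ANNUAL_FACTOR) -> float:
    """Months for the price to halve, assuming smooth exponential decline."""
    return 12.0 * math.log(2) / math.log(annual_factor)


print(f"12 months: {price_factor(12):.1f}x cheaper")   # 10.0x
print(f"18 months: {price_factor(18):.1f}x cheaper")   # 31.6x
print(f"price halves about every {halving_time_months():.1f} months")  # 3.6
```

Under this assumption the 18-month factor comes out near 32x, the same order as the ~40x capability-deflation figure quoted above; the difference sits comfortably inside the "roughly" attached to both numbers.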

Why bills rise anyway

Spending is price times volume, and volume is exploding on three axes. First, agentic workloads: a task that took one prompt in 2024 now runs a loop of plan, search, read, and verify steps — consuming 15 to 80 times the tokens. Second, context length: feeding a model your whole codebase or document store multiplies input tokens per call. Third, capability migration: teams upgrade to each new tier within months, resetting their unit price upward.

The result is a textbook Jevons effect. Median enterprise AI spend grew an estimated 2.4x in 2025 even as every individual API call got cheaper.
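The arithmetic behind "price times volume" can be sketched in a few lines. This is a back-of-envelope illustration, not part of the underlying dataset: it takes the two headline figures (2.4x spend growth, 10x price decline) and solves for the token volume growth they imply. The 3x case is a hypothetical value, used only to show how capability migration changes the answer.

```python
def implied_volume_growth(spend_growth: float, price_drop_factor: float) -> float:
    """spend = price * volume, so volume growth = spend growth * price drop factor.

    spend_growth:      e.g. 2.4 means spend is 2.4x last year's
    price_drop_factor: e.g. 10 means each token costs 1/10th of last year's price
    """
    if spend_growth <= 0 or price_drop_factor <= 0:
        raise ValueError("both factors must be positive")
    return spend_growth * price_drop_factor


# Headline figures: 2.4x spend growth with a 10x cheaper token at constant capability.
print(f"{implied_volume_growth(2.4, 10.0):.0f}x more tokens")   # 24x

# Hypothetical: if upgrading tiers means the effective price paid fell only 3x,
# the same bill growth needs far less volume growth.
print(f"{implied_volume_growth(2.4, 3.0):.1f}x more tokens")    # 7.2x
```

Either way, volume has to grow faster than price falls for the bill to rise, which is the pattern the agentic and long-context drivers produce.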

Figure: Indexed token price vs median enterprise AI spend (2024 = 100). Y-axis: index. Series: Price per token, Median enterprise spend.

Source: The Narraitive estimates from survey and disclosure data (illustrative preview data)

Interpretation: treat tokens like cloud compute, not licenses

Our opinion: most AI budgeting is still per-seat thinking applied to a metered resource. Finance teams approve a 'Copilot line item' and are then surprised by a consumption curve. The fix is boring and proven — it is exactly what FinOps did to cloud spend a decade ago: per-workload metering, token budgets in CI, caching layers, and routing simple calls to cheap tiers.

Engineering leverage is enormous here. Routing, caching, and prompt compression commonly cut token spend 40–70% with little or no quality loss. At 2026 volumes, that is real money.
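Per-workload metering is the piece most teams skip, so here is a minimal sketch of what it could look like. The `TokenMeter` class, the workload names, and the budget numbers are all hypothetical; a real setup would pull token counts from provider usage reports or API responses rather than record them by hand.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenMeter:
    """Tracks tokens used per workload against a monthly token budget."""
    budgets: dict[str, int]
    used: dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def record(self, workload: str, input_tokens: int, output_tokens: int) -> None:
        """Add one call's token usage to the workload's running total."""
        self.used[workload] += input_tokens + output_tokens

    def over_budget(self) -> list[str]:
        """Workloads whose usage exceeds their budget, sorted by name."""
        return sorted(w for w, cap in self.budgets.items() if self.used[w] > cap)


meter = TokenMeter(budgets={"support-agent": 1_000_000, "search-summaries": 200_000})
meter.record("support-agent", input_tokens=900_000, output_tokens=150_000)
meter.record("search-summaries", input_tokens=50_000, output_tokens=10_000)
print(meter.over_budget())  # ['support-agent']
```

A check like `over_budget()` is the kind of thing that can fail a CI job or page an owner, which is what turns a token budget from a spreadsheet number into an engineering constraint.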

Token-spend optimization levers, ranked by typical savings

Lever	Typical savings	Effort	Quality risk
Distilled task-specific models	60–90%	High	Medium
Model routing (cheap tier for easy calls)	30–50%	Medium	Low
Batch APIs for async work	25–50%	Low	None
Prompt/context caching	20–40%	Low	None
Prompt compression & dedup	10–25%	Low	Low

Source: The Narraitive engineering interviews (illustrative preview data)
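The savings in the table do not simply add up when levers are combined. A rough sketch, assuming each lever applies independently to whatever spend the previous ones left behind (an assumption, since real levers overlap):

```python
from functools import reduce


def combined_savings(savings: list[float]) -> float:
    """Total fractional savings when each lever cuts the remaining spend in turn."""
    if any(not 0.0 <= s <= 1.0 for s in savings):
        raise ValueError("each saving must be a fraction between 0 and 1")
    remaining = reduce(lambda acc, s: acc * (1.0 - s), savings, 1.0)
    return 1.0 - remaining


# Routing, caching, and compression at the low and high ends of the table's ranges.
low = combined_savings([0.30, 0.20, 0.10])
high = combined_savings([0.50, 0.40, 0.25])
print(f"low ends:  {low:.1%}")    # 49.6%
print(f"high ends: {high:.1%}")   # 77.5%
```

The low ends alone compound to about half of spend, which is in the same neighborhood as the 40–70% figure quoted for these three levers. In practice the levers interact (a call already routed to a cheap tier has less left to save through caching), so treat the high end as a ceiling rather than a forecast.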

Token-budget guardrail: route by task difficulty (Python)

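from dataclasses import dataclass

@dataclass
class Task:
    # Hypothetical task descriptor; fields mirror what route_model reads.
    estimated_difficulty: float          # 0.0 (trivial) to 1.0 (hard)
    requires_long_context: bool = False

def route_model(task: Task, monthly_spend: float, budget: float) -> str:
    """Send easy calls to the cheap tier; protect the budget."""
    if monthly_spend > 0.9 * budget:
        return "workhorse"          # over 90% of budget: cheap tier only
    if task.estimated_difficulty < 0.4:
        return "workhorse"          # 40x cheaper, fine for easy tasks
    if task.requires_long_context:
        return "frontier-cached"    # cache repeated context
    return "frontier"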
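To see what a router like the one above is worth, a rough estimate of the blended price helps. This sketch uses the $0.90/M frontier price from the index and takes the "40x cheaper" comment at face value for the workhorse tier; both the constant names and the idea of measuring routing by share of tokens are assumptions made for illustration.

```python
FRONTIER_PRICE = 0.90                    # $ per 1M tokens, from the article's index
WORKHORSE_PRICE = FRONTIER_PRICE / 40    # assumption: workhorse tier is 40x cheaper


def blended_price(easy_share: float) -> float:
    """Average $ per 1M tokens when `easy_share` of tokens go to the workhorse tier."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return easy_share * WORKHORSE_PRICE + (1.0 - easy_share) * FRONTIER_PRICE


def routing_savings(easy_share: float) -> float:
    """Fractional savings versus sending every token to the frontier tier."""
    return 1.0 - blended_price(easy_share) / FRONTIER_PRICE


for share in (0.3, 0.5):
    print(f"{share:.0%} routed: ${blended_price(share):.3f}/M, saves {routing_savings(share):.1%}")
```

With a 40x price gap, savings track the routed share almost one for one, so the 30–50% range in the table corresponds to moving roughly a third to a half of token volume onto the cheap tier.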

Methodology

Price series track the cheapest published price for a constant capability level, normalized across providers. Spend estimates blend public surveys with disclosed cloud-AI revenue growth. Preview note: this starter article ships with illustrative mock data generated by The Narraitive's refresh pipeline; live data connections replace it at launch.

Data sources

Published price sheets from major model providers (2023–2026)
Public enterprise-spend surveys and earnings disclosures
The Narraitive engineering interviews on optimization levers

Data freshness

Published Mar 2, 2026. Narrative last updated Jun 5, 2026. Underlying data last refreshed Jun 11, 2026 by the automated pipeline; charts and tables on this page render from those artifacts. If a refresh fails, the previous good data remains live.

What changed since last refresh

Jun 5: 2026 H1 price points updated; frontier tier now $0.90/M tokens in our index.
Jun 5: Agent token multiplier range widened to 15–80x from 15–60x after new workload data.
Apr 20: Added batch-API row to optimization table.

Risks and limitations

Provider price sheets are list prices; negotiated enterprise pricing differs.
Capability-equivalence across providers is judgment-based.
A GPU supply shock could pause or reverse the price decline temporarily.

Frequently asked questions

Are AI inference costs going up or down?
Per-token prices are falling roughly 10x per year at constant capability. Total spend is rising for most companies because token consumption is growing faster than prices fall.

Why is my company's AI bill growing if model prices dropped?
Three drivers: agentic workloads consume 15–80x the tokens of simple chat, longer contexts multiply input tokens, and teams migrate to newer, pricier capability tiers as they ship.
