
The pace of AI, in perspective
March 20, 2026
Most takes on AI progress are either breathless ("everything changes tomorrow") or dismissive ("it's all hype"). Neither survives contact with the numbers. The useful view is the slope — how fast capability is climbing, how fast cost is falling, and what the gap between the two is doing to anyone building on top.
The slope, roughly
Frontier training compute has been roughly doubling every 6 months since 2019. That is not a marketing line; it is what you get when you plot disclosed or credibly estimated training FLOPs for GPT-3, PaLM, GPT-4, Gemini, and the 2025–26 Claude and GPT families.
| Model Era | Representative Cost | Capability Anchor | Release Window |
|---|---|---|---|
| GPT-3 class | ~$4.6M train | ~57% MMLU | 2020 |
| GPT-4 class | ~$80M train | ~86% MMLU | 2023 |
| Frontier 2025 | ~$500M train | ~92% MMLU | 2025 |
| Frontier 2026 | ~$1B+ train | Approaching saturation | 2026 |
Two things jump out. Training cost has gone up ~200× in six years. And the capability curve has flattened at the top — MMLU is saturated, so it stopped being a useful yardstick some time ago.
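The arithmetic behind that multiple is worth a glance. A quick sketch using the representative costs from the table above (the dollar figures are the table's, not audited numbers):

```python
# Training-cost growth implied by the table's representative figures.
costs = {2020: 4.6e6, 2023: 80e6, 2026: 1.0e9}

total_multiple = costs[2026] / costs[2020]             # roughly 217x over six years
annual_growth = total_multiple ** (1 / (2026 - 2020))  # roughly 2.45x per year

print(f"{total_multiple:.0f}x total, {annual_growth:.2f}x annualized")
# prints: 217x total, 2.45x annualized
```

So "~200×" is the honest rounding, and it compounds at about 2.5× a year.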
Inference is the story you are actually paying for
Training is what OpenAI and Anthropic pay. Inference is what *you* pay. And inference cost per unit of capability is falling much faster than training cost is rising.
| Year | Representative Model | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| 2023 | GPT-4 (8k) | $30.00 | $60.00 |
| 2024 | GPT-4o / Claude 3.5 | $2.50–$3.00 | $10.00–$15.00 |
| 2025 | GPT-5 / Claude 4.5 | $1.25–$3.00 | $5.00–$15.00 |
| 2026 | Haiku-class frontier | $0.25–$1.00 | $1.25–$5.00 |
Depending on which tier you compare, that is a 10–100× drop in three years at the same or better quality. This is the part of the curve builders actually feel. The product you could not afford to ship in 2024 is profitable in 2026. The feature you prototyped and shelved because the per-request cost was $0.40 now costs $0.02.
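To see what that means per request, here is a back-of-envelope sketch. The 3,000-in / 500-out token shape is an assumption; the prices come from the table above:

```python
def request_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Cost of one request, with prices quoted in $ per 1M tokens."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6

# The same hypothetical request priced at 2023 GPT-4 rates
# and at the 2026 Haiku-class ceiling from the table.
cost_2023 = request_cost(3000, 500, 30.00, 60.00)  # $0.12
cost_2026 = request_cost(3000, 500, 1.00, 5.00)    # $0.0055
```

Twelve cents versus half a cent, for the same request, three years apart.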
Benchmark saturation is a signal, not a victory lap
| Benchmark | 2022 SOTA | 2024 SOTA | 2026 SOTA | Human expert |
|---|---|---|---|---|
| MMLU | 67% | 88% | ~93% | ~90% |
| HumanEval | 48% | 92% | ~98% | ~95% |
| GPQA (hard science) | — | 50% | ~80% | ~70% |
| SWE-bench Verified | — | 18% | 70%+ | — |
When a benchmark saturates against expert humans, it stops being a measurement and becomes a floor. The interesting question shifts from "can the model do this" to "how reliably, how cheaply, and at what latency."
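One way to keep measuring past saturation is to fold cost into the metric. A sketch, with illustrative accuracy and price figures that are not drawn from any published benchmark:

```python
def cost_per_correct(accuracy, cost_per_call):
    # Expected spend to obtain one correct answer,
    # assuming failures are independent and you retry.
    return cost_per_call / accuracy

frontier = cost_per_correct(0.93, 0.050)  # hypothetical big model
small = cost_per_correct(0.90, 0.004)     # hypothetical small model
```

The small model "loses" the benchmark by three points and wins on cost per correct answer by more than 10×. That is the shape of the post-saturation question.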
What this means if you are building
The compounding lesson
The playbook: prototype on a frontier model, ship, wait 6–12 months, migrate to a smaller, cheaper model at the same quality, and watch the unit economics flip. In effort terms, that is about a week to build, two days to migrate, and three to four quarters end to end. The key advantage: the tailwind is doing half your work. If your product is GPT-4-class capable today, it will be Haiku-class cost within 12 months without you touching the model.
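That migration step does not have to be a judgment call. A minimal gate, assuming you already run an eval suite over both models; every name and threshold here is hypothetical:

```python
def should_migrate(frontier_score, small_score, frontier_cost, small_cost,
                   max_quality_drop=0.01, min_cost_ratio=3.0):
    """Migrate only if quality holds within tolerance AND the savings are real."""
    quality_holds = (frontier_score - small_score) <= max_quality_drop
    savings_worth_it = (frontier_cost / small_cost) >= min_cost_ratio
    return quality_holds and savings_worth_it

should_migrate(0.92, 0.915, 0.050, 0.005)  # True: half-point drop, 10x cheaper
should_migrate(0.92, 0.880, 0.050, 0.005)  # False: four-point quality drop
```

The point is not the thresholds; it is that "wait and migrate" becomes a scheduled check against your evals rather than a debate.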
Where builders get this wrong
Three failure modes I keep seeing:
1. **Over-optimizing for today's prices.** Spending engineering months on caching, distillation, and routing gymnastics to shave costs that the curve will cut 5–10× on its own within a year.
2. **Under-engineering for today's capability.** Still routing every request through a small model when the frontier model is 4× better at the thing that actually matters in your product, and now costs within an order of magnitude.
3. **Building the wrong moat.** "We use GPT-4" is not a moat. "We built the interface and the evals and the domain data that make an agent trustworthy in our specific workflow" is.
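The fix for the under-engineering failure mode is to route on value, not on habit. A sketch with illustrative numbers; the quality and cost figures are assumptions, not measurements:

```python
def pick_model(task_value_usd,
               frontier_quality=0.95, small_quality=0.80,
               frontier_cost=0.050, small_cost=0.004):
    # Use the frontier model when its quality edge on this request
    # is worth more than the extra spend; otherwise the small model wins.
    edge_value = (frontier_quality - small_quality) * task_value_usd
    extra_cost = frontier_cost - small_cost
    return "frontier" if edge_value > extra_cost else "small"

pick_model(10.00)  # "frontier": a 15-point edge on a $10 task beats $0.046
pick_model(0.10)   # "small": the edge is worth ~$0.015, less than the extra cost
```

On this toy model, anything where a correct answer is worth more than about thirty cents justifies the frontier call; habit-based routing never asks the question.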
The actual rate of change, felt in a product
| Year | What took an afternoon | What took a week | What was unbuildable |
|---|---|---|---|
| 2022 | Classification prompts | Semantic search | Multi-step agents |
| 2024 | Semantic search | Multi-step agents | Reliable tool use on messy data |
| 2026 | Multi-step agents | Voice-first agents over live APIs | Long-horizon autonomy |
The unbuildable row is the one worth watching. The Osmo-class product — voice-first AI that touches your calendar, your email, your actual day — was not a 2024 product. It is a 2026 product. Not because the idea was new, but because the reliability floor finally crossed the threshold where someone would actually let it act on their behalf.
The honest summary
AI progress is not "everything changes tomorrow" and it is not "all hype." It is a compounding curve where training costs go up and inference costs fall much faster, and where the gap between those two is the window consumer products get built in.
If you are building, the job is to pick an idea that was unbuildable 18 months ago, is just barely shippable today, and will be trivially cheap 12 months from now. Everything else is either chasing last year's product or building for a frontier that does not exist yet.