Pricing of Common Models

A practical map of what the major LLM tiers cost per million tokens, and how to pick a tier so an agent isn’t paying frontier rates for trivial steps.

Why it matters

Across a model family, prices span ~100× from the cheapest small model to the top reasoning model — so which model runs each step dwarfs almost every other optimization. Agents make many calls per task, and most are easy (routing, formatting, a tool call). Matching tier to difficulty is the single biggest lever on the bill, ahead of prompt trimming. (See the meter mechanics in token-based-pricing.)

How it works

Vendors stratify a family into small / mid / large (frontier or reasoning) tiers. Output is the costly side, so reasoning models — which emit many hidden thinking tokens — are pricey even when the visible answer is short.

Treat all numbers as order-of-magnitude; published rates change often, so verify against the vendor’s live pricing page.
Default to a small/mid model, escalate to a large or reasoning-vs-standard-models tier only on hard steps.
Stack discounts: prompt caching, batch (~−50%), and shorter outputs apply on top of the per-token rate.

Tier (typical)	~Input $/1M	~Output $/1M	Use in an agent
Small (Haiku, GPT-4o-mini, Flash)	$0.15–0.50	$0.60–2	routing, parsing, tool calls
Mid (Sonnet, GPT-4o)	$2.5–5	$10–15	main reasoning, drafting
Large/reasoning (Opus, o-series)	$10–20	$40–75	hard planning, math, code

Example

A task makes 6 calls; compare all-frontier vs tiered routing (rates above):

all on mid ($3/$15):   6 calls × ~$0.024  ≈ $0.14/task
tiered:
  4 easy calls → small (~$0.002 each)     ≈ $0.008
  2 hard calls → mid    (~$0.024 each)    ≈ $0.048
  total                                   ≈ $0.056/task  (~2.5× cheaper)

Push the 4 easy calls onto a small model and most of the spend evaporates while the hard reasoning still runs on the capable tier.

Pitfalls

Hard-coding stale prices. Rates and tiers shift monthly; read live pages and re-check before forecasting spend.
One model for everything. Running an entire loop on a frontier/reasoning tier overpays 5–20× for steps a small model nails.
Comparing input rates only. Output and hidden reasoning tokens drive cost; a “cheap input” reasoning model can be the most expensive overall.
Ignoring caching/batch in estimates. Forecasting at list price overstates spend ~2× when a stable prefix or async batch applies.

tech-studies

Explorer

Pricing of Common Models

Pricing of Common Models

Why it matters

How it works

Example

Pitfalls

See also

Graph View

Table of Contents

Backlinks