Token-based Pricing

LLM APIs bill per token, not per request, and almost always charge output tokens several times more than input — so cost scales with how much text flows through the agent.

Why it matters

An agent loop re-sends its growing history every turn, so naive cost grows quadratically with conversation length even if each new message is small. Output is the expensive half, which is why verbose or reasoning-heavy responses dominate the bill. Understanding the meter is what lets you cut spend 5–10× with caching, routing, and budget control rather than just hoping.

How it works

Cost per call ≈ input_tokens × in_rate + output_tokens × out_rate, quoted per 1M tokens. Output typically costs 3–5× input.

  • Re-billed context. Turn N pays for all prior turns again as input; a 20-turn chat can spend more on resent history than on new content.
  • Prompt caching discounts repeated prefixes (system prompt, tools, few-shot) by ~50–90% on a cache hit — order stable content first.
  • Batch APIs offer ~50% off for async, non-urgent jobs (evals, backfills).
  • Reasoning tokens count. Hidden thinking is billed as output; an o-series step can cost far more than its visible reply.
LeverEffect on cost
Cap max_tokensbounds the pricey output side
Prompt caching−50–90% on repeated prefix
Cheaper model for easy stepsoften −5–10×
Summarize old turnsshrinks re-billed input
Batch endpoint~−50% (async)

Example

Per-call math at 15 per 1M (in/out):

input  4,000 tok × $3 /1M  = $0.012
output   800 tok × $15/1M  = $0.012
per call ≈ $0.024
agent does 6 LLM calls/task → ~$0.14/task → ~$14 per 100 tasks

Add prompt caching on a 3K-token static prefix (90% off) and route 3 of 6 calls to a cheap model: same task drops to roughly $0.04 — a ~3–4× cut with no quality loss on the easy steps.

Pitfalls

  • Ignoring re-billed history. Long loops blow the budget on resent context; cap turns and summarize (see context-windows).
  • Optimizing input, ignoring output. Output is the costly side — trimming a verbose prompt saves little if replies stay long; cap max_tokens.
  • Cache-busting prefixes. A timestamp or per-user token at the start of the prompt voids the cache; put volatile content last.
  • No per-user/run budget cap. A runaway agent can loop and spend thousands; enforce hard token and call ceilings.

See also