Latency vs throughput

Latency is the time for one request to complete; throughput is how many requests complete per unit time — related but independently optimizable.

Why it matters

You design for both, and they trade off. Batching and queuing raise throughput but add latency; replicating for low latency can cap throughput per node. SLOs are usually written on tail latency (p99/p99.9), not the average, because at scale the slow tail is what users actually hit — a request that fans out to 100 services sees the slowest of 100, so a rare 1-in-100 stall becomes the common case.

How it works

Think of a pipe: latency is its length, throughput is its width. Little’s Law ties them: L = λ × W — concurrent requests in flight = arrival rate × latency. A useful corollary: with bounded concurrency, lower latency directly buys higher throughput.

  • Report percentiles, never just the mean — p50/p95/p99/p99.9. One GC pause skews the mean but shows up honestly at the tail.
  • Batching/pipelining amortizes fixed cost (round trips, fsync) across many items → throughput up, per-item latency up.
  • A queue (message-queues) decouples producer/consumer rate; under overload it must shed or apply back-pressure or latency grows unbounded.
LeverLatencyThroughput
Batch writesworsebetter
Add replica/shardsame/betterbetter
Bigger queueworsehides spikes

A rough latency ladder worth memorizing: L1 ~1 ns, main memory ~100 ns, SSD read ~100 µs, intra-DC round trip ~0.5 ms, cross-continent ~150 ms.

Example

A logging service must absorb 1M events/s. Writing each individually fsyncs per event (~1 ms each) and collapses. Batching 10K events per flush raises throughput past 1M/s while adding only ~10 ms of buffering latency — a deliberate latency-for-throughput trade users never notice on a log.

Pitfalls

  • Tuning the mean — a great average can hide a brutal p99.9.
  • Unbounded queues trading a fast failure for slowly rising latency until OOM; bound them and apply back-pressure.
  • Ignoring coordinated omission — load tools that pause during stalls under-report tail latency.

See also