Metrics to Track

The small set of quality, cost, latency, and reliability numbers you instrument on every agent run so you can tell whether a change made things better or worse.

Why it matters

Agents fail quietly: a prompt tweak that fixes one case silently regresses ten others, and you only find out from users. Without metrics you are flying blind on the three axes that matter — is it correct, is it fast enough, is it affordable — and you can’t compare model A vs B or yesterday vs today. The point is to make every change falsifiable against a baseline.

How it works

Group metrics by axis and track each per-step and per-task, because a cheap step inside a 30-step loop can still blow the budget.

AxisMetricWhy
Qualitytask success ratedid it actually finish the goal
Qualitytool-call validity %malformed/failed calls per run
Costtokens & $ per tasktoken-based-pricing, not per call
Latencyp50 / p95 / p99tails kill UX, not the mean
Latencytime-to-first-tokenstreaming feel
Reliabilityerror / retry rateAPI + tool failures
Loopsteps per taskdrift / runaway detector
  • Per-task aggregation. One agent-loop = many model calls; roll cost and latency up to the task so a 6-call task shows as one $ number.
  • Percentiles, not averages. A mean of 2s hides a p99 of 40s; always chart p95/p99.
  • Online vs offline. Online = live traffic dashboards; offline = a frozen eval set scored in CI (see integration-testing-for-flows).
  • Quality needs a judge. Success is rarely a string match — use a checker, an LLM judge, or human labels.

Example

A dashboard row for a RAG agent over 1,000 runs:

task success      82%   (target 90%)
tool-call valid   96%
steps/task p50/p95   4 / 11   ← p95 spike = looping
tokens/task        9.2k     $0.041/task
latency p50/p95    3.1s / 18s  ← investigate the 18s tail

The p95 of 11 steps flags runs where the loop never converges — invisible in the average of ~5.

Pitfalls

  • Tracking averages only. The mean looks fine while the p99 is on fire; users live in the tail.
  • Per-call cost, not per-task. A “cheap” call × 30 loop iterations is an expensive task; aggregate up.
  • No baseline. A 78% success rate is meaningless without last week’s number to compare against.
  • Vanity metrics. Total request count or token volume don’t tell you if the agent works; anchor on success and cost-per-task.

See also