Metrics to Track

The small set of quality, cost, latency, and reliability numbers you instrument on every agent run so you can tell whether a change made things better or worse.

Why it matters

Agents fail quietly: a prompt tweak that fixes one case silently regresses ten others, and you only find out from users. Without metrics you are flying blind on the three axes that matter — is it correct, is it fast enough, is it affordable — and you can’t compare model A vs B or yesterday vs today. The point is to make every change falsifiable against a baseline.

How it works

Group metrics by axis and track each per-step and per-task, because a cheap step inside a 30-step loop can still blow the budget.

Axis	Metric	Why
Quality	task success rate	did it actually finish the goal
Quality	tool-call validity %	malformed/failed calls per run
Cost	tokens & $ per task	token-based-pricing, not per call
Latency	p50 / p95 / p99	tails kill UX, not the mean
Latency	time-to-first-token	streaming feel
Reliability	error / retry rate	API + tool failures
Loop	steps per task	drift / runaway detector

Per-task aggregation. One agent-loop = many model calls; roll cost and latency up to the task so a 6-call task shows as one $ number.
Percentiles, not averages. A mean of 2s hides a p99 of 40s; always chart p95/p99.
Online vs offline. Online = live traffic dashboards; offline = a frozen eval set scored in CI (see integration-testing-for-flows).
Quality needs a judge. Success is rarely a string match — use a checker, an LLM judge, or human labels.

Example

A dashboard row for a RAG agent over 1,000 runs:

task success      82%   (target 90%)
tool-call valid   96%
steps/task p50/p95   4 / 11   ← p95 spike = looping
tokens/task        9.2k     $0.041/task
latency p50/p95    3.1s / 18s  ← investigate the 18s tail

The p95 of 11 steps flags runs where the loop never converges — invisible in the average of ~5.

Pitfalls

Tracking averages only. The mean looks fine while the p99 is on fire; users live in the tail.
Per-call cost, not per-task. A “cheap” call × 30 loop iterations is an expensive task; aggregate up.
No baseline. A 78% success rate is meaningless without last week’s number to compare against.
Vanity metrics. Total request count or token volume don’t tell you if the agent works; anchor on success and cost-per-task.

tech-studies

Explorer

Metrics to Track

Metrics to Track

Why it matters

How it works

Example

Pitfalls

See also

Graph View

Table of Contents

Backlinks