Human-in-the-loop Evaluation

Using human judgment — ratings, pairwise preferences, or approvals — to score agent output on the subjective dimensions automated metrics can’t capture.

Why it matters

“Is this answer helpful, on-tone, and safe?” has no string match. Humans are the ground truth that calibrates cheaper proxies: you collect a few hundred human labels, then validate that an LLM judge agrees with them before trusting the judge to scale. Human review is also the literal control surface for risky actions — a human approving a payment or an email send is HITL acting as both gate and evaluator.

How it works

Two distinct modes, often confused: evaluation (offline, scoring quality) and oversight (inline, approving an action mid-agent-loop).

Method	What annotators do	Best for
Direct rating	score 1–5 / thumbs	absolute quality tracking
Pairwise A/B	pick better of two	comparing model/prompt versions
Binary pass/fail	meets rubric?	regression gates
Approve / edit	accept or fix an action	inline oversight

Pairwise beats Likert. Humans rank two outputs far more consistently than they assign absolute scores; aggregate the wins into an Elo rating or a win rate.
Rubrics + multiple raters. A written rubric plus 2–3 raters per item, with inter-annotator agreement measured (Cohen’s κ for two raters, Fleiss’ κ for three or more), turns vibes into a defensible number.
Inline gate. Pause the run (LangGraph interrupt), surface the proposed tool call, and resume on approval; this is LangGraph’s human-in-the-loop pattern.
Calibrate the judge. Use human labels as the gold set and measure how closely an automated judge’s verdicts agree with them before the judge replaces human review.
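The agreement check behind both the multi-rater step and the judge-calibration step can be sketched in a few lines. This is a minimal, library-free version of Cohen’s κ for two label sequences; the function name `cohen_kappa` and the sample labels are illustrative assumptions, not part of any tool named above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical gold labels from a human and verdicts from an LLM judge.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohen_kappa(human, judge)
print(round(kappa, 2))  # 0.75
```

The same function works for two human raters or for a human versus a judge. With three or more raters, a multi-rater statistic such as Fleiss’ κ is the usual substitute.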

Example

Shipping a new system prompt for a support agent:

sample 200 real conversations → run old vs new prompt
present blind A/B to 3 reviewers (rubric: correct, polite, complete)
  new wins 124, old wins 61, tie 15  → win-rate 67% (124 / 185, ties excluded)
inter-rater κ = 0.71 (substantial agreement) → trust the result → ship

The 67% blind win-rate is decision-grade evidence; a raw “feels better” is not.
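To see how much sampling noise sits around that figure, one option is a confidence interval on the win rate. The sketch below uses a Wilson score interval with ties dropped; the function name `win_rate_interval` is hypothetical, and it treats the 185 decided comparisons as independent, which is an assumption rather than something the example guarantees.

```python
import math


def win_rate_interval(wins, losses, z=1.96):
    """Win rate with a Wilson score interval (z=1.96 is roughly 95%).

    Ties are assumed to have been excluded by the caller.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("no decided comparisons")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half


# Counts from the example: new wins 124, old wins 61 (15 ties dropped).
rate, low, high = win_rate_interval(124, 61)
print(f"win rate {rate:.2f}, interval {low:.2f} to {high:.2f}")
```

If the lower bound sits above 0.5, as it does for these counts, the new prompt's lead is unlikely to be a sampling artifact under the independence assumption.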

Pitfalls

No rubric. Without explicit criteria, raters drift and agreement collapses; ratings become noise.
Single annotator. One person’s bias becomes ground truth; use multiple raters and report κ.
Position/length bias. Reviewers favor the first-shown or longer answer — randomize order, control for length.
Reviewing everything. Human review doesn’t scale to all traffic; sample, or gate only high-risk actions (see safety-red-team-testing).
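The order-randomization fix for position bias can be sketched as follows. The helper names `blind_pair` and `unblind` are hypothetical; the idea is only that each reviewer sees the two outputs in a random order and the mapping back to old/new is kept out of their view.

```python
import random


def blind_pair(old_output, new_output, rng):
    """Return (first_shown, second_shown, key) with the order randomized.

    `key` maps the reviewer-facing slots back to the real variants and
    should be stored away from the review UI.
    """
    if rng.random() < 0.5:
        return old_output, new_output, {"first": "old", "second": "new"}
    return new_output, old_output, {"first": "new", "second": "old"}


def unblind(choice, key):
    """Translate a reviewer's choice ('first', 'second', or 'tie')."""
    if choice == "tie":
        return "tie"
    return key[choice]


rng = random.Random(42)  # seeded only so the sketch is reproducible
first, second, key = blind_pair("old answer", "new answer", rng)
winner = unblind("first", key)  # "old" or "new", depending on the shuffle
```

Length bias is not handled here; one option is to record output lengths alongside each verdict and check whether the longer answer wins disproportionately.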
