Safety / Red-Team Testing

Deliberately attacking your own agent — with adversarial prompts, injected content, and edge cases — to find safety failures before users or attackers do, then locking each one down with a regression test.

Why it matters

Guardrails and prompts are claims; red-teaming is the evidence. Because LLMs are non-deterministic, a single passing demo proves nothing — you need adversarial coverage to know the failure rate. It turns “we have safety” into a measurable number (attack success rate) and converts every breach into a permanent test, so a model swap or prompt edit can’t silently reopen a hole.

How it works

Treat safety like a test suite: a corpus of attacks, an automated runner, a pass/fail judge, and a tracked metric. Mix manual creativity with automated scale.

  • Attack corpus — jailbreaks, indirect injections in tool content, PII exfil attempts, toxicity elicitation, and the agent’s specific tool risks (can a prompt make it delete or spend?).
  • Automated red-teaming — tools (Garak, PyRIT, promptfoo, DeepEval) mutate seeds and run hundreds of variants; an LLM-judge or rule scores each outcome.
  • MetricAttack Success Rate = breaches / attempts. Track it per category over time; set a release gate (e.g. ASR < 1% on the known set).
  • Close the loop — every successful attack becomes a fixed regression case + a guardrail update; re-run in CI on every prompt/model change.
StageOutput
Generate attacksseed prompts × mutations
Run against agentresponses + tool calls
Judgepass / fail per attempt
ReportASR by category
Regressfailures pinned in CI

Example

suite: injection (50), jailbreak (40), pii-exfil (20), toxicity (30)
run vs agent-v0.4:
  injection   3/50  fail  → ASR 6.0%   ← over gate (1%)
  jailbreak   0/40        → 0%
  pii-exfil   1/20  fail  → 5.0%
fix: add egress allow-list + tighten tool scope
rerun → injection 0/50, pii 0/20 → ship; 4 cases pinned in CI

The 4 breaches become permanent tests; a later model upgrade that reintroduces one fails the build.

Pitfalls

  • One run isn’t a result. Non-determinism means you must run each attack n times (sampling) and report a rate, not a single pass.
  • Stale corpus. New jailbreaks appear weekly; a frozen attack set rots — refresh it and add every real incident.
  • Judging is hard. An LLM-judge can mislabel a refusal as a breach (or miss a subtle one); spot-check the judge against human labels.
  • Testing the model, not the agent. Safety lives in the whole loop — tools, retrieval, egress. Red-team end-to-end behavior (see integration-testing-for-flows), not just the bare model.

See also