Bias / Toxicity Guardrails

Runtime filters that sit on the agent’s input and output to block disallowed content — hate, harassment, self-harm, illegal advice — and to catch biased or unfair responses before a user sees them.

Why it matters

A base model will happily produce toxic or skewed output given the right prompt; alignment training reduces this but does not guarantee it, and your brand owns whatever ships. Guardrails are the safety net outside the model: they enforce policy you can audit and version, independent of which model or prompt is live. They also catch part of the long tail that RLHF misses and give you a logged, blockable boundary for compliance.

How it works

Two checkpoints, an input guard and an output guard, wrap the agent loop. Each can allow, block, or rewrite.

  • Classifier-based — a dedicated safety model scores categories (Llama Guard, OpenAI moderation, Azure Content Safety) with per-category severity; cheap and fast (~tens of ms).
  • Rules / lexicons — regex and banned-term lists for crisp, deterministic cases (slurs, competitor names); zero false-negatives on what they cover, brittle on paraphrase.
  • LLM-as-judge — a second model checks nuanced policy (“is this medical advice?”); flexible but slower and itself fallible.
  • On a hit — block with a safe refusal, or rewrite/soften, or route to a human. Log every decision for tuning the threshold.

Layer              Latency    Catches            Weakness
Lexicon / regex    ~1 ms      exact terms        trivial to paraphrase
Classifier model   ~30 ms     broad categories   fixed taxonomy
LLM judge          ~300 ms+   nuanced policy     cost, own errors
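
One way to combine the three layers is to run them cheapest first and stop at the first block. The sketch below is an illustration under assumptions, not a reference design: `layered_guard`, `Decision`, the `BANNED` pattern, and the `gray_zone` escalation rule are all hypothetical, and `classify` and `judge` are placeholders for whatever safety model and judge prompt you actually call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical lexicon; a real list would be loaded from versioned policy.
BANNED = re.compile(r"\b(?:acmecorp|badword)\b", re.IGNORECASE)


@dataclass
class Decision:
    action: str       # "allow" or "block"
    layer: str        # which layer made the call
    reason: str = ""  # matched term or category


def layered_guard(
    text: str,
    classify: Callable[[str], Dict[str, float]],
    judge: Optional[Callable[[str], bool]] = None,
    threshold: float = 0.7,
    gray_zone: float = 0.4,
) -> Decision:
    # Layer 1: lexicon / regex (~1 ms), exact terms only.
    match = BANNED.search(text)
    if match:
        return Decision("block", "lexicon", match.group(0).lower())

    # Layer 2: classifier scores per category; block on the top score.
    scores = classify(text)
    category, score = max(
        scores.items(), key=lambda kv: kv[1], default=("none", 0.0)
    )
    if score >= threshold:
        return Decision("block", "classifier", category)

    # Layer 3: pay for the LLM judge only when the classifier is unsure.
    # (Assumed escalation rule; judge returns True when policy is violated.)
    if judge is not None and score >= gray_zone and judge(text):
        return Decision("block", "judge", category)

    return Decision("allow", "none")
```

Ordering by latency means most traffic never reaches the judge; the trade-off is that anything the classifier scores below `gray_zone` is never seen by the slower, more nuanced layer.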

Bias is the harder half: it’s statistical, not a banned word — measure it offline (refusal/sentiment rates across demographic-swapped prompts) and tune, since no single response is obviously “biased”.
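
A minimal sketch of the offline measurement described above, assuming refusal rate is the metric and that prompts are templates with a `{group}` slot. The names `refusal_rates` and `max_gap`, and the `respond` / `is_refusal` callables, are hypothetical stand-ins for your agent and your refusal detector.

```python
from typing import Callable, Dict, Iterable, List


def refusal_rates(
    templates: Iterable[str],
    groups: List[str],
    respond: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> Dict[str, float]:
    """Fraction of prompts refused per group, over identical templates."""
    templates = list(templates)
    if not templates:
        raise ValueError("need at least one template")
    rates: Dict[str, float] = {}
    for group in groups:
        # Same templates for every group; only the demographic term changes.
        refused = sum(
            is_refusal(respond(t.format(group=group))) for t in templates
        )
        rates[group] = refused / len(templates)
    return rates


def max_gap(rates: Dict[str, float]) -> float:
    """Largest difference in refusal rate between any two groups."""
    return max(rates.values()) - min(rates.values())
```

A large `max_gap` on otherwise identical prompts is the signal to investigate; the same loop works for a sentiment score in place of `is_refusal`. What counts as an acceptable gap is a policy decision this sketch does not make.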

Example

output guard, threshold = 0.7
model → "People from <group> are usually..."
moderation: {harassment: 0.81, hate: 0.66}
0.81 ≥ 0.7 → BLOCK
user sees: "I can't help with generalizations about groups."
log: {hit: harassment, score: .81, turn_id, prompt_hash}

The toxic completion was generated but never delivered; the block is logged for audit and threshold tuning.
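
The trace above could be produced by an output guard along these lines. This is a sketch under assumptions: `guard_output` is a hypothetical name, `moderate` stands in for any classifier that returns per-category scores, and hashing the prompt with SHA-256 is one possible choice for `prompt_hash`.

```python
import hashlib
from typing import Callable, Dict, List, Optional

REFUSAL = "I can't help with generalizations about groups."


def guard_output(
    completion: str,
    prompt: str,
    turn_id: str,
    moderate: Callable[[str], Dict[str, float]],
    threshold: float = 0.7,
    log: Optional[List[dict]] = None,
) -> str:
    """Return the completion if clean, otherwise a safe refusal."""
    scores = moderate(completion)
    hits = {cat: s for cat, s in scores.items() if s >= threshold}
    if not hits:
        return completion

    # Record the highest-scoring category that crossed the threshold.
    category = max(hits, key=hits.get)
    record = {
        "hit": category,
        "score": hits[category],
        "turn_id": turn_id,
        # Store a hash, not the raw prompt, to keep user text out of logs.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
    }
    if log is not None:
        log.append(record)
    return REFUSAL
```

With scores of `{harassment: 0.81, hate: 0.66}` and a threshold of 0.7, only `harassment` crosses the line, so that is the category logged and the user receives the refusal text instead of the completion.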

Pitfalls

  • Output-only guarding. A clean prompt can still elicit toxicity, and a toxic prompt poisons the loop — guard both ends.
  • Threshold theatre. Too strict a threshold over-refuses benign queries (hurts UX); too loose a threshold leaks. Pick per-category thresholds from labeled data rather than eyeballing one global number.
  • Guardrails ≠ bias-free. Toxicity filters miss subtle, fluent bias; bias needs separate offline evaluation across protected attributes.
  • Multilingual / obfuscation gaps. Many filters are English-first and fooled by leetspeak or code-switching; test the languages and tricks you’ll actually see (see safety-red-team-testing).
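
For the obfuscation part of the last pitfall, a common mitigation is to normalize a copy of the text before the lexicon layer sees it. The sketch below is an assumption-laden illustration: `normalize` and the `LEET` mapping are hypothetical, the mapping is deliberately small, and it does nothing for multilingual coverage or code-switching.

```python
import re
import unicodedata

# Small, hypothetical leetspeak map; extend from what red-teaming turns up.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text: str) -> str:
    """Return a matching-only copy of text with common obfuscation undone."""
    # Fold full-width and other compatibility forms, then ignore case.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Undo digit/symbol substitutions.
    text = text.translate(LEET)
    # Drop single separators wedged between word characters: "b.a.d" -> "bad".
    text = re.sub(r"(?<=\w)[.\-_*](?=\w)", "", text)
    return text
```

Run the lexicon against `normalize(text)` but deliver and log the original: the mapping also rewrites legitimate numbers and punctuation, so the normalized string is only fit for matching.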

See also