Security / Safety

Lessons in this group, roughly in build order:

  • prompt-injection-jailbreaks — Attacks that smuggle adversarial instructions into a model’s context so it ignores its system prompt —…
  • data-privacy-pii-redaction — Detecting and stripping personally identifiable information before it reaches a third-party model, gets…
  • bias-toxicity-guardrails — Runtime filters that sit on the agent’s input and output to block disallowed content — hate, harassment,…
  • safety-red-team-testing — Deliberately attacking your own agent — with adversarial prompts, injected content, and edge cases — to…