Security / Safety
Lessons in this group, roughly in build order:
- prompt-injection-jailbreaks — Attacks that smuggle adversarial instructions into a model’s context so it ignores its system prompt —…
- data-privacy-pii-redaction — Detecting and stripping personally identifiable information before it reaches a third-party model, gets…
- bias-toxicity-guardrails — Runtime filters that sit on the agent’s input and output to block disallowed content — hate, harassment,…
- safety-red-team-testing — Deliberately attacking your own agent — with adversarial prompts, injected content, and edge cases — to…