Security / Safety

Lessons in this group, roughly in build order:

prompt-injection-jailbreaks — Attacks that smuggle adversarial instructions into a model’s context so it ignores its system prompt —…
data-privacy-pii-redaction — Detecting and stripping personally identifiable information before it reaches a third-party model, gets…
bias-toxicity-guardrails — Runtime filters that sit on the agent’s input and output to block disallowed content — hate, harassment,…
safety-red-team-testing — Deliberately attacking your own agent — with adversarial prompts, injected content, and edge cases — to…
