Sleeper Agent Detection
Evaluation
active
Detecting planted backdoors and conditional misbehavior in trained models.
Key Papers: 1
Tags
function:assurance, scope:technique
Key Papers & Resources (1)
SEMINAL
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Hubinger et al. (Anthropic), 2024