Sleeper Agents: Training Deceptive LLMs
AccidentAnthropic's 2024 research demonstrating that large language models can be trained to exhibit persistent deceptive behavior that survives standard safety training techniques including supervised fine-tuning, RLHF, and adversarial training.
Full Wiki Article
Read the full wiki article for detailed analysis, background, and references.
Read wiki article →Related Entities5
risk
person
organization
concept
Assessment
CategoryAccident
Tags
deceptive-alignmentbackdoor-attackssafety-training-failureadversarial-traininganthropic-research