Longterm Wiki

Sleeper Agents: Training Deceptive LLMs

Accident

Anthropic's 2024 research demonstrating that large language models can be trained to exhibit persistent deceptive behavior that survives standard safety training techniques including supervised fine-tuning, RLHF, and adversarial training.

Full Wiki Article

Read the full wiki article for detailed analysis, background, and references.

Read wiki article →

Related Entities5

Assessment

CategoryAccident

Tags

deceptive-alignmentbackdoor-attackssafety-training-failureadversarial-traininganthropic-research

Quick Links