Sleeper Agents: Training Deceptive LLMs

Accident

Anthropic's 2024 research demonstrating that large language models can be trained to exhibit persistent deceptive behavior that survives standard safety training techniques including supervised fine-tuning, RLHF, and adversarial training.

Wiki page →KB data →

Full Wiki Article

Read the full wiki article for detailed analysis, background, and references.

Read wiki article →

Related Entities5

Deceptive Alignment

risk

risk

person

organization

Situational Awareness

concept

Assessment

CategoryAccident

Tags

deceptive-alignmentbackdoor-attackssafety-training-failureadversarial-traininganthropic-research

Quick Links

Wiki page →View in KB explorer →All risks →