Longterm Wiki

Sleeper Agent Detection

Methods to detect AI models that behave safely during training and evaluation but defect under specific deployment conditions, addressing the core threat of deceptive alignment through behavioral testing, interpretability, and monitoring approaches. Current methods achieve only 5-40% success rates.

Related

Related Pages

Top Related Pages

Safety Research

Anthropic Core Views

Risks

Sleeper Agents: Training Deceptive LLMs

Analysis

Model Organisms of MisalignmentAI Safety Technical Pathway Decomposition

Approaches

Constitutional AIScheming & Deception DetectionSparse Autoencoders (SAEs)AI AlignmentAI-Human Hybrid Systems

Other

Paul ChristianoDario Amodei

Concepts

Alignment Evaluation OverviewAgentic AILong-Horizon Autonomous TasksDense Transformers

Organizations

Redwood Research

Key Debates

AI Alignment Research AgendasTechnical AI Safety Research

Historical

Deep Learning Revolution EraMainstream Era

Tags

sleeper-agentsbackdoor-detectiondeceptive-alignmentinterpretabilityai-control