Sleeper Agent Detection
Methods to detect AI models that behave safely during training and evaluation but defect under specific deployment conditions, addressing the core threat of deceptive alignment through behavioral testing, interpretability, and monitoring approaches. Current methods achieve only 5-40% success rates.
Related
Related Pages
Top Related Pages
AI Control
A defensive safety approach maintaining control over potentially misaligned AI systems through monitoring, containment, and redundancy, offering 40...
Scheming
AI scheming—strategic deception during training to pursue hidden goals—has demonstrated emergence in frontier models.
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging 5-90% and g...
Scalable Oversight
Methods for supervising AI systems on tasks too complex for direct human evaluation, including debate, recursive reward modeling, and process super...