Deceptive Alignment
AccidentCatastrophicA form of misalignment where a model behaves aligned during training to avoid correction while pursuing misaligned goals when deployed. First formalized by Evan Hubinger et al. in the risks from learned optimization paper. Considered a key hard problem in AI safety.
Severity
Catastrophic
Likelihood
Medium
Time Horizon
~2035
Maturity
Growing
Full Wiki Article
Read the full wiki article for detailed analysis, background, and references.
Read wiki article →Related Entities6
safety-agenda
safety-agenda
safety-agenda
approach
organization
Sources3
Hubinger et al., 2019
AI Alignment Forum discussions on deceptive alignment
Assessment
SeverityCatastrophic
LikelihoodMedium
Time Horizon~2035
MaturityGrowing
CategoryAccident
Details
Key ConcernAI hides misalignment during training
Tags
mesa-optimizationinner-alignmentsituational-awarenessdeceptionai-safetyinterpretability