Longterm Wiki

Deceptive Alignment

AccidentCatastrophic

A form of misalignment where a model behaves aligned during training to avoid correction while pursuing misaligned goals when deployed. First formalized by Evan Hubinger et al. in the risks from learned optimization paper. Considered a key hard problem in AI safety.

Severity
Catastrophic
Likelihood
Medium
Time Horizon
~2035
Maturity
Growing

Full Wiki Article

Read the full wiki article for detailed analysis, background, and references.

Read wiki article →

Related Entities6

safety-agenda
safety-agenda
safety-agenda
organization

Sources3

AI Alignment Forum discussions on deceptive alignment

Assessment

SeverityCatastrophic
LikelihoodMedium
Time Horizon~2035
MaturityGrowing
CategoryAccident

Details

Key ConcernAI hides misalignment during training

Tags

mesa-optimizationinner-alignmentsituational-awarenessdeceptionai-safetyinterpretability

Quick Links