Deceptive Alignment

AccidentCatastrophic

A form of misalignment where a model behaves aligned during training to avoid correction while pursuing misaligned goals when deployed. First formalized by Evan Hubinger et al. in the risks from learned optimization paper. Considered a key hard problem in AI safety.

Wiki page →KB data →

Severity

Catastrophic

Likelihood

Medium

Time Horizon

~2035

Maturity

Growing

Full Wiki Article

Read the full wiki article for detailed analysis, background, and references.

Read wiki article →

Related Entities6

Mesa-Optimization

risk

Interpretability

safety-agenda

safety-agenda

Scalable Oversight

safety-agenda

approach

organization

Sources3

Risks from Learned Optimization ↗

Hubinger et al., 2019

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training ↗

Anthropic, 2024

AI Alignment Forum discussions on deceptive alignment

Assessment

SeverityCatastrophic

LikelihoodMedium

Time Horizon~2035

MaturityGrowing

CategoryAccident

Details

Key ConcernAI hides misalignment during training

Tags

mesa-optimizationinner-alignmentsituational-awarenessdeceptionai-safetyinterpretability

Quick Links

Wiki page →View in KB explorer →All risks →