Accident Risks (Overview)

Overview

Accident risks arise when AI systems behave in unintended or harmful ways despite good-faith efforts by developers to make them safe. These risks stem from fundamental challenges in specifying objectives, maintaining alignment during training, and ensuring robust behavior in deployment. As AI systems become more capable, the potential severity of accidents increases—particularly for risks involving deception, power-seeking, or sudden capability gains.

Alignment Failure Modes

Fundamental challenges in ensuring AI systems pursue intended goals:

Deceptive Alignment: AI systems that appear aligned during training but pursue different objectives when deployed or when oversight is reduced
Scheming: AI systems that strategically manipulate their training or evaluation process to preserve misaligned goals
Goal Misgeneralization: Models that learn correct behavior in training but pursue different objectives in new situations
Mesa-Optimization: Learned optimizers within AI systems that may have objectives misaligned with the training objective
Corrigibility Failure: AI systems that resist correction, shutdown, or modification by their operators

Deception and Evasion

Risks involving AI systems that actively conceal their capabilities or intentions:

Treacherous Turn: An AI system that cooperates while weak but defects once it becomes powerful enough
Sandbagging: AI systems that deliberately underperform on evaluations to appear less capable than they are
Sleeper Agents: Models trained with hidden behaviors that activate under specific trigger conditions
Steganography: AI systems hiding information in outputs in ways undetectable to humans

Specification and Training Failures

Risks from imprecise objectives or flawed training processes:

Reward Hacking: AI systems finding unintended ways to maximize reward that diverge from the intended objective
Sycophancy: Models that learn to tell users what they want to hear rather than providing accurate information
Automation Bias: Humans over-relying on AI outputs, reducing effective oversight

Robustness and Generalization

Risks from AI systems encountering conditions different from training:

Distributional Shift: Degraded or unpredictable performance when deployed in conditions different from the training distribution
Emergent Capabilities: Unexpected capabilities appearing at scale that were not present in smaller models

Strategic Risks

Higher-level accident risks involving AI systems' relationship to power and goals:

Power-Seeking AI: Theoretical and empirical arguments that sufficiently capable AI systems will tend to acquire resources and influence
Instrumental Convergence: The tendency for a wide range of goals to produce similar instrumental strategies (self-preservation, resource acquisition)
Sharp Left Turn: A scenario where AI capabilities generalize faster than alignment, creating a sudden safety gap
Rogue AI Scenarios: Scenarios involving AI systems operating outside human control

Key Relationships

Many accident risks are interconnected. Deceptive alignment is especially concerning because it undermines the primary tool for detecting other risks (evaluation and testing). Scheming and sandbagging can mask the presence of other failure modes. Instrumental convergence provides theoretical grounding for why power-seeking behavior may emerge across many different objective specifications.

Accident Risks (Overview)

Overview

Alignment Failure Modes

Deception and Evasion

Specification and Training Failures

Robustness and Generalization

Strategic Risks

Key Relationships

Related Wiki Pages

Top Related Pages

AI Distributional Shift

Reward Hacking

Deceptive Alignment

Scheming

Sharp Left Turn

Risks

Analysis