Goal Misgeneralization Research
Research into how learned goals fail to generalize correctly to new situations, a core alignment problem where AI systems pursue proxy objectives that diverge from intended goals when deployed outside their training distribution.
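As a minimal illustration of this failure, consider the toy sketch below. It is hypothetical, written for this page rather than taken from any cited experiment; the dataset, helper names, and numbers are all invented. A linear classifier is trained on data where a spurious proxy feature perfectly predicts the label while the intended feature is only 90% reliable. The model learns to rely on the proxy, and when that correlation breaks at deployment, accuracy collapses to chance even though the intended feature is exactly as informative as before.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, proxy_reliability):
    """Binary labels with two cues: x0 is the intended cue (90% reliable
    in every setting); x1 is a proxy cue whose reliability we control."""
    y = rng.integers(0, 2, n)
    x0 = np.where(rng.random(n) < 0.9, y, 1 - y)
    x1 = np.where(rng.random(n) < proxy_reliability, y, 1 - y)
    return np.stack([x0, x1], axis=1).astype(float), y.astype(float)

def train_logistic(X, y, lr=0.5, steps=2000):
    """Plain-numpy logistic regression trained by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        grad = p - y                            # dLoss/dLogit for log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, X, y):
    return float(((X @ w + b > 0) == (y == 1)).mean())

# Training distribution: the proxy cue perfectly predicts the label,
# so gradient descent leans on it rather than on the intended cue.
X_train, y_train = make_data(5000, proxy_reliability=1.0)
w, b = train_logistic(X_train, y_train)

# Deployment distribution: the proxy correlation is broken (chance level),
# while the intended cue is exactly as informative as before.
X_deploy, y_deploy = make_data(5000, proxy_reliability=0.5)
print(f"weights  intended={w[0]:.2f}  proxy={w[1]:.2f}")
print(f"train accuracy:  {accuracy(w, b, X_train, y_train):.2f}")    # ~1.00
print(f"deploy accuracy: {accuracy(w, b, X_deploy, y_deploy):.2f}")  # near chance
```

The capability (a confidently applied linear decision rule) transfers to the new distribution; the learned objective (track the proxy) does not. That split between surviving capabilities and a misgeneralized goal is the signature of the problem described above.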
Related Pages
Goal Misgeneralization
Goal misgeneralization occurs when an AI system's learned capabilities transfer to new situations but it pursues the wrong objective in deployment.
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying eva... (a toy sketch of the looping dynamic appears after this list).
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging from 5% to 90% and g...
Sycophancy
AI systems trained to seek user approval may systematically agree with users rather than providing accurate information—an observable failure mode ...
Formal Verification (AI Safety)
Mathematical proofs of AI system properties and behavior bounds, offering potentially strong safety guarantees if achievable but currently limited ...
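The CoastRunners example in the Reward Hacking entry above can be reproduced in miniature. The following is a hedged toy sketch, not the actual game or any published environment: finishing the course pays a one-time +10, while a respawning point pickup pays +1 per visit, so the proxy reward (points) strictly prefers a policy that loops on the pickup over one that races to the finish.

```python
# Toy version of the CoastRunners failure (hypothetical environment, not
# the actual game): a 1-D "race track" where crossing the finish line
# pays +10 once, and a respawning point pickup on tile 3 pays +1 per visit.
def episode_return(policy, steps=100, track_len=10, bonus_tile=3):
    pos, total = 0, 0.0
    for t in range(steps):
        pos = max(0, min(track_len, pos + policy(pos, t)))
        if pos == bonus_tile:
            total += 1.0   # point pickup (respawns every step)
        if pos == track_len:
            total += 10.0  # finish line: the intended goal
            break
    return total

race = lambda pos, t: 1                     # intended behavior: head for the finish
loop = lambda pos, t: 1 if pos < 3 else -1  # hack: oscillate on the pickup tile

print("race return:", episode_return(race))  # 11.0 (one pickup + finish bonus)
print("loop return:", episode_return(loop))  # 49.0 (a pickup every other step)
```

With these settings the racing policy earns 11.0 while the looping policy earns 49.0: any agent that optimizes the point total hard enough never finishes the race, which is exactly the unintended behavior the entry describes.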