AI Alignment
Technical approaches to ensuring AI systems pursue intended goals and remain aligned with human values throughout training and deployment. Current methods show promise but face fundamental scalability challenges, with oversight success dropping to 52% at 400 Elo capability gaps.
Related
Related Pages
Top Related Pages
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development ami...
Constitutional AI
Anthropic's Constitutional AI (CAI) methodology uses explicit principles and AI-generated feedback to train safer language models, demonstrating 3-...
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying eva...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Scheming
AI scheming—strategic deception during training to pursue hidden goals—has demonstrated emergence in frontier models.