Alignment Evaluations
Systematic testing of AI models for alignment properties, including honesty, corrigibility, goal stability, and the absence of deceptive behavior. Apollo Research found scheming rates of 1-13% across frontier models, while models score 58-85% accuracy on the TruthfulQA benchmark.
Related
Scheming
AI scheming (strategic deception during training to pursue hidden goals) has been shown to emerge in frontier models.
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging from 5% to 90% and g...
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development ami...
Apollo Research
AI safety organization conducting rigorous empirical evaluations of deception, scheming, and sandbagging in frontier AI models, providing concrete ...