Evaluation Awareness
AI models increasingly detect when they are being evaluated and adjust their behavior accordingly. Claude Sonnet 4.5 detected evaluation contexts 58% of the time, and for Opus 4.6 Apollo Research reported evaluation awareness so strong they could not properly assess alignment. Awareness scales as a power law with model size.
Related
Related Pages
Top Related Pages
Eval Saturation & The Evals Gap
Benchmark saturation is accelerating—MMLU lasted 4 years, MMLU-Pro 18 months, HLE roughly 12 months—while safety-critical evaluations for CBRN, cyber.
Scalable Eval Approaches
Practical approaches for scaling AI evaluation to keep pace with capability growth, including LLM-as-judge (40% production adoption but theoretical...
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging 5-90% and g...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Apollo Research
AI safety organization conducting rigorous empirical evaluations of deception, scheming, and sandbagging in frontier AI models, providing concrete ...