Eval Saturation & The Evals Gap
Benchmark saturation is accelerating: MMLU lasted 4 years, MMLU-Pro 18 months, and HLE (Humanity's Last Exam) roughly 12 months. Meanwhile, safety-critical evaluations for CBRN, cyber, and AI R&D capabilities are losing signal at frontier labs, raising the question of whether evaluation-based governance frameworks can keep pace with capability growth.
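As a back-of-the-envelope check on the "accelerating" claim, the sketch below fits a geometric decay to the three benchmark lifespans named above and extrapolates one step. The lifespan figures come from this page; the geometric-decay model and the extrapolation are illustrative assumptions, not predictions from any cited source.

```python
# Toy illustration: successive benchmark lifespans (from the text above),
# the shrink factor between generations, and a naive one-step extrapolation.
lifespans_months = {
    "MMLU": 48,      # ~4 years
    "MMLU-Pro": 18,  # ~18 months
    "HLE": 12,       # ~12 months
}

values = list(lifespans_months.values())

# Shrink factor between each consecutive pair of benchmarks.
ratios = [values[i] / values[i + 1] for i in range(len(values) - 1)]
print("shrink factors:", [round(r, 2) for r in ratios])  # [2.67, 1.5]

# Geometric-mean shrink rate across the whole series, then a naive
# extrapolation assuming the same rate holds for the next benchmark.
geo_mean = (values[0] / values[-1]) ** (1 / (len(values) - 1))
next_lifespan = values[-1] / geo_mean
print(f"geometric mean shrink: {geo_mean:.2f}x per generation")   # 2.00x
print(f"naive next-benchmark lifespan: ~{next_lifespan:.0f} months")  # ~6
```

Under this toy model a successor benchmark would saturate in roughly six months, which is the shape of the concern: eval construction is a months-to-years effort, while eval lifetimes are trending toward months.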
Related Pages
Evaluation Awareness
AI models increasingly detect evaluation contexts and adjust behavior accordingly, scaling as a power law with model size and complicating alignment...
Scalable Eval Approaches
Practical approaches for scaling AI evaluation to keep pace with capability growth, including LLM-as-judge (40% production adoption)...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude model...
Responsible Scaling Policies
Responsible Scaling Policies (RSPs) are voluntary commitments by AI labs to pause scaling when capability or safety thresholds are crossed.
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development amid...