
Eval Saturation & The Evals Gap

Benchmark saturation is accelerating: MMLU remained unsaturated for about 4 years, MMLU-Pro for about 18 months, and HLE for roughly 12 months. Meanwhile, safety-critical evaluations for CBRN, cyber, and AI R&D capabilities are losing signal at frontier labs, raising questions about whether evaluation-based governance frameworks can keep pace with capability growth.
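As a rough illustration of the acceleration claim, the sketch below (Python) compares the approximate benchmark lifetimes quoted above. The constant-ratio extrapolation at the end is an assumption added for illustration, not a forecast made anywhere on this page:

```python
# Illustrative only: approximate useful lifetimes (in months) from the
# summary above. "Useful lifetime" here means roughly the time from a
# benchmark's release until frontier models approach ceiling performance;
# exact dates vary by source.
lifespans_months = {
    "MMLU": 48,       # ~4 years
    "MMLU-Pro": 18,   # ~18 months
    "HLE": 12,        # ~12 months (Humanity's Last Exam)
}

names = list(lifespans_months)
for prev, curr in zip(names, names[1:]):
    ratio = lifespans_months[curr] / lifespans_months[prev]
    print(f"{curr} lasted {ratio:.0%} as long as {prev}")

# Assumed extrapolation: if the most recent shrinkage ratio (12/18) held,
# a successor benchmark would saturate in roughly 8 months.
next_estimate = lifespans_months["HLE"] * (
    lifespans_months["HLE"] / lifespans_months["MMLU-Pro"]
)
print(f"Naive constant-ratio extrapolation for a successor: ~{next_estimate:.0f} months")
```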

Related Pages

Safety Research

Anthropic Core Views

Risks

Scheming, Multipolar Trap (AI Development)

Analysis

OpenAI Foundation Governance Paradox, Long-Term Benefit Trust (Anthropic)

Organizations

Apollo Research, METR, UK AI Safety Institute, Epoch AI, Center for AI Safety

Other

Interpretability, Yoshua Bengio, Dan Hendrycks

Concepts

Alignment Evaluation Overview, Governance-Focused Worldview, Reasoning and Planning, Situational Awareness

Policy

Voluntary AI Safety Commitments

Key Debates

AI Safety Solution Cruxes, Corporate Influence on AI Policy

Tags

benchmarks, evaluation-gap, responsible-scaling, safety-evals, governance