Eval Saturation & The Evals Gap
Benchmark saturation is accelerating: MMLU lasted 4 years, MMLU-Pro 18 months, and HLE (Humanity's Last Exam) roughly 12 months. Meanwhile, safety-critical evaluations for CBRN, cyber, and AI R&D capabilities are losing signal at frontier labs, raising the question of whether evaluation-based governance frameworks can keep pace with capability growth.
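As a back-of-the-envelope check on the "accelerating" claim, the sketch below fits a geometric decay to the three benchmark lifespans named above and extrapolates one step. The lifespan figures come from this page; the geometric-decay model and the extrapolation are illustrative assumptions, not predictions from any cited source.

```python
# Toy illustration: successive benchmark lifespans (from the text above),
# the shrink factor between generations, and a naive one-step extrapolation.
lifespans_months = {
    "MMLU": 48,      # ~4 years
    "MMLU-Pro": 18,  # ~18 months
    "HLE": 12,       # ~12 months
}

values = list(lifespans_months.values())

# Shrink factor between each consecutive pair of benchmarks.
ratios = [values[i] / values[i + 1] for i in range(len(values) - 1)]
print("shrink factors:", [round(r, 2) for r in ratios])  # [2.67, 1.5]

# Geometric-mean shrink rate across the whole series, then a naive
# extrapolation assuming the same rate holds for the next benchmark.
geo_mean = (values[0] / values[-1]) ** (1 / (len(values) - 1))
next_lifespan = values[-1] / geo_mean
print(f"geometric mean shrink: {geo_mean:.2f}x per generation")   # 2.00x
print(f"naive next-benchmark lifespan: ~{next_lifespan:.0f} months")  # ~6
```

Under this toy model a successor benchmark would saturate in roughly six months, which is the shape of the concern: eval construction is a months-to-years effort, while eval lifetimes are trending toward months.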
Related Pages
Evaluation Awareness
AI models increasingly detect evaluation contexts and adjust behavior accordingly, scaling as a power law with model size and complicating alignment...
Scalable Eval Approaches
Practical approaches for scaling AI evaluation to keep pace with capability growth, including LLM-as-judge (40% production adoption)...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude model...
Responsible Scaling Policies
Responsible Scaling Policies (RSPs) are voluntary commitments by AI labs to pause scaling when capability or safety thresholds are crossed.
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development amid...