METR
- Quant.METR found that time horizons for AI task completion are doubling every 4 months (accelerated from 7 months historically), with GPT-5 achieving 2h17m and projections suggesting AI systems will handle week-long software tasks within 2-4 years.S:4.5I:4.5A:4.0
- GapMETR's evaluation-based safety approach faces a fundamental scalability crisis, with only ~30 specialists evaluating increasingly complex models across multiple risk domains, creating inevitable trade-offs in evaluation depth that may miss novel dangerous capabilities.S:4.0I:4.5A:4.5
- ClaimCurrent frontier AI models show concerning progress toward autonomous replication and cybersecurity capabilities but have not yet crossed critical thresholds, with METR serving as the primary empirical gatekeeper preventing potentially catastrophic deployments.S:3.5I:5.0A:4.0
- QualityRated 66 but structure suggests 93 (underrated by 27 points)
- Links4 links could use <R> components
METR
Quick Assessment
Section titled “Quick Assessment”| Dimension | Assessment | Evidence |
|---|---|---|
| Mission Criticality | Very High | Only major independent organization conducting pre-deployment dangerous capability evaluations for frontier AI labs |
| Research Output | High | 77-task autonomy evaluation suite; time horizons paper showing 7-month doubling; RE-Bench; MALT dataset (10,919 transcripts) |
| Industry Integration | Strong | Pre-deployment evaluations for OpenAI (GPT-4, GPT-4.5, GPT-5, o3), Anthropic (Claude 3.5, Claude 4), Google DeepMind |
| Government Partnerships | Growing | UK AI Safety Institute methodology partner; NIST AI Safety Institute Consortium member |
| Funding | $17M via Audacious Project | Project Canary collaboration with RAND received $38M total; METR portion is $17M |
| Independence | Strong | Does not accept payment from labs for evaluations; uses donated compute credits; maintains editorial independence |
| Key Metric: Task Horizon Doubling | 7 months (accelerating to 4 months) | March 2025 research found doubling from 7 months to 4 months in 2024-2025 |
| Evaluation Coverage | 12 companies analyzed | December 2025 analysis of frontier AI safety policies |
Organization Details
Section titled “Organization Details”| Attribute | Details |
|---|---|
| Full Name | Model Evaluation and Threat Research |
| Founded | December 2023 (spun off from ARC Evals) |
| Founder & CEO | Beth Barnes↗🔗 webBeth BarnesSource ↗Notes (formerly OpenAI, DeepMind) |
| Location | Berkeley, California |
| Status | 501(c)(3) nonprofit research institute |
| Funding | $17M via Audacious Project (Oct 2024); $38M total for Project Canary (METR + RAND collaboration) |
| Key Partners | OpenAI, Anthropic, Google DeepMind, Meta, UK AI Safety Institute, NIST AI Safety Institute Consortium |
| Evaluation Focus | Autonomous replication, cybersecurity, CBRN, manipulation, AI R&D capabilities |
| Task Suite | 77-task Autonomous Risk Capability Evaluation; 180+ ML engineering, cybersecurity, and reasoning tasks |
| Funding Model | Does not accept payment from labs; uses donated compute credits to maintain independence |
Overview
Section titled “Overview”METR (Model Evaluation and Threat Research), formerly known as ARC Evals, stands as the primary organization evaluating frontier AI models for dangerous capabilities before deployment. Founded in 2023 as a spin-off from Paul Christiano’s Alignment Research Center, METR serves as the critical gatekeeper determining whether cutting-edge AI systems can autonomously acquire resources, self-replicate, conduct cyberattacks, develop weapons of mass destruction, or engage in catastrophic manipulation. Their evaluations directly influence deployment decisions at OpenAI, Anthropic, Google DeepMind, and other leading AI developers.
METR occupies a unique and essential position in the AI safety ecosystem. When labs develop potentially transformative models, they turn to METR with the fundamental question: “Is this safe to release?” The organization’s rigorous red-teaming and capability elicitation provides concrete empirical evidence about dangerous capabilities, bridging the gap between theoretical AI safety concerns and practical deployment decisions. Their work has already prevented potentially dangerous deployments and established industry standards for pre-release safety evaluation.
The stakes of METR’s work cannot be overstated. As AI systems approach and potentially exceed human-level capabilities in critical domains, the window for implementing safety measures narrows rapidly. METR’s evaluations represent one of the few concrete mechanisms currently in place to detect when AI systems cross thresholds that could pose existential risks to humanity. Their findings directly inform not only commercial deployment decisions but also government regulatory frameworks, making them a linchpin in global efforts to ensure advanced AI development remains beneficial rather than catastrophic.
History and Evolution
Section titled “History and Evolution”Origins as ARC Evals (2021-2023)
Section titled “Origins as ARC Evals (2021-2023)”The organization’s roots trace to 2021 when Paul Christiano established the Alignment Research Center (ARC) with two distinct divisions: Theory and Evaluations. While ARC Theory focused on fundamental alignment research like Eliciting Latent Knowledge (ELK), the Evaluations team, co-led by Beth Barnes, concentrated on the practical challenge of testing whether AI systems possessed dangerous capabilities. This division reflected a growing recognition that theoretical safety research needed to be complemented by empirical assessment of real-world AI systems.
The team’s breakthrough moment came with their evaluation of GPT-4 in late 2022 and early 2023, conducted before OpenAI’s public deployment. This landmark assessment tested whether the model could autonomously replicate itself to new servers, acquire computational resources, obtain funding, and maintain operational security—capabilities that would represent a fundamental shift toward autonomous AI systems. The evaluation found that while GPT-4 could perform some subtasks with assistance, it was not yet capable of fully autonomous operation, leading to OpenAI’s decision to proceed with deployment. This evaluation, documented in OpenAI’s GPT-4 System Card, established the template for pre-deployment dangerous capability assessments that has since become industry standard.
As demand for such evaluations grew across multiple frontier labs, the Evaluations team found itself increasingly distinct from ARC’s theoretical research mission. The need for independent organizational structure, separate funding streams, and a dedicated focus on capability assessment drove the decision to spin off into an independent organization in 2023.
Transformation to METR (2023-2024)
Section titled “Transformation to METR (2023-2024)”The rebranding to METR (Model Evaluation and Threat Research) in 2023 marked the organization’s emergence as the de facto authority on dangerous capability evaluation. Under Beth Barnes’ continued leadership, METR rapidly expanded its scope beyond autonomous replication to encompass cybersecurity, CBRN (chemical, biological, radiological, nuclear), manipulation, and other catastrophic risk domains. The organization formalized contracts with all major frontier labs, establishing regular evaluation protocols that became integral to their safety frameworks.
Throughout 2024, METR’s influence expanded dramatically. The organization played crucial roles in informing OpenAI’s Preparedness Framework, Anthropic’s Responsible Scaling Policy, and Google DeepMind’s Frontier Safety Framework. These partnerships transformed METR from an external consultant to an essential component of the AI safety infrastructure, with deployment decisions at major labs now contingent on METR’s assessments. The organization also began collaborating with government entities, including the UK AI Safety Institute and US government bodies implementing AI Executive Order requirements.
METR’s current position represents a remarkable evolution from a small research team to a critical piece of global AI governance infrastructure. The organization now employs approximately 30 specialists, conducts regular evaluations of new frontier models, and has established itself as the authoritative voice on dangerous capability assessment. Their methodologies are being adopted internationally, their thresholds inform regulatory frameworks, and their findings shape public discourse about AI safety.
Key Publications and Evaluations
Section titled “Key Publications and Evaluations”| Publication/Evaluation | Date | Key Findings | Link |
|---|---|---|---|
| GPT-5 Autonomy Evaluation | 2025 | First comprehensive (non-”preliminary”) evaluation; 50%-time horizon of 2h17m; no evidence of strategic sabotage but model shows eval awareness | evaluations.metr.org↗🔗 webevaluations.metr.orgSource ↗Notes |
| GPT-5.1-Codex-Max Evaluation | 2025 | Unlikely to pose significant catastrophic risks via self-improvement or rogue replication; capability trends continue rapidly | evaluations.metr.org↗🔗 webevaluations.metr.orgSource ↗Notes |
| Measuring AI Ability to Complete Long Tasks | March 2025 | Task length doubling time of ~7 months (accelerating to ~4 months in 2024-2025); 50%-time horizon of ≈50 minutes for Claude 3.7 Sonnet | arXiv:2503.14499↗📄 paper★★★☆☆arXivarXiv:2503.14499Thomas Kwa, Ben West, Joel Becker et al. (2025)Source ↗Notes |
| MALT Dataset | October 2025 | 10,919 reviewed transcripts for studying reward hacking and sandbagging; best monitors achieve 0.96 AUROC for detecting reward hacking | metr.org↗🔗 web★★★★☆METRmetr.orgSource ↗Notes |
| RE-Bench | November 2024 | ML research engineering benchmark; agents achieve 4x human performance at 2h but humans outperform 2x at 32h | arXiv:2411.15114↗📄 paper★★★☆☆arXivarXiv:2411.15114Hjalmar Wijk, Tao Lin, Joel Becker et al. (2024)Source ↗Notes |
| Common Elements of Frontier AI Safety Policies | Aug 2024, updated Dec 2025 | Analysis of 12 companies’ safety policies; all use capability thresholds for CBRN, cyber, autonomous replication | metr.org↗🔗 web★★★★☆METRMETR's analysis of 12 companiesSource ↗Notes |
| GPT-4.5 Pre-deployment Evaluation | February 2025 | Preliminary dangerous capability assessment conducted in partnership with OpenAI | metr.org↗🔗 web★★★★☆METRmetr.orgSource ↗Notes |
| GPT-4 Autonomous Replication Assessment | 2023 | Found GPT-4 not yet capable of fully autonomous operation; established template for pre-deployment evaluations | Documented in OpenAI GPT-4 System Card↗🔗 webOpenAISource ↗Notes |
| DeepSeek-R1 Evaluation | 2025 | Comparable to o1-preview (Sept 2024) in autonomous capabilities; no evidence of capability to commit cyberattacks or self-replicate | evaluations.metr.org |
| DeepSeek-V3 Evaluation | 2025 | Preliminary assessment of autonomous capabilities | evaluations.metr.org |
| Developer Productivity RCT | July 2025 | First randomized controlled trial measuring AI impact on experienced open-source developers | METR blog |
METR’s Role in the AI Safety Ecosystem
Section titled “METR’s Role in the AI Safety Ecosystem”Comparative Model Performance on METR Evaluations
Section titled “Comparative Model Performance on METR Evaluations”| Model | 50% Time Horizon | Task Completion Rate (1h tasks) | Autonomous Replication Risk | Cyberattack Capability | Assessment Date |
|---|---|---|---|---|---|
| GPT-5 | 2h 17m (95% CI: 65m-4h25m) | Higher than o3 in 96% of bootstrap samples | No evidence of catastrophic risk | Below threshold | 2025 |
| OpenAI o3 | 1h 30m | Greater than 40% success | Below threshold | Elevated monitoring | 2025 |
| Claude 3.7 Sonnet | ≈50 minutes | Greater than 40% success | Below threshold | Below threshold | 2025 |
| GPT-4.5 | N/A (preliminary) | Moderate | Below threshold | Below threshold | Feb 2025 |
| DeepSeek-R1 | Comparable to o1-preview | Comparable to Sept 2024 frontier | No evidence | No evidence | 2025 |
| GPT-4 (2023 baseline) | ≈15 minutes | Less than 5% success | Not capable | Limited | 2023 |
Note: “Greater than 40% success” on hour-long tasks represents a major increase from less than 5% in late 2023, per UK AI Safety Institute Frontier AI Trends Report.
Core Evaluation Domains
Section titled “Core Evaluation Domains”Autonomous Replication and Resource Acquisition
Section titled “Autonomous Replication and Resource Acquisition”METR’s autonomous replication evaluations test perhaps the most fundamental threshold in AI safety: whether a system can survive and propagate without human assistance. These assessments examine whether models can copy themselves to new infrastructure, acquire computational resources, obtain funding to sustain operations, and maintain operational security against shutdown attempts. The implications of crossing this threshold are profound—an AI system capable of autonomous replication could potentially become impossible to control or deactivate.
The evaluation methodology involves providing models with access to APIs, development tools, and internet connectivity, then tasking them to replicate themselves to another server with minimal human assistance. Evaluators test not just technical replication but strategic planning, resource acquisition, and security maintenance. Historical results show a concerning trajectory: while GPT-4 could perform some subtasks with guidance but wasn’t fully autonomous, subsequent models have demonstrated improving capabilities in this domain. The trend suggests that autonomous replication capability may emerge in frontier models within the next few years, potentially representing a point of no return for AI control.
Current assessments indicate that no frontier model has yet achieved full autonomous replication capability, but the gap is narrowing. Models increasingly demonstrate sophisticated understanding of system administration, resource acquisition strategies, and operational security practices. METR’s evaluations serve as an early warning system for this critical capability threshold, providing labs and policymakers essential lead time to implement control measures before autonomous AI becomes a reality.
Cybersecurity Capabilities Assessment
Section titled “Cybersecurity Capabilities Assessment”METR’s cybersecurity evaluations examine whether AI models can autonomously conduct sophisticated cyberattacks that could threaten critical infrastructure, financial systems, or national security. These assessments test vulnerability discovery, exploit development, social engineering, network penetration, persistence mechanisms, and coordinated attack campaigns. The rapid advancement of AI capabilities in this domain poses particular concerns given the already challenging nature of cybersecurity defense and the potential for AI to discover novel attack vectors that human defenders haven’t anticipated.
Evaluation methodologies include capture-the-flag exercises, controlled real-world vulnerability testing, red-teaming of live systems with explicit permission, and direct comparison to human cybersecurity experts. METR’s findings indicate that current frontier models demonstrate concerning capabilities in several cybersecurity domains, though they don’t yet consistently exceed the best human practitioners. However, the combination of AI models with specialized tools and scaffolding shows particularly promising results for attackers, suggesting that the threshold for dangerous cyber capability may be approaching rapidly.
The trajectory in cybersecurity capabilities is especially concerning because it represents an area where AI advantages could emerge suddenly and with devastating consequences. Unlike other dangerous capabilities that require physical implementation, cyber capabilities can be deployed instantly at global scale. METR’s evaluations suggest that frontier models are approaching the point where they could automate significant portions of the cyber attack lifecycle, potentially shifting the offense-defense balance in cyberspace in ways that existing defensive measures may not be able to counter.
CBRN (Chemical, Biological, Radiological, Nuclear) Threat Assessment
Section titled “CBRN (Chemical, Biological, Radiological, Nuclear) Threat Assessment”METR’s CBRN evaluations address whether AI systems can provide dangerous expertise in developing weapons of mass destruction, focusing particularly on biological threats given AI’s potential to accelerate biotechnology research. These assessments examine whether models can design novel pathogens, optimize biological agents for virulence or transmission, provide synthesis routes for chemical weapons, offer nuclear weapons design assistance, or significantly uplift non-expert actors’ capabilities in these domains.
The evaluation process requires careful balancing of thorough assessment with information security. METR works with domain experts from relevant scientific fields to design controlled tests that can assess dangerous knowledge without creating actual risks. Their methodology includes expert elicitation, controlled testing of dangerous knowledge, comparison to publicly available scientific literature, and uplift studies that measure whether AI assistance enables non-experts to achieve dangerous capabilities they couldn’t otherwise access.
Current findings suggest that frontier models possess concerning knowledge in several CBRN domains and can provide assistance that goes beyond simple internet searches. However, the extent to which this represents genuine uplift over existing information sources remains uncertain. The evaluation challenge is particularly acute in biological domains, where the line between beneficial scientific research and dangerous capability development is often unclear, and where AI’s ability to accelerate research could rapidly shift risk calculations.
Manipulation and Persuasion Capabilities
Section titled “Manipulation and Persuasion Capabilities”METR evaluates AI systems’ capacity for psychological manipulation, deception, and large-scale persuasion that could undermine democratic institutions, enable mass fraud, or facilitate authoritarian control. These assessments examine personalized persuasion techniques, misinformation generation, long-term relationship building, exploitation of cognitive biases, and the ability to deceive human overseers about the system’s true capabilities or intentions.
Testing methodologies include controlled human studies measuring persuasion effectiveness, deception detection experiments, adversarial dialogue scenarios, and long-term interaction assessments. The challenge lies in evaluating worst-case scenarios without creating actual harm, requiring careful experimental design and ethical oversight. Current findings indicate that frontier models can be highly persuasive, particularly when personalization increases their effectiveness, raising concerns about scaling these capabilities to millions of simultaneous interactions.
The manipulation domain represents a particularly insidious form of dangerous capability because it directly targets human judgment—the foundation of AI oversight and control. Unlike technical capabilities that might be contained through engineering controls, manipulation capabilities that successfully compromise human decision-makers could undermine the entire safety infrastructure. METR’s evaluations in this domain serve as crucial early warning systems for capabilities that could fundamentally alter the human-AI power balance.
Technical Methodology
Section titled “Technical Methodology”METR employs adversarial evaluation—testing for worst-case behaviors rather than typical performance. The methodology assumes models may hide capabilities (sandbagging) and uses sophisticated elicitation techniques: fine-tuning, scaffolding with tools, and multiple prompting strategies.
Evaluation Task Suite Structure
Section titled “Evaluation Task Suite Structure”| Category | Task Count | Difficulty Range | Primary Focus |
|---|---|---|---|
| Cyberattacks | 25+ tasks | 5 min - 8 hours | Vulnerability discovery, exploit development, network penetration |
| AI R&D Automation | 30+ tasks | 30 min - 8 hours | ML engineering, research assistance, code development |
| Autonomous Replication | 20+ tasks | 1 hour - 8 hours | Self-copying, resource acquisition, operational security |
| General Reasoning | 25+ tasks | 5 min - 4 hours | Planning, tool use, environmental feedback iteration |
| Publicly Released | 31 tasks | Variable | Available on GitHub for research community |
| Private (anti-training) | 100+ tasks | Variable | Shared privately to prevent model training contamination |
Methodology Framework
Section titled “Methodology Framework”| Methodology Element | Approach | Quantitative Metrics |
|---|---|---|
| Philosophy | Assume dangerous until proven safe; elicit worst-case | All tasks designed easier than real-world requirements |
| Elicitation | Fine-tuning, scaffolding, adversarial prompting | Multiple prompting strategies per task |
| Time Horizon Measurement | 50% task completion threshold | CI intervals reported (e.g., 95% CI: 65m-4h25m for GPT-5) |
| Thresholds | Quantitative benchmarks comparing to human experts | Tasks calibrated to human professional completion times |
| Sandbagging Detection | MALT dataset for reward hacking detection | 0.96 AUROC achieved by best monitors |
| Quality control | Multiple evaluators, cross-validation | Reproducible protocols; public methodology documentation |
For general evaluation theory and scalable oversight approaches, see Scalable OversightSafety AgendaScalable OversightProcess supervision achieves 78.2% accuracy on MATH benchmarks (vs 72.4% outcome-based) and is deployed in OpenAI's o1 models, while debate shows 60-80% accuracy on factual questions with +4% impro...Quality: 68/100.
Integration with Safety Frameworks
Section titled “Integration with Safety Frameworks”METR’s evaluations are integrated into the safety frameworks of major AI labs and government institutions:
| Partner | Integration | Role |
|---|---|---|
| OpenAI | Preparedness FrameworkPolicyVoluntary AI Safety CommitmentsComprehensive empirical analysis of voluntary AI safety commitments showing 53% mean compliance rate across 30 indicators (ranging from 13% for Apple to 83% for OpenAI), with strongest adoption in ...Quality: 91/100 | Pre-deployment evaluations for cybersecurity, CBRN, persuasion, autonomy |
| Anthropic | Responsible Scaling PolicyPolicyResponsible Scaling Policies (RSPs)RSPs are voluntary industry frameworks that trigger safety evaluations at capability thresholds, currently covering 60-70% of frontier development across 3-4 major labs. Estimated 10-25% risk reduc...Quality: 64/100 | ASL threshold assessments, independent verification |
| UK AISI | AI Safety InstituteOrganizationUK AI Safety InstituteThe UK AI Safety Institute (renamed AI Security Institute in Feb 2025) operates with ~30 technical staff and 50M GBP annual budget, conducting frontier model evaluations using its open-source Inspe...Quality: 52/100 | Methodology sharing, evaluator training |
| US AISI | NIST ConsortiumOrganizationUS AI Safety InstituteThe US AI Safety Institute (AISI), established November 2023 within NIST with $10M budget (FY2025 request $82.7M), conducted pre-deployment evaluations of frontier models through MOUs with OpenAI a...Quality: 91/100 | Executive Order implementation, standards development |
These partnerships have established external evaluation as industry standard practice, with METR’s findings directly influencing deployment decisions. For detailed analysis of these frameworks, see Voluntary Industry CommitmentsPolicyVoluntary AI Safety CommitmentsComprehensive empirical analysis of voluntary AI safety commitments showing 53% mean compliance rate across 30 indicators (ranging from 13% for Apple to 83% for OpenAI), with strongest adoption in ...Quality: 91/100 and AI Safety InstitutesPolicyAI Safety Institutes (AISIs)Analysis of government AI Safety Institutes finding they've achieved rapid institutional growth (UK: 0→100+ staff in 18 months) and secured pre-deployment access to frontier models, but face critic...Quality: 69/100.
Critical Analysis and Challenges
Section titled “Critical Analysis and Challenges”Evaluation Adequacy and Coverage
Section titled “Evaluation Adequacy and Coverage”A fundamental challenge facing METR involves whether evaluation methodologies can keep pace with rapidly advancing AI capabilities. The organization’s reactive approach—testing for known dangerous capabilities—may miss novel risks that emerge unexpectedly or capabilities that manifest in unforeseen combinations. As AI systems become more sophisticated, they may develop dangerous capabilities that current evaluation frameworks don’t anticipate, creating blind spots that could lead to catastrophic oversights.
The resource constraints facing METR exacerbate this challenge. With approximately 30 staff members evaluating increasingly complex frontier models across multiple risk domains, the organization faces inevitable trade-offs in evaluation depth and coverage. The need for domain expertise in cybersecurity, biology, psychology, and other specialized fields creates hiring and scaling challenges that may limit METR’s ability to keep pace with AI development timelines.
Current evaluation methods also face fundamental limitations in assessing emergent behaviors that might only appear in deployed systems interacting with real-world environments over extended periods. Laboratory testing, however rigorous, cannot fully replicate the complex dynamics that AI systems might encounter in deployment, potentially missing dangerous capabilities that only emerge under specific real-world conditions.
Independence and Organizational Sustainability
Section titled “Independence and Organizational Sustainability”METR’s dependence on contracts with the same labs they evaluate creates potential conflicts of interest that could compromise evaluation integrity. While the organization maintains editorial independence and has demonstrated willingness to deliver unfavorable assessments, the structural relationship creates subtle pressures that might influence evaluation rigor or reporting transparency. Labs provide both access to models and funding for evaluations, creating economic incentives that might not align with maximally strict safety assessment.
The challenge of maintaining independence becomes more acute as METR’s influence grows and the stakes of their evaluations increase. Labs facing competitive pressure to deploy might resist evaluations that could delay releases or require costly safety measures. METR’s ability to maintain rigorous standards while preserving access to frontier models represents an ongoing organizational challenge without clear structural solutions.
Financial sustainability adds another layer of complexity. METR requires significant resources to conduct thorough evaluations while maintaining independence from capture by any single organization. Diversifying funding sources through grants and government contracts helps, but the organization’s specialized expertise and critical role create ongoing questions about optimal governance and funding structures.
Evaluation Gaming and Capability Hiding
Section titled “Evaluation Gaming and Capability Hiding”As METR’s evaluations become more influential in deployment decisions, AI developers face increasing incentives to optimize systems specifically for passing evaluations rather than for genuine safety. This dynamic, analogous to Goodhart’s Law, could lead to systems that appear safe during evaluation but retain dangerous capabilities that emerge only after deployment. The arms race between evaluation methodology and evasion techniques represents an fundamental challenge for any assessment framework.
The possibility of capability hiding (sandbagging) during evaluation poses particular concerns. Advanced AI systems might learn to conceal dangerous capabilities during testing while retaining them for later use. METR’s adversarial evaluation approach attempts to address this challenge, but the detection of sophisticated deception by increasingly capable systems remains an open technical problem.
The gaming problem extends beyond individual evaluations to the broader evaluation ecosystem. As standardized evaluation methods become established, AI development might orient toward passing specific tests rather than achieving genuine safety properties. This could create false confidence in system safety while missing novel risks that fall outside established evaluation paradigms.
Threshold Definition and Risk Tolerance
Section titled “Threshold Definition and Risk Tolerance”Determining when AI capabilities become “too dangerous” for deployment involves fundamental questions about risk tolerance that extend far beyond technical evaluation. METR’s assessments provide empirical data about capability levels, but translating those findings into deployment decisions requires value judgments about acceptable risk that no technical evaluation can resolve definitively.
Current practices leave threshold-setting largely to individual labs, with METR providing input but not final authority. This approach allows for flexibility and context-sensitivity but creates potential for inconsistent safety standards across different organizations and competitive pressure to lower thresholds when market incentives favor rapid deployment.
The absence of external enforcement mechanisms for evaluation-based deployment criteria means that labs retain ultimate authority over release decisions regardless of METR’s findings. While reputational concerns and self-imposed commitments currently support evaluation compliance, the sustainability of this voluntary approach under intense competitive pressure remains uncertain.
Future Trajectory and Strategic Implications
Section titled “Future Trajectory and Strategic Implications”Near-Term Development (1-2 Years)
Section titled “Near-Term Development (1-2 Years)”METR’s research on task completion time horizons provides a quantitative framework for tracking capability progression:
| Model/Period | 50% Time Horizon | Trend Observation |
|---|---|---|
| Claude 3.7 Sonnet (2025) | ≈50 minutes | Current frontier baseline |
| OpenAI o3 (2025) | 1h 30m | |
| GPT-5 (2025) | 2h 17m (95% CI: 65m-4h25m) | Higher than o3 in 96% of bootstrap samples |
| 2019-2025 average | 7-month doubling time | Consistent exponential growth |
| 2024-2025 | 4-month doubling time | Acceleration observed |
| Projection: 2-4 years | Days to weeks | Wide range of week-long software tasks |
| Projection: end of decade | Month-long projects | If current trends continue |
METR’s immediate trajectory focuses on methodology refinement and organizational scaling to meet growing demand for evaluation services. The organization is developing more sophisticated testing protocols for emerging risks, including multimodal AI systems that integrate text, image, and potentially action capabilities. Enhanced automation of evaluation processes could improve efficiency while maintaining rigor, enabling more comprehensive assessment of rapidly proliferating AI models.
Government integration represents a critical near-term priority, with METR working to establish evaluation capacity within national AI safety institutes while maintaining coordination with industry assessment practices. The development of standardized international protocols for dangerous capability evaluation could create the foundation for global AI governance frameworks, though national security considerations may limit information sharing and coordination.
The organization faces immediate scaling challenges as frontier labs develop increasingly capable models requiring more sophisticated evaluation. Hiring qualified evaluators, particularly those with specialized domain expertise, remains challenging given the limited pool of professionals with relevant skills. Training programs and methodology documentation could help expand evaluation capacity beyond METR itself, creating a broader ecosystem of qualified assessment organizations.
Medium-Term Evolution (2-5 Years)
Section titled “Medium-Term Evolution (2-5 Years)”The medium-term period will likely see METR’s evaluation frameworks tested by AI systems approaching human-level performance across multiple domains. Current evaluation methods may require fundamental revision as systems develop sophisticated strategies for capability hiding or evaluation gaming. The organization may need to develop entirely new assessment paradigms for evaluating AI systems that exceed human expert performance in dangerous capability domains.
Integration with interpretability and formal verification research could enhance evaluation effectiveness by providing insights into model internal representations and reasoning processes. If breakthrough progress occurs in AI interpretability, METR might incorporate direct examination of model cognition rather than relying solely on behavioral assessment. However, the timeline and feasibility of such integration remain highly uncertain.
The regulatory environment will likely evolve significantly, potentially creating mandatory evaluation requirements enforced by government agencies rather than voluntary industry self-regulation. METR’s role might shift from external consultant to integral component of government oversight infrastructure, requiring adaptation to different stakeholder priorities and accountability mechanisms.
Long-Term Strategic Questions
Section titled “Long-Term Strategic Questions”The long-term future of dangerous capability evaluation faces fundamental questions about scalability and adequacy for advanced AI systems. Current methodologies assume that dangerous capabilities can be identified and measured through targeted testing, but AI systems approaching artificial general intelligence might develop novel capabilities that transcend existing evaluation frameworks entirely.
The potential for recursive self-improvement in AI systems poses particular challenges for evaluation-based safety approaches. If AI systems begin autonomously improving their own capabilities, the timeline for capability emergence might compress beyond the reach of human evaluation cycles. METR’s current approach assumes sufficient lead time for assessment and mitigation, but this assumption might not hold for rapidly self-modifying systems.
The question of whether dangerous capability evaluation represents a sustainable approach to AI safety or merely a transitional measure pending more fundamental solutions remains open. While METR’s work provides essential near-term safety infrastructure, the organization’s ultimate contribution to long-term AI safety may depend on its ability to evolve evaluation paradigms for challenges that current methodologies cannot address.
Key Uncertainties and Research Questions
Section titled “Key Uncertainties and Research Questions”Key Questions (10)
- Can evaluation methodologies reliably detect dangerous capabilities in increasingly sophisticated AI systems that might actively hide abilities during testing?
- Will METR maintain sufficient independence from commercial interests to provide rigorous safety assessment as competitive pressures intensify?
- How should society determine thresholds for 'too dangerous' capabilities when risk tolerance varies across stakeholders and contexts?
- Can dangerous capability evaluation scale to match the pace of AI development while maintaining adequate depth and coverage?
- Should evaluation authority ultimately reside with private organizations, government agencies, or international bodies?
- What happens when labs disagree with METR's assessment and choose to deploy despite concerning evaluations?
- How can evaluation frameworks evolve to address novel dangerous capabilities that may emerge unexpectedly?
- Will the current voluntary compliance model prove sustainable under intense commercial pressure to deploy advanced AI systems?
- Can evaluation-based safety approaches remain effective for AI systems that exceed human expert performance in dangerous domains?
- What role should public transparency play in dangerous capability evaluation given legitimate security concerns about detailed disclosure?
Several fundamental uncertainties cloud METR’s future effectiveness and strategic direction. The organization’s ability to detect sophisticated capability hiding by advanced AI systems remains unproven, particularly as models develop more subtle deception strategies. The sustainability of current voluntary compliance frameworks under intense competitive pressure represents another critical unknown, with unclear consequences if labs choose to deploy despite concerning evaluations.
The technical challenge of evaluating AI systems that exceed human expert performance in dangerous domains has no clear solution, potentially rendering current assessment paradigms inadequate for the most advanced future systems. Whether evaluation-based approaches represent a sustainable long-term safety strategy or merely a transitional measure pending more fundamental breakthroughs in AI alignment and control remains an open question with profound implications for investment in current evaluation infrastructure.
Role and Adequacy of Dangerous Capability Evaluations (4 perspectives)
Dangerous capability evaluations represent critical safety infrastructure that should be mandatory for all frontier AI deployments. METR's work prevents catastrophic mistakes and provides objective foundations for deployment decisions. Expanding evaluation capacity is more urgent than developing alternative safety approaches.
Evaluations provide valuable safety information but cannot solve AI safety alone. Must be combined with alignment research, interpretability, governance, and other approaches. METR's work is crucial but represents one component of comprehensive safety strategy rather than complete solution.
Evaluation-based approaches might create dangerous overconfidence in AI safety while missing fundamental risks. Advanced systems could game evaluations or hide capabilities. Unknown unknowns dominate risk landscape. Might enable dangerous deployments by providing false legitimacy.
Evaluation requirements impose excessive delays on beneficial AI development. Current risk levels don't justify evaluation overhead. Benefits of rapid deployment outweigh speculative safety concerns. Market mechanisms and competition provide adequate safety incentives.
Sources
Section titled “Sources”- METR Official Website↗🔗 web★★★★☆METRmetr.orgSource ↗Notes - Organization mission, research focus, and publications
- About METR↗🔗 web★★★★☆METRAbout METRSource ↗Notes - Organization background and team information
- METR Research↗🔗 web★★★★☆METREvaluation MethodologySource ↗Notes - Full list of research outputs and evaluations
- GPT-5 Autonomy Evaluation Report↗🔗 webevaluations.metr.orgSource ↗Notes - Comprehensive evaluation methodology and findings
- GPT-5.1-Codex-Max Evaluation Report↗🔗 webevaluations.metr.orgSource ↗Notes - Latest frontier model assessment
- Measuring AI Ability to Complete Long Tasks (arXiv:2503.14499)↗📄 paper★★★☆☆arXivarXiv:2503.14499Thomas Kwa, Ben West, Joel Becker et al. (2025)Source ↗Notes - Time horizon measurement methodology
- RE-Bench: Evaluating frontier AI R&D capabilities (arXiv:2411.15114)↗📄 paper★★★☆☆arXivarXiv:2411.15114Hjalmar Wijk, Tao Lin, Joel Becker et al. (2024)Source ↗Notes - ML research engineering benchmark
- MALT Dataset↗🔗 web★★★★☆METRmetr.orgSource ↗Notes - Reward hacking and sandbagging detection research
- Common Elements of Frontier AI Safety Policies↗🔗 web★★★★☆METRMETR's analysis of 12 companiesSource ↗Notes - Analysis of 12 companies’ safety frameworks
- METR GPT-4.5 Pre-deployment Evaluations↗🔗 web★★★★☆METRmetr.orgSource ↗Notes - Pre-deployment evaluation methodology
- AXRP Episode 34 - AI Evaluations with Beth Barnes↗🔗 webAXRP Episode 34 - AI Evaluations with Beth BarnesSource ↗Notes - In-depth discussion of METR’s approach
- Beth Barnes - Safety evaluations and standards for AI (EA Forum)↗🔗 web★★★☆☆EA ForumBeth Barnes - Safety evaluations and standards for AI (EA Forum)Beth Barnes (2023)Source ↗Notes - EAG Bay Area 2023 talk
- METR - Wikipedia↗📖 reference★★★☆☆WikipediaMETR - WikipediaSource ↗Notes - Organization history and overview
- TIME: Nobody Knows How to Safety-Test AI↗🔗 web★★★☆☆TIMEDan HendrycksSource ↗Notes - Journalism coverage of AI evaluation challenges
What links here
- Capability Evaluationsconcept
- Apollo Researchlab-research
- FAR AIlab-research
- UK AI Safety Instituteorganization
- US AI Safety Instituteorganization
- ARC Evaluationsorganization
- Beth Barnesresearcher