Longterm Wiki

METR: Model Evaluation and Threat Research

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

METR is a leading third-party AI safety evaluation organization whose work on autonomous capability benchmarks and catastrophic risk assessments directly informs AI lab safety policies and government AI governance frameworks.

Metadata

Importance: 82/100 · homepage

Summary

METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. It developed the 'Time Horizon' metric, which measures the length of software tasks AI agents can complete autonomously and which has grown exponentially over recent years. METR works with major AI labs, including OpenAI, Anthropic, and Amazon, to evaluate catastrophic risk potential.

Key Points

  • Developed the 'Task-Completion Time Horizon' metric, which shows an exponential increase over the past six years in the length of software tasks AI agents can complete autonomously.
  • Conducts third-party evaluations for frontier AI labs assessing risks from AI self-improvement, rogue replication, and sabotage of oversight.
  • Empirical finding that AI tools can make experienced open-source developers 19% slower, challenging assumptions about productivity gains.
  • Focuses on evaluating broad autonomous capabilities and AI's ability to accelerate AI R&D, key precursors to loss-of-control scenarios.
  • Studies potential AI behaviors that threaten evaluation integrity and develops mitigations for such behaviors.

Cited by 34 pages

Cached Content Preview

HTTP 200 · Fetched Apr 12, 2026 · 8 KB
METR

Model Evaluation & Threat Research

METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.

Our research · Careers

We've worked with [partner logos]

Time Horizon 1.1 (Current): Follows the same methodology described in the initial paper, but with a larger task suite. See release announcement.

Time Horizon 1.0 (Mar 2025): Original time horizon computations, calculated for models from 2019 through Nov 2025, following the methods described in the original time horizon paper.

[Interactive chart: task-completion time horizon over time, with log/linear scale and 50%/80% success-rate views]

Task-Completion Time Horizons of Frontier AI Models

We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase in this time horizon metric over the past six years.

Read paper · View repo
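
As a rough illustration of how a time horizon can be computed: fit a logistic curve of success probability against log task length, then solve for the task length at which the fitted success rate equals the target. The sketch below uses made-up per-task results and an assumed logistic fit; it is not METR's task suite or published fitting code.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (task_length_minutes, succeeded) records for one model.
results = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 0),
    (30, 1), (30, 0), (60, 1), (60, 0), (120, 0), (240, 0),
]

# Regress success against log2(task length).
X = np.log2([minutes for minutes, _ in results]).reshape(-1, 1)
y = np.array([ok for _, ok in results])
model = LogisticRegression().fit(X, y)

b0 = model.intercept_[0]  # log-odds of success on a 1-minute task
b1 = model.coef_[0, 0]    # change in log-odds per doubling of task length

def horizon(p):
    # Task length (minutes) at which fitted success probability equals p.
    return 2 ** ((np.log(p / (1 - p)) - b0) / b1)

print(f"50% horizon: {horizon(0.5):.0f} min; 80% horizon: {horizon(0.8):.0f} min")

Computing such horizons for successive model releases is what yields the trend plotted above; a model's 80% horizon is necessarily shorter than its 50% horizon.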
 
 
 
 

 
 
Featured research

Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior that threatens the integrity of evaluations, along with mitigations for such behavior.

View all research
 
 
 

 
 
 
 
 
 
 
 GPT-5.1 Evaluation Results

 We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.

 
 Read more 
 
 
 
 
 
 
 
 Measuring AI Ability to Complete Long Tasks

 We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing.

 
 Read more 
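
To make the exponential claim concrete: if the horizon grows exponentially, a straight-line fit of log2(horizon) against release date gives doublings per year, and hence a doubling time. A minimal sketch with made-up placeholder points (not METR's measured values):

import numpy as np

# Hypothetical (release_year, horizon_minutes) points.
releases = [(2019.8, 0.1), (2021.5, 1.0), (2023.2, 8.0), (2024.9, 60.0)]

years = np.array([year for year, _ in releases])
log2_horizon = np.log2([h for _, h in releases])

# Slope of log2(horizon) vs. time = doublings per year.
slope, intercept = np.polyfit(years, log2_horizon, 1)
print(f"{slope:.1f} doublings/year, doubling time ~{12 / slope:.0f} months")

The actual doubling time is reported in the paper linked above; the point here is only the shape of the computation.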
 
 
 
 
 
 
 
 Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We found that when developers used AI tools in early 2025, they took 19% longer than without them: AI made them slower.

 
 Read more 
 
 
 
 
 
 
 
 
 
 MALT

A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, like generalized reward hacking or sandbagging.

 
 Read more 
 
 
 
 
 
 
 
 
 
 Measuring autonomous AI capabilities — resource collection

An index of our research and guidance on how to measure AI systems' ability to autonomously complete a wide range of multi-hour tasks.

 
 Read more 
 
 
 


... (truncated, 8 KB total)
Resource ID: 45370a5153534152 | Stable ID: sid_uWA7hP6A14