Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
METR is a leading third-party AI safety evaluation organization whose work on autonomous capability benchmarks and catastrophic risk assessments directly informs AI lab safety policies and government AI governance frameworks.
Metadata
Summary
METR is an organization that conducts research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. It developed the 'Time Horizon' metric, which measures the length of software tasks AI agents can complete autonomously and which has grown exponentially over recent years. METR works with major AI developers including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.
Key Points
- Developed the 'Task-Completion Time Horizon' metric, which shows an exponential increase in the length of tasks AI agents can complete autonomously over the past six years.
- Conducts third-party evaluations for frontier AI labs, assessing risks from AI self-improvement, rogue replication, and sabotage of oversight.
- Found empirically that AI tools can make experienced open-source developers 19% slower, challenging assumptions about productivity gains.
- Focuses on evaluating broad autonomous capabilities and AI's ability to accelerate AI R&D, key precursors to loss-of-control scenarios.
- Studies AI behaviors that threaten evaluation integrity and develops mitigations for such behaviors.
Cited by 34 pages
Cached Content Preview
METR
Model Evaluation & Threat Research
METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.
Time Horizon 1.1 (Current): follows the same methodology described in the initial paper, but with a larger task suite. See the release announcement.
Time Horizon 1.0 (Mar 2025): the original time horizon computations, calculated for models from 2019 through Nov 2025 following the methods described in the original time horizon paper.
[Interactive chart: log/linear scale and 50%/80% success-threshold toggles]
Task-Completion Time Horizons of Frontier AI Models
We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase in this time horizon metric over the past 6 years.
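The page doesn't spell out how a horizon value is computed for a given model, but the underlying idea can be illustrated in a few lines. The sketch below, using invented success data and scikit-learn (my choice of tooling, not necessarily METR's), fits a logistic curve of success probability against log task length and reads off the length at which predicted success crosses 50%; an 80% horizon would use the same fit with a different threshold.

```python
# Sketch of a 50% time-horizon estimate for one model, on made-up data.
# Assumption: success probability falls off roughly logistically in log
# task length; this mirrors the spirit of the metric, not METR's exact code.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical task results: (human time-to-complete in minutes, success flag)
lengths = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

X = np.log(lengths).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# p(success) = sigmoid(b + w * log(t)) equals 0.5 where b + w * log(t) = 0,
# so the 50% horizon is t = exp(-b / w). An 80% horizon would instead solve
# b + w * log(t) = log(0.8 / 0.2).
b, w = clf.intercept_[0], clf.coef_[0, 0]
print(f"Estimated 50% time horizon: {np.exp(-b / w):.0f} minutes")
```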
Featured research
Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior that threatens the integrity of evaluations and mitigations for such behavior.
GPT-5.1 Evaluation Results
We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.
Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing.
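To make the "consistently exponentially increasing" claim concrete: an exponential trend is a straight line in log space, so a doubling time falls out of a simple linear fit. A toy sketch follows, with entirely invented horizon values rather than METR's measurements.

```python
# Toy trend fit: log2(time horizon) vs. release date. All numbers invented.
import numpy as np

release_year = np.array([2019.5, 2020.5, 2022.0, 2023.2, 2024.3, 2025.0])
horizon_min = np.array([0.1, 0.5, 2.0, 10.0, 40.0, 120.0])  # hypothetical

# The slope is doublings per year, so 12 / slope is the doubling time in months.
slope, _ = np.polyfit(release_year, np.log2(horizon_min), 1)
print(f"Fitted doubling time: {12.0 / slope:.1f} months")
```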
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
We found that when developers used AI tools in early 2025, they took 19% longer than without—AI made them slower.
MALT
A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, such as generalized reward hacking or sandbagging.
Measuring autonomous AI capabilities — resource collection
An index of our research and guidance on how to measure AI systems' ability to autonomously complete a wide range of multi-hour tasks.
... (truncated, 8 KB total)