Measuring AI Ability to Complete Long Tasks - METR
Credibility Rating
4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
Published by METR (Model Evaluation and Threat Research) in March 2025, this research is directly relevant to AI safety evaluations, informing thresholds for capability-based deployment decisions and governance frameworks.
Metadata
Importance: 78/100 · blog post · primary source
Summary
METR presents empirical research showing that AI models' ability to complete increasingly long autonomous tasks is growing exponentially, with the maximum task length that models can successfully complete roughly doubling every 7 months. This 'task length' metric serves as a practical proxy for measuring real-world AI capability progression and agentic autonomy.
Key Points
- AI task completion horizon (the longest tasks models can reliably complete) has been doubling approximately every 7 months across recent frontier models.
- The metric focuses on autonomous, multi-step task completion rather than narrow benchmarks, better reflecting real-world agentic capability.
- Exponential growth in task length has significant implications for estimating when AI could perform complex, extended work autonomously, including dangerous tasks.
- This trajectory suggests AI agents capable of weeks-long autonomous work may arrive sooner than expected, raising urgent safety and governance concerns.
- METR's approach provides a more practically meaningful capability metric than traditional benchmarks for tracking progress toward transformative AI.
Review
METR's research introduces an innovative approach to measuring AI capabilities by tracking the length of tasks generalist models can complete autonomously. By recording the time human experts take to complete various software and reasoning tasks, they developed a method to characterize AI models' performance across different task durations. Their key finding is a remarkably consistent exponential trend in AI task completion abilities, with a doubling time of around 7 months over the past six years.
The study's significance lies in bridging the gap between benchmark performance and real-world utility, highlighting that current AI models excel at short tasks but struggle with complex, extended projects. By extrapolating their trend, the researchers predict that within a decade, AI agents might independently complete substantial software tasks currently requiring days or weeks of human effort. While acknowledging methodological limitations and potential measurement errors, their sensitivity analyses suggest the trend remains robust, with implications for AI development, forecasting, and risk management.
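The extrapolation can be sanity-checked with a little arithmetic: under sustained exponential growth with a 7-month doubling time, the time needed to reach a target horizon is doubling_time × log2(target / current). A minimal sketch in Python (the ~1-hour current horizon and 40-hour target are illustrative assumptions, not figures from the post):

```python
import math

DOUBLING_MONTHS = 7  # doubling time reported by METR

def months_to_reach(current_hours, target_hours,
                    doubling_months=DOUBLING_MONTHS):
    """Months until the task horizon grows from current_hours to
    target_hours, assuming the exponential trend continues."""
    return doubling_months * math.log2(target_hours / current_hours)

# Illustrative (assumed) numbers: a ~1-hour horizon today, targeting
# a 40-hour (one work-week) horizon.
print(round(months_to_reach(1, 40), 1))  # → 37.3 months, roughly 3 years
```

Under these assumptions a one-work-week horizon arrives in roughly three years, comfortably inside the "under a decade" window the authors give for tasks that take humans days or weeks.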
Cited by 8 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Epoch AI | Organization | 51.0 |
| METR | Organization | 66.0 |
| Capability Elicitation | Approach | 91.0 |
| Dangerous Capability Evaluations | Approach | 64.0 |
| Scalable Eval Approaches | Approach | 65.0 |
| Tool-Use Restrictions | Approach | 91.0 |
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 14 KB
Measuring AI Ability to Complete Long Tasks - METR
Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
CONTRIBUTORS
Thomas Kwa, Ben West, Joel Becker, and 21 others
DATE
March 19, 2025
arXiv
BibTeX Citation
@misc{measuring-ai-ability-to-complete-long-tasks,
  title = {Measuring AI Ability to Complete Long Tasks},
  author = {Thomas Kwa, Ben West, Joel Becker, 21 others},
  howpublished = {\url{https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/}},
  year = {2025},
  month = {03},
}
[Interactive chart: task-completion time horizon by model, with log/linear scale and 50%/80% success-rate views]
Time Horizon 1.1 (Current): Follows the same methodology described in the initial paper, but with a larger task suite. See the release announcement.
Time Horizon 1.0 (Mar 2025): Original time horizon computations, calculated for models from 2019 through Nov 2025, following the methods described in the original time horizon paper.
Analysis code is available on GitHub. Raw data is available here.
This is our most up-to-date measurement of the task-completion time horizons for public language models. For methodology details and FAQs, see our dedicated time horizons page.
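The 50%/80% toggles reflect how a time horizon is defined: fit a success-probability curve against human task length, then read off the length at which the curve crosses a chosen success rate. A toy sketch of that idea using a hand-rolled logistic fit on invented data (an illustration of the general technique, not METR's actual pipeline or task data):

```python
import math

def fit_time_horizon(samples, target=0.5, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a - b*log2(minutes)) by gradient
    descent on cross-entropy loss, then return the task length at
    which the fitted curve crosses `target` (0.5 = 50% horizon)."""
    a, b = 0.0, 1.0
    n = len(samples)
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, ok in samples:
            x = math.log2(minutes)
            p = 1 / (1 + math.exp(-(a - b * x)))
            err = p - ok          # gradient of cross-entropy w.r.t. logit
            ga += err
            gb += -err * x
        a -= lr * ga / n
        b -= lr * gb / n
    # Solve sigmoid(a - b*x) = target for x, convert back to minutes.
    x = (a - math.log(target / (1 - target))) / b
    return 2 ** x

# Invented toy data (minutes, succeeded?): short tasks mostly
# succeed, long ones mostly fail.
data = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 0), (30, 1),
        (60, 0), (120, 0), (240, 0)]
print(round(fit_time_horizon(data), 1))
```

The 80% horizon is necessarily shorter than the 50% horizon for the same fit, which is why the two views of the chart differ.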
Summary
The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can
... (truncated, 14 KB total)
Resource ID: 271fc5f73a8304b2 | Stable ID: sid_WJiOxw2Dcv