Measuring AI Ability to Complete Long Tasks - METR
Credibility Rating
4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
Published by METR (Model Evaluation and Threat Research) in March 2025, this research is directly relevant to AI safety evaluations, informing thresholds for capability-based deployment decisions and governance frameworks.
Metadata
Importance: 78/100 · blog post · primary source
Summary
METR presents empirical research showing that AI models' ability to complete increasingly long autonomous tasks is growing exponentially, with the maximum task length that models can successfully complete roughly doubling every 7 months. This 'task length' metric serves as a practical proxy for measuring real-world AI capability progression and agentic autonomy.
Key Points
- AI task completion horizon (the longest tasks models can reliably complete) has been doubling approximately every 7 months across recent frontier models.
- The metric focuses on autonomous, multi-step task completion rather than narrow benchmarks, better reflecting real-world agentic capability.
- Exponential growth in task length has significant implications for estimating when AI could perform complex, extended work autonomously, including dangerous tasks.
- This trajectory suggests AI agents capable of weeks-long autonomous work may arrive sooner than expected, raising urgent safety and governance concerns.
- METR's approach provides a more practically meaningful capability metric than traditional benchmarks for tracking progress toward transformative AI.
Review
METR's research introduces an innovative approach to measuring AI capabilities by tracking the length of tasks generalist models can complete autonomously. By recording the time human experts take to complete various software and reasoning tasks, they developed a method to characterize AI models' performance across different task durations. Their key finding is a remarkably consistent exponential trend in AI task completion abilities, with a doubling time of around 7 months over the past six years.
The study's significance lies in bridging the gap between benchmark performance and real-world utility, highlighting that current AI models excel at short tasks but struggle with complex, extended projects. By extrapolating their trend, the researchers predict that within a decade, AI agents might independently complete substantial software tasks currently requiring days or weeks of human effort. While acknowledging methodological limitations and potential measurement errors, their sensitivity analyses suggest the trend remains robust, with implications for AI development, forecasting, and risk management.
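The extrapolation can be sanity-checked with a little arithmetic: under sustained exponential growth with a 7-month doubling time, the time needed to reach a target horizon is doubling_time × log2(target / current). A minimal sketch in Python (the ~1-hour current horizon and 40-hour target are illustrative assumptions, not figures from the post):

```python
import math

DOUBLING_MONTHS = 7  # doubling time reported by METR

def months_to_reach(current_hours, target_hours,
                    doubling_months=DOUBLING_MONTHS):
    """Months until the task horizon grows from current_hours to
    target_hours, assuming the exponential trend continues."""
    return doubling_months * math.log2(target_hours / current_hours)

# Illustrative (assumed) numbers: a ~1-hour horizon today, targeting
# a 40-hour (one work-week) horizon.
print(round(months_to_reach(1, 40), 1))  # → 37.3 months, roughly 3 years
```

Under these assumptions a one-work-week horizon arrives in roughly three years, comfortably inside the "under a decade" window the authors give for tasks that take humans days or weeks.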
Cited by 8 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Epoch AI | Organization | 51.0 |
| METR | Organization | 66.0 |
| Capability Elicitation | Approach | 91.0 |
| Dangerous Capability Evaluations | Approach | 64.0 |
| Scalable Eval Approaches | Approach | 65.0 |
| Tool-Use Restrictions | Approach | 91.0 |
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 14 KB
Measuring AI Ability to Complete Long Tasks - METR
Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
CONTRIBUTORS
Thomas Kwa, Ben West, Joel Becker, and 21 others
DATE
March 19, 2025
arXiv
BibTeX Citation
@misc{measuring-ai-ability-to-complete-long-tasks,
  title = {Measuring AI Ability to Complete Long Tasks},
  author = {Thomas Kwa, Ben West, Joel Becker, 21 others},
  howpublished = {\url{https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/}},
  year = {2025},
  month = {03},
}
[Interactive chart: task-completion time horizon by model, with log/linear scale and 50%/80% success-rate views]
Time Horizon 1.1 (Current): Follows the same methodology described in the initial paper, but with a larger task suite. See the release announcement.
Time Horizon 1.0 (Mar 2025): Original time horizon computations, calculated for models from 2019 through Nov 2025, following the methods described in the original time horizon paper.
Analysis code is available on GitHub. Raw data is available here.
This is our most up-to-date measurement of the task-completion time horizons for public language models. For methodology details and FAQs, see our dedicated time horizons page.
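The 50%/80% toggles reflect how a time horizon is defined: fit a success-probability curve against human task length, then read off the length at which the curve crosses a chosen success rate. A toy sketch of that idea using a hand-rolled logistic fit on invented data (an illustration of the general technique, not METR's actual pipeline or task data):

```python
import math

def fit_time_horizon(samples, target=0.5, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a - b*log2(minutes)) by gradient
    descent on cross-entropy loss, then return the task length at
    which the fitted curve crosses `target` (0.5 = 50% horizon)."""
    a, b = 0.0, 1.0
    n = len(samples)
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, ok in samples:
            x = math.log2(minutes)
            p = 1 / (1 + math.exp(-(a - b * x)))
            err = p - ok          # gradient of cross-entropy w.r.t. logit
            ga += err
            gb += -err * x
        a -= lr * ga / n
        b -= lr * gb / n
    # Solve sigmoid(a - b*x) = target for x, convert back to minutes.
    x = (a - math.log(target / (1 - target))) / b
    return 2 ** x

# Invented toy data (minutes, succeeded?): short tasks mostly
# succeed, long ones mostly fail.
data = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 0), (30, 1),
        (60, 0), (120, 0), (240, 0)]
print(round(fit_time_horizon(data), 1))
```

The 80% horizon is necessarily shorter than the 50% horizon for the same fit, which is why the two views of the chart differ.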
Summary
The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can
... (truncated, 14 KB total)
Resource ID: 271fc5f73a8304b2 | Stable ID: sid_WJiOxw2Dcv