Measuring AI Ability to Complete Long Tasks - METR

web

METR·metr.org/blog/2025-03-19-measuring-ai-ability-to-complete...

Credibility Rating

4/5

High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

Data Status

Full text fetched (Dec 28, 2025)

Summary

Research by METR shows that the length of tasks AI models can complete autonomously, measured by how long those tasks take human experts, is growing exponentially, doubling approximately every 7 months. This metric ties benchmark results to real-world capability progression.

Key Points

•The length of tasks AI models can complete (measured in human expert time) doubles approximately every 7 months
•Current models reliably complete tasks that take human experts under 4 minutes, but struggle with longer tasks
•If the exponential trend continues, AI agents could autonomously handle tasks that take humans days or weeks within a decade
•Novel methodology links benchmark performance to real-world task completion

Review

METR's research measures AI capabilities by tracking the length of tasks that generalist models can complete autonomously. By recording how long human experts take to complete a range of software and reasoning tasks, the authors characterize each model's performance as a function of task duration. Their key finding is a consistent exponential trend: the length of tasks AI models can complete has doubled roughly every 7 months over the past six years.

The study's significance lies in bridging the gap between benchmark performance and real-world utility: current AI models excel at short tasks but struggle with complex, extended projects. Extrapolating the trend, the researchers predict that within a decade AI agents might independently complete substantial software tasks that currently require days or weeks of human effort. The authors acknowledge methodological limitations and potential measurement error, but their sensitivity analyses suggest the trend is robust, with implications for AI development, forecasting, and risk management.
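The extrapolation described in this review reduces to doubling arithmetic, which the minimal Python sketch below illustrates. Only the 7-month doubling time comes from the summary. The 60-minute starting horizon, the 40-hour work week, and the function names `horizon_after` and `months_until` are illustrative assumptions, not figures or code from METR.

```python
import math

# Doubling time reported in the summary (months).
DOUBLING_MONTHS = 7.0


def horizon_after(start_minutes: float, months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Task length (in human-minutes) reached after `months` of steady doubling."""
    return start_minutes * 2 ** (months / doubling_months)


def months_until(start_minutes: float, target_minutes: float,
                 doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months of steady doubling needed to grow from start to target task length."""
    return doubling_months * math.log2(target_minutes / start_minutes)


if __name__ == "__main__":
    start = 60.0            # ASSUMPTION: hypothetical 1-hour starting horizon
    work_week = 40 * 60.0   # ASSUMPTION: a "week-long" task is 40 working hours
    print(f"Months to reach a work-week task: {months_until(start, work_week):.1f}")
    print(f"Horizon after 14 months: {horizon_after(start, 14):.0f} minutes")
```

Under these assumed inputs, growing from a 1-hour to a 40-hour task length takes about 37 months (7 × log2(40)). The result is sensitive to both the starting horizon and the doubling time, so it should be read as an illustration of the trend's arithmetic rather than a forecast.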

Cited by 10 pages

Page	Type	Quality
Long-Horizon Autonomous Tasks	Capability	65.0
Self-Improvement and Recursive Enhancement	Capability	69.0
Epoch AI	Organization	51.0
METR	Organization	66.0
Capability Elicitation	Approach	91.0
Dangerous Capability Evaluations	Approach	64.0
Responsible Scaling Policies	Policy	62.0
Scalable Eval Approaches	Approach	65.0
Tool-Use Restrictions	Approach	91.0
Emergent Capabilities	Risk	61.0

Resource ID: 271fc5f73a8304b2 | Stable ID: NzE4Y2Q4ZT