Longterm Wiki

Credibility Rating

4/5
High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

Metadata

Cited by 2 pages

Page | Type | Quality
Eval Saturation & The Evals Gap | Approach | 65.0
Scalable Eval Approaches | Approach | 65.0

Cached Content Preview

HTTP 200 | Fetched May 31, 2026 | 15 KB
Time Horizon 1.1 - METR

Time Horizon 1.1

Date: January 29, 2026

BibTeX citation:

@misc{metr-2026-time-horizon-1-1,
  title = {Time Horizon 1.1},
  author = {METR},
  howpublished = {\url{https://metr.org/blog/2026-1-29-time-horizon-1-1/}},
  year = {2026},
  month = {01},
}

 We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure. 

We have updated our time-horizon estimates for many models. The new estimates generally fall within our existing confidence intervals, though the trend in time-horizon growth looks somewhat different, as discussed below. We expect to make further changes to our evaluation protocols so that we can keep capturing the continued rapid growth in capabilities.

 

 

 Early in 2025 we published our time-horizon methodology for measuring the autonomous capabilities of AI models. 

 We found a steady exponential increase in models’ human-equivalent “time horizon.” Over the course of 2025 we applied this methodology to newer models and measured a rate of increase consistent with historical trends.
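As a rough illustration of what a steady exponential increase implies, the sketch below computes the doubling time from two (date, time horizon) measurements and extrapolates forward. The function names and the numbers in the usage lines are hypothetical and are not METR's estimates or METR's fitting procedure; the only assumption is pure exponential growth between the two points.

```python
import math


def doubling_time(t0: float, h0: float, t1: float, h1: float) -> float:
    """Doubling time implied by two (time, horizon) points.

    Assumes pure exponential growth, h(t) = h0 * 2 ** ((t - t0) / d),
    and solves for d. Times can be in any unit (e.g. months).
    """
    if h0 <= 0 or h1 <= 0:
        raise ValueError("time horizons must be positive")
    if t1 == t0 or h1 == h0:
        raise ValueError("need two distinct times and two distinct horizons")
    return (t1 - t0) * math.log(2) / math.log(h1 / h0)


def extrapolate(h0: float, elapsed: float, doubling: float) -> float:
    """Horizon after `elapsed` time units, given a doubling time."""
    return h0 * 2 ** (elapsed / doubling)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow:
# a horizon that grows from 1 hour to 4 hours over 12 months.
d = doubling_time(t0=0, h0=1.0, t1=12, h1=4.0)
projected = extrapolate(h0=4.0, elapsed=6, doubling=d)
print(f"doubling time: {d:.1f} months, projected horizon: {projected:.1f} hours")
```

In practice a trend like this would be fit across many models with a regression on the log of the time horizon rather than from two points, which is why this is only a sketch of the underlying relationship.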

 We are rolling out two significant changes to our time-horizon evaluation setup: 

 
 
Improvements to our task suite. We increased our suite from 170 to 228 tasks. We added 73 tasks (all from HCAST, described in Rein (2025)), removed 15 tasks, and updated 53 tasks (27 had an updated definition, 13 had an updated human time estimate, and 13 had both). We increased the number of long tasks (estimated to take humans 8 or more hours) from 14 to 31. The additions are HCAST tasks that were not included in the original time horizon paper but subsequently passed our quality-check processes. The modifications and removals generally cover cases where a task description was confusing or easy to reward-hack, or where the scoring function had errors.

 

 
 A move of our evaluation infrastructure from Vivaria to Inspect. We developed Vivaria in-house in 2023. Inspe

... (truncated, 15 KB total)
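
The task-suite counts quoted in the preview above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the class and variable names are illustrative and do not come from METR's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteChange:
    """Counts describing a change to a task suite (names are illustrative)."""
    original: int
    added: int
    removed: int

    def new_total(self) -> int:
        return self.original + self.added - self.removed


# Figures as stated in the post.
th11 = SuiteChange(original=170, added=73, removed=15)
updated = {"definition_only": 27, "time_estimate_only": 13, "both": 13}

# 170 + 73 - 15 should match the stated new suite size of 228.
assert th11.new_total() == 228
# 27 + 13 + 13 should match the stated 53 updated tasks.
assert sum(updated.values()) == 53

# Share of long tasks (8+ hours of estimated human time) before and after.
long_share_before = 14 / th11.original
long_share_after = 31 / th11.new_total()
print(f"long tasks: {long_share_before:.1%} -> {long_share_after:.1%}")
```

Both stated totals are internally consistent, and the share of long tasks rises from roughly 8% to roughly 14% of the suite.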
Resource ID: 31f7789a28cb2bec | Stable ID: sid_nvX3VpXyzQ