Time Horizon 1.1 - METR
web | Credibility Rating: 4/5
High (4): High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: METR
Metadata
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Eval Saturation & The Evals Gap | Approach | 65.0 |
| Scalable Eval Approaches | Approach | 65.0 |
Cached Content Preview
HTTP 200 | Fetched May 31, 2026 | 15 KB
Time Horizon 1.1
DATE
January 29, 2026
BibTeX Citation
@misc{metr-2026-time-horizon-1-1,
  title = {Time Horizon 1.1},
  author = {METR},
  howpublished = {\url{https://metr.org/blog/2026-1-29-time-horizon-1-1/}},
  year = {2026},
  month = {01},
}
We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.
Our estimates of time horizons for many models have been updated. The new estimates generally fall within our existing confidence intervals, though the trend in time-horizon growth looks a little different, as discussed below. We expect to make more changes to our evaluation protocols so that we can capture the continued rapid growth in capabilities.
Early in 2025 we published our time-horizon methodology for measuring the autonomous capabilities of AI models.
We found a steady exponential increase in models’ human-equivalent “time horizon.” Over the course of 2025 we applied this methodology to newer models and measured a rate of increase consistent with historical trends.
We are rolling out two significant changes to our time-horizon evaluation setup:
Improvements to our task suite. We increased our suite from 170 to 228 tasks. We added 73 tasks (all from HCAST, described in Rein (2025)), removed 15 tasks, and updated 53 tasks (27 tasks had an updated definition, 13 tasks had an updated human time estimate, and 13 had both). We increased the number of long tasks (estimated to take humans 8 or more hours) from 14 to 31. The additions are HCAST tasks that were not included in the original time horizon paper but subsequently passed our quality-check processes. The modifications and removals generally reflect cases where a task description was confusing or easy to reward-hack, or where the scoring function had errors.
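The task-suite counts quoted above can be checked with a short calculation. This is only an illustrative sketch: the variable names are hypothetical, and the long-task shares it prints are derived here from the quoted counts rather than reported in the post.

```python
# Counts as quoted in the task-suite paragraph; variable names are illustrative.
original_tasks = 170
added = 73
removed = 15

updated_definition_only = 27
updated_time_estimate_only = 13
updated_both = 13

long_tasks_before = 14  # tasks estimated to take humans 8 or more hours
long_tasks_after = 31

# Net suite size after additions and removals.
new_suite_size = original_tasks + added - removed

# The three update categories should sum to the quoted total of 53.
updated_total = updated_definition_only + updated_time_estimate_only + updated_both

# Derived here (not stated in the post): share of long tasks before and after.
long_share_before = long_tasks_before / original_tasks
long_share_after = long_tasks_after / new_suite_size

print(f"New suite size: {new_suite_size}")
print(f"Updated tasks:  {updated_total}")
print(f"Long-task share: {long_share_before:.1%} -> {long_share_after:.1%}")
```

The quoted figures are internally consistent: 170 + 73 - 15 gives 228 tasks, and 27 + 13 + 13 gives the 53 updated tasks.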
A move of our evaluation infrastructure from Vivaria to Inspect. We developed Vivaria in-house in 2023. Inspe
... (truncated, 15 KB total)
Resource ID: 31f7789a28cb2bec | Stable ID: sid_nvX3VpXyzQ