Time Horizon 1.1 - METR
Fetched Apr 30, 2026 (HTTP 200, 16 KB)
**We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.**
Our time-horizon estimates for many models have been updated. The new estimates generally fall within our existing confidence intervals, though the trend in time-horizon growth looks slightly different, as discussed below. We expect to make further changes to our evaluation protocols so that we can capture the continued rapid growth in capabilities.
* * *
**Early in 2025 we published our [time-horizon methodology](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) for measuring the autonomous capabilities of AI models.**
We found a steady exponential increase in models’ human-equivalent “time horizon.” Over the course of 2025 we applied this methodology to newer models and measured a rate of increase consistent with historical trends.
**We are rolling out two significant changes to our time-horizon evaluation setup:**
1. Improvements to our task suite. We increased our suite from 170 to 228 tasks. We added 73 tasks (all from HCAST, described in [Rein (2025)](https://metr.org/hcast.pdf)), removed 15 tasks, and updated 53 tasks (27 had an updated definition, 13 had an updated human time estimate, and 13 had both). We increased the number of long tasks (estimated to take humans 8 or more hours) from 14 to 31. The additions are HCAST tasks that were not included in the original time horizon paper but subsequently passed our quality-check process. The modifications and removals generally address cases where a task description was confusing, the task was easy to reward-hack, or the scoring function had errors.
2. Migration of our evaluation infrastructure from Vivaria to Inspect. We developed Vivaria in-house in 2023; Inspect is a widely adopted open-source framework for AI evaluations developed by the UK AI Security Institute.
**The estimated time horizon for each model has changed somewhat.**
We have re-estimated the effective time horizons for 14 models, using performance on our new TH1.1 task suite and evaluation infrastructure. The new estimates generally lie within the confidence intervals from the TH1 time horizons.
We re-estimated horizons for only 14 of the 33 models that had TH1 estimates. We restricted TH1.1 to this smaller set for several reasons: (i) some models are no longer publicly available, (ii) some would require significant changes to the tool-calling scaffold (e.g. GPT-2, GPT-3, and GPT-3.5), and (iii) some were far from the capability frontier at the time of release, so re-evaluating them would be unlikely to change the estimated trend.
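One standard way to attach confidence intervals to a per-model horizon estimate is to bootstrap over tasks: resample the task set with replacement, re-estimate the horizon on each resample, and take percentiles of the resulting distribution. The sketch below uses synthetic data and a crude fixed-slope grid-search estimator, purely to illustrate the resampling logic, not METR's actual estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Synthetic per-task records: log2(human minutes) and whether the model
# succeeded. Successes are drawn so the true 50% horizon is 32 minutes
# (log2 = 5).
log2_minutes = rng.uniform(0, 9, size=N)
succeeded = rng.random(N) < 1 / (1 + np.exp(log2_minutes - 5))

def horizon_estimate(log2_t, ok):
    """Crude 50% horizon: maximum-likelihood grid search over candidate
    horizons for a unit-slope logistic success curve."""
    candidates = np.linspace(0, 9, 181)
    def loglik(h):
        p = np.clip(1 / (1 + np.exp(log2_t - h)), 1e-9, 1 - 1e-9)
        return np.sum(np.where(ok, np.log(p), np.log1p(-p)))
    return candidates[np.argmax([loglik(h) for h in candidates])]

# Bootstrap over tasks: resample indices with replacement, re-estimate.
boot = []
for _ in range(200):
    idx = rng.integers(0, N, size=N)
    boot.append(horizon_estimate(log2_minutes[idx], succeeded[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% CI on the 50% horizon: {2 ** lo:.0f}-{2 ** hi:.0f} minutes")
```

Resampling whole tasks (rather than individual model attempts) keeps the interval honest about which tasks happened to be in the suite, which matters here since the TH1.1 suite itself changed between versions.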
Changes in the estimated ti
... (truncated, 16 KB total)