Technical Performance - 2025 AI Index Report
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Stanford HAI
This annual Stanford HAI report is widely cited by researchers and policymakers tracking AI capability trends; relevant to AI safety discussions about the pace of progress and the adequacy of current evaluation frameworks.
Metadata
Summary
The Stanford HAI 2025 AI Index Report documents rapid advances in AI technical performance, including accelerating benchmark saturation, convergence across frontier model capabilities, and the emergence of new reasoning paradigms. It provides a comprehensive empirical overview of where AI systems stand relative to human-level performance across diverse tasks. The report serves as a key annual reference for tracking the pace and direction of AI capability progress.
Key Points
- AI models are saturating established benchmarks faster than ever, compressing timelines between benchmark creation and near-human or superhuman performance.
- Frontier models from different developers are converging in capability levels, reducing differentiation across leading labs.
- New reasoning paradigms (e.g., chain-of-thought, test-time compute scaling) are emerging as important drivers of performance gains.
- The report tracks performance across domains including coding, math, science, and multimodal tasks, providing a broad empirical baseline.
- Rapid capability growth raises questions about evaluation methodology and whether existing benchmarks remain meaningful measures of AI progress.
Review
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Reasoning and Planning | Capability | 65.0 |
| Tool Use and Computer Use | Capability | 67.0 |
| Is Scaling All You Need? | Crux | 42.0 |
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
Technical Performance | The 2025 AI Index Report | Stanford HAI
02 Technical Performance
1. AI masters new benchmarks faster than ever.
In 2023, AI researchers introduced several challenging new benchmarks, including MMMU, GPQA, and SWE-bench, aimed at testing the limits of increasingly capable AI systems. By 2024, AI performance on these benchmarks saw remarkable improvements, with gains of 18.8 and 48.9 percentage points on MMMU and GPQA, respectively. On SWE-bench, AI systems could solve just 4.4% of coding problems in 2023—a figure that jumped to 71.7% in 2024.
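The figures above mix two framings: absolute percentage-point gains (MMMU, GPQA) and scores that also imply a large relative multiple (SWE-bench). A minimal sketch distinguishing the two, using the SWE-bench numbers quoted in the text:

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points between two benchmark scores."""
    return after - before

def relative_gain(before: float, after: float) -> float:
    """Relative improvement as a multiple of the starting score."""
    return after / before

# SWE-bench figures quoted in the text: 4.4% (2023) -> 71.7% (2024).
print(f"{pp_gain(4.4, 71.7):.1f} pp")    # 67.3 pp
print(f"{relative_gain(4.4, 71.7):.1f}x")  # 16.3x
```

The same 67.3-point jump reads very differently as a relative multiple, which is worth keeping in mind when comparing gains on benchmarks with very different starting scores.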
2. Open-weight models catch up.
Last year’s AI Index revealed that leading open-weight models lagged significantly behind their closed-weight counterparts. By 2024, this gap had nearly disappeared. In early January 2024, the leading closed-weight model outperformed the top open-weight model by 8.04% on the Chatbot Arena Leaderboard. By February 2025, this gap had narrowed to 1.70%.
3. The gap between Chinese and US models closes.
In 2023, leading American models significantly outperformed their Chinese counterparts—a trend that no longer holds. At the end of 2023, performance gaps on benchmarks such as MMLU, MMMU, MATH, and HumanEval were 17.5, 13.5, 24.3, and 31.6 percentage points, respectively. By the end of 2024, these differences had narrowed substantially to just 0.3, 8.1, 1.6, and 3.7 percentage points.
4. AI model performance converges at the frontier.
According to last year’s AI Index, the Elo score difference between the top and 10th-ranked model on the Chatbot Arena Leaderboard was 11.9%. By early 2025, this gap had narrowed to just 5.4%. Likewise, the difference between the top two models shrank from 4.9% in 2023 to just 0.7% in 2024. The AI landscape is becoming increasingly competitive, with high-quality models now available from a growing number of developers.
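The Arena gaps above are reported as percentage differences in Elo rating, but under the standard Elo model what matters is the raw rating difference, which maps to an expected head-to-head win rate. A minimal sketch of that conversion (the example ratings are hypothetical, not taken from the leaderboard):

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical frontier ratings: a 10-point Elo lead is nearly a coin flip,
# which is why small leaderboard gaps signal a crowded frontier.
print(round(elo_win_prob(1360, 1350), 3))  # ~0.514
```

This is why the shrinking gaps the report describes matter: a few Elo points between the top models translates to near-even odds in pairwise comparisons.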
5. New reasoning paradigms like test-time compute improve model performance.
In 2024, OpenAI introduced models like o1 and o3 that are designed to iteratively reason through their outputs. This test-time compute approach dramatically improved performance, with o1 scoring 74.4% on an International Mathematical Olympiad qualifying exam, compared to GPT-4o’s 9.3%. However, this enhanced reasoning comes at a cost: o1 is nearly six times more expensive and 30 times slower than GPT-4o.
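One simple way to see why spending more compute at inference time can buy accuracy is majority voting over repeated samples (self-consistency). The sketch below is a toy illustration of that idea, not the report's or OpenAI's method; `noisy_model` is a hypothetical stand-in for a stochastic reasoning sample:

```python
import random
from collections import Counter

def majority_vote(sample_fn, n_samples: int) -> str:
    """Self-consistency: draw n answers and return the most common one."""
    votes = Counter(sample_fn() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def noisy_model(rng: random.Random) -> str:
    """Hypothetical sampler: right 40% of the time, otherwise one of
    two wrong answers. No single sample is reliable on its own."""
    r = rng.random()
    return "correct" if r < 0.4 else ("wrong_a" if r < 0.7 else "wrong_b")

rng = random.Random(0)
# One sample is right only 40% of the time; voting over 101 samples
# usually recovers the correct answer, at ~100x the inference cost.
print(majority_vote(lambda: noisy_model(rng), 101))
```

The cost structure mirrors the report's observation: the accuracy gain is real, but it is paid for linearly in extra inference compute and latency.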
6. More challenging benchmarks are continually pro
... (truncated, 5 KB total)