SWE-bench Pro Leaderboard - Scale AI
web
Data Status
Full text fetched
Fetched Dec 28, 2025
Summary
SWE-Bench Pro provides a comprehensive evaluation of AI agents' software engineering skills by sourcing tasks from public and private repositories. The benchmark addresses key limitations in existing benchmarks by focusing on realistic, challenging problem-solving scenarios.
Key Points
- Addresses major limitations in existing software engineering AI benchmarks
- Uses diverse, complex repositories from public and private sources
- Reveals significant performance gaps among AI models
- Provides a more realistic measure of AI problem-solving capabilities
Review
SWE-Bench Pro is a significant advance in evaluating AI agents on software engineering tasks. It targets three limitations of existing benchmarks: data contamination, limited task diversity, and oversimplified problems. Tasks are built through a four-stage workflow that sources, creates, and augments software engineering challenges from diverse repositories. The benchmark is split into three subsets: a public set, a commercial set, and a held-out set. This split allows testing across different coding environments and gives a clearer picture of how well AI agents generalize. The results are striking: top models such as OpenAI GPT-5 and Claude Opus 4.1 score only around 23% on the public set, compared with over 70% on previous benchmarks. This drop reflects the benchmark's greater difficulty and its potential to drive meaningful improvements in AI software engineering capabilities.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Autonomous Coding | Capability | 63.0 |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Tool Use and Computer Use | Capability | 67.0 |
Resource ID: 9dbe484d48b6787a | Stable ID: Yzc2OTk5OT