Longterm Wiki

SWE-bench Pro Leaderboard - Scale AI

web

Useful for tracking the state of AI coding agent capabilities; relevant to discussions of AI autonomy, capability evaluations, and the pace of progress toward AI systems that can perform complex software engineering tasks independently.

Metadata

Importance: 45/100 · tool page · reference

Summary

SWE-bench Pro is a rigorous benchmark by Scale AI that evaluates AI agents on real-world software engineering tasks drawn from both public and private repositories. It addresses limitations of existing benchmarks by emphasizing realistic, challenging problem-solving scenarios. The leaderboard tracks and compares performance of leading AI coding agents.

Key Points

  • Evaluates AI agents on software engineering tasks sourced from public and private repositories, reducing benchmark contamination risks
  • Designed to address limitations of prior coding benchmarks by focusing on realistic, difficult problem-solving scenarios
  • Provides a public leaderboard ranking AI agent performance on software engineering tasks
  • Relevant for assessing current AI coding capabilities and tracking progress toward autonomous software development
  • Produced by Scale AI, a major player in AI data labeling and evaluation infrastructure

Review

SWE-Bench Pro represents a significant advancement in AI agent evaluation for software engineering tasks. By addressing critical limitations in existing benchmarks, such as data contamination, limited task diversity, and oversimplified problems, the benchmark offers a more authentic assessment of AI problem-solving capabilities.

The methodology involves a sophisticated four-stage workflow that carefully sources, creates, and augments software engineering challenges from diverse repositories. The benchmark's key innovation lies in its rigorous design, which includes three distinct dataset subsets: a public set, a commercial set, and a held-out set. This approach allows for comprehensive testing across different coding environments and provides a more nuanced understanding of AI agents' generalization abilities.

The results are striking, with top models like OpenAI GPT-5 and Claude Opus 4.1 scoring only around 23% on the public dataset, compared to 70%+ on previous benchmarks. This dramatic performance drop highlights the benchmark's increased complexity and its potential to drive meaningful improvements in AI software engineering capabilities.

Cited by 3 pages

Page                           Type        Quality
Autonomous Coding              Capability  63.0
Long-Horizon Autonomous Tasks  Capability  65.0
Tool Use and Computer Use      Capability  67.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 11 KB
SWE-Bench Pro Leaderboard AI Coding Benchmark (Public Dataset) | Scale Leaderboard

 Evaluating challenging long-horizon software engineering tasks in public open source repositories

 SWE-Bench Pro

 SWE-Bench Pro is a benchmark designed to provide a rigorous and realistic evaluation of AI agents for software engineering. It was developed to address several limitations in existing benchmarks by tackling four key challenges:

 Data Contamination: Models have likely seen the evaluation code during training, making it hard to know if they are problem-solving or recalling a memorized solution.

 Limited Task Diversity: Many benchmarks fail to capture the full spectrum of real-world software challenges and instead focus on simple utility libraries.

 Oversimplified Problems: Ambiguous or underspecified issues are often removed from benchmarks, which doesn't reflect a real developer's workflow.

 Unreliable and Irreproducible Testing: Inconsistent setups make it difficult to know if a solution truly works or if the environment is just configured incorrectly.

 SWE-Bench Pro addresses these gaps by sourcing tasks from diverse and complex codebases, including consumer applications, B2B services, and developer tools. To reduce contamination risk, the public and held-out OSS subsets use strong copyleft licenses (e.g., GPL). The private subset consists of proprietary codebases from startup partners. 

 The benchmark is significantly more challenging than its predecessors; top models score around 23% on the SWE-Bench Pro public set, compared to 70%+ on SWE-Bench Verified. This provides a more accurate measure of an agent’s true problem-solving capabilities in environments that mirror professional software development.

 Read the paper here: https://scale.com/research/swe_bench_pro 

 Check out the GitHub and view trajectories here: https://docent.transluce.org/dashboard/032fb63d-4992-4bfc-911d-3b7dafcb931f 

 Methodology

 Each problem in SWE-Bench Pro is created using a four-stage workflow:

 Sourcing: Repositories are selected from a curated set of public and private codebases.

 Environment Creation: Professional engineers build reproducible Docker-based environments, integrating all dependencies and build tools to ensure the codebase and tests run out-of-the-box.

 Harvesting: Problems are extracted via commit scraping. Pairs of consecutive commits are retained if they (a) fix a bug or introduce a feature, (b) demonstrate a fail-to-pass transition for new tests, and (c) include pass-to-pass tests confirming unrelated functionality remains intact.
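 The retention criteria for the harvesting stage can be sketched in code. This is a minimal illustration under stated assumptions: the `CommitPair` shape, `keep_pair`, and the test-runner interface are hypothetical, not Scale's actual harvesting tooling.

 ```python
 from dataclasses import dataclass
 from typing import Callable

 @dataclass
 class CommitPair:
     """A pair of consecutive commits: parent (before) and child (after)."""
     parent_sha: str
     child_sha: str
     new_tests: list[str]       # tests introduced by the child commit
     existing_tests: list[str]  # tests present in both commits

 # A TestRunner checks out a commit and reports pass/fail for each named test.
 TestRunner = Callable[[str, list[str]], dict[str, bool]]

 def keep_pair(pair: CommitPair, fixes_bug_or_adds_feature: bool,
               run_tests: TestRunner) -> bool:
     """Apply the three retention criteria from the harvesting stage."""
     if not fixes_bug_or_adds_feature:  # criterion (a): bug fix or new feature
         return False
     all_tests = pair.new_tests + pair.existing_tests
     before = run_tests(pair.parent_sha, all_tests)
     after = run_tests(pair.child_sha, all_tests)
     # (b) fail-to-pass: each new test fails on the parent, passes on the child
     fail_to_pass = all(not before[t] and after[t] for t in pair.new_tests)
     # (c) pass-to-pass: existing tests pass in both states
     pass_to_pass = all(before[t] and after[t] for t in pair.existing_tests)
     return fail_to_pass and pass_to_pass
 ```

 A pair is discarded if any new test already passed on the parent commit, since that would mean the test does not actually discriminate the fix.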

 Augmentation: Human experts organize unstructured commits and issue metadata into two artifacts: a problem statement and a requirements brief with an optional interface. These provide sufficient context to reproduce the gold patch without prescribing an implementation. We employ three human-in-the-loop checkpoints: (1)

... (truncated, 11 KB total)
Resource ID: 9dbe484d48b6787a | Stable ID: Yzc2OTk5OT