SWE-bench Verified - OpenAI
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
SWE-bench Verified is a curated subset of the SWE-bench coding benchmark, important for those evaluating the real-world software engineering capabilities of AI agents, especially as agentic systems become more prominent in safety-relevant deployment contexts.
Metadata
Summary
OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verified. This improved benchmark provides more reliable and fair evaluations of AI models' ability to solve real-world software engineering tasks. It addresses concerns that inflated or misleading scores on the original benchmark obscured true model capabilities.
Key Points
- Original SWE-bench contained problematic test cases that could produce misleading scores; SWE-bench Verified filters these out via human review.
- Human software developers were recruited to validate samples, ensuring tasks are solvable and test requirements are unambiguous.
- The verified subset enables more meaningful comparison of AI coding agents on realistic GitHub issue-resolution tasks (a loading sketch follows this list).
- Improves benchmark integrity by removing underspecified, flawed, or untestable samples from the evaluation set.
- Relevant for tracking progress in agentic AI coding capabilities, a key frontier for AI safety and deployment considerations.
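The verified split is distributed as a public dataset, so a quick way to see what an issue-resolution task looks like is to load it and print one instance. The sketch below assumes the Hugging Face `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset name; the field names shown (`instance_id`, `repo`, `problem_statement`, `FAIL_TO_PASS`) follow the published SWE-bench schema and should be checked against OpenAI's own data downloads.

```python
# Minimal sketch: load SWE-bench Verified and inspect one task instance.
# Assumes the Hugging Face `datasets` package and the public dataset name
# `princeton-nlp/SWE-bench_Verified`; field names follow the published
# SWE-bench schema and may differ from the official data downloads.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(f"{len(verified)} human-validated tasks")

task = verified[0]
print(task["instance_id"])              # unique task identifier
print(task["repo"])                     # source GitHub repository
print(task["problem_statement"][:500])  # the issue text given to the agent
print(task["FAIL_TO_PASS"])             # tests a correct patch must make pass
```

Each instance pairs a repository snapshot and issue description with the tests used to judge whether a candidate patch actually resolves the issue.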
Review
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| Agentic AI | Capability | 68.0 |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Tool Use and Computer Use | Capability | 67.0 |
| Minimal Scaffolding | Capability | 52.0 |
Cached Content Preview
Introducing SWE-bench Verified | OpenAI
The Wayback Machine - http://web.archive.org/web/20260407064632/https://openai.com/index/introducing-swe-bench-verified/
Table of contents
Background on SWE-bench
Adapting SWE-bench as a Preparedness Evaluation
SWE-bench Verified
Our Approach
Annotation Results
Performance on SWE-bench Verified
Discussion & Limitations
Data downloads
August 13, 2024
Milestone
Introducing SWE-bench Verified
We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.
Download SWE-bench Verified
Updated February 24, 2025
As part of our Preparedness Framework, OpenAI develops a range of metrics to track, evaluate, and forecast models’ abilities to act autonomously. The ability to autonomously complete software engineering tasks is a key component of our Medium risk level in the Model Autonomy risk category. Evaluating these capabilities is challenging due to the complexity of software engineering tasks, the difficulty of accurately assessing generated code, and the challenge of simulating real-world development scenarios. Therefore, our approach to Preparedness must also involve careful examination of evaluations themselves, to reduce the potential for underestimating or overestimating performance in important risk categories.
One of the most popular evaluation suites for software engineering is SWE-bench1, a benchmark for evaluating large language models’ (LLMs’) abilities to solve real-world software issues sourced from GitHub. The benchmark involves giving agents a code repository and issue description, and challenging them to generate a patch that resolves the problem described by the issue. Coding agents have made impressive progress on SWE-bench, with the top-scoring agents reaching 20% on SWE-bench and 43% on SWE-bench Lite according to the SWE-bench leaderboard as of August 5, 2024.
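Concretely, the resolution criterion can be sketched as follows. This is not the official SWE-bench harness, which prepares an isolated per-repository environment; `repo_dir`, `patch_file`, and `fail_to_pass` below are hypothetical placeholders for a checked-out repository, a model-generated patch, and the task's fail-to-pass test identifiers.

```python
# Rough sketch of the resolution check, not the official SWE-bench harness
# (the real harness evaluates each repository in its own isolated environment).
# `repo_dir`, `patch_file`, and `fail_to_pass` are hypothetical placeholders.
import subprocess

def resolves_issue(repo_dir: str, base_commit: str, patch_file: str,
                   fail_to_pass: list[str]) -> bool:
    """Apply a candidate patch at the task's base commit and run the
    tests that a correct fix must turn from failing to passing."""
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass],
                            cwd=repo_dir)
    return result.returncode == 0  # resolved only if every target test passes
```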
Our testing identified some SWE-bench tasks which may be hard or impossible to solve, leading to SWE-bench systematically underestimating models’ aut
... (truncated, 23 KB total)