SWE-bench Official Leaderboards
swebench.com
SWE-bench is a key industry benchmark for tracking AI coding agent capabilities; useful for understanding the pace of progress in autonomous software engineering, which has implications for AI-assisted research and recursive self-improvement risks.
Metadata
Importance: 62/100 · tool page · tool
Summary
SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python repositories. It offers multiple dataset variants (Lite, Verified, Multimodal) and standardized metrics to compare coding agents. It has become a widely-used standard for assessing the practical software engineering capabilities of LLM-based agents.
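A minimal sketch of loading these dataset variants for inspection, assuming they are published on the Hugging Face Hub under the `princeton-nlp` organization (the exact repository and field names below are assumptions, not confirmed by this page):

```python
# Minimal sketch: loading SWE-bench dataset variants for inspection.
# Assumes the variants are hosted on the Hugging Face Hub under the
# "princeton-nlp" organization; repository and field names are assumptions.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")          # ~300 tasks
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")  # 500 human-validated tasks

# Each instance pairs a GitHub issue (problem statement) with a repository
# snapshot and the tests a candidate patch must satisfy.
task = lite[0]
print(task["instance_id"])
print(task["problem_statement"][:200])
```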
Key Points
- Evaluates AI models on real GitHub issues requiring code changes, testing whether agents can autonomously resolve software bugs and feature requests.
- Offers multiple dataset variants, including SWE-bench Lite (300 tasks), Verified (human-validated), and Multimodal, for broader coverage.
- Serves as the de facto leaderboard for comparing frontier AI coding agents, tracking rapid capability progress in autonomous software engineering.
- Uses the % Resolved rate (single-attempt pass@1) as its primary metric, measuring the percentage of issues fully resolved by the agent without human intervention (see the sketch after this list).
- Relevant to AI safety, as high performance signals growing agentic autonomy and tool-use capability with real-world consequences.
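The following sketch shows how the % Resolved figure reduces to a simple ratio over per-instance outcomes. It is an illustration, not the official SWE-bench evaluation harness, and the instance IDs are hypothetical examples:

```python
# Minimal sketch of the "% Resolved" leaderboard metric: an instance counts
# as resolved when the agent's patch makes the issue's failing tests pass
# without breaking the previously passing tests. Illustration only; not the
# official evaluation harness.
def percent_resolved(results: dict[str, bool]) -> float:
    """results maps instance_id -> whether the instance was resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)

# Hypothetical per-instance outcomes for illustration only.
outcomes = {
    "astropy__astropy-12907": True,
    "django__django-11099": False,
    "sympy__sympy-20590": True,
}
print(f"{percent_resolved(outcomes):.1f}% resolved")  # 66.7% resolved
```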
Review
SWE-bench is a benchmarking framework designed to rigorously evaluate AI models on software engineering tasks. By offering multiple variants such as Bash Only, Verified, Lite, and Multimodal, the platform provides nuanced insight into AI agents' problem-solving abilities across different contexts and constraints. The benchmark's significance lies in its systematic approach to measuring performance, using a % Resolved metric across dataset sizes ranging from 300 to 2,294 instances. The project's collaborative nature, with support from organizations including OpenAI, AWS, Anthropic, and Open Philanthropy, underscores its importance in advancing AI software development capabilities. Ongoing development, including the recent CodeClash and SWE-smith announcements, points to a dynamic and rapidly evolving evaluation ecosystem for AI coding agents.
Cited by 6 pages
| Page | Type | Quality |
|---|---|---|
| Agentic AI | Capability | 68.0 |
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Tool Use and Computer Use | Capability | 67.0 |
| Heavy Scaffolding / Agentic Systems | Concept | 57.0 |
| Minimal Scaffolding | Capability | 52.0 |
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 24, 2026 · 3 KB
SWE-bench Leaderboards: Verified, Multilingual, Lite, Full, and Multimodal.

- SWE-bench Verified is a human-filtered subset of 500 instances; the Agent dropdown allows comparing LMs with mini-SWE-agent or viewing all agents.
- SWE-bench Multilingual features 300 tasks across 9 programming languages.
- SWE-bench Lite is a subset curated for less costly evaluation.
- SWE-bench Multimodal features issues with visual elements.

Each entry reports the % Resolved metric, the percentage of instances solved (out of 2294 Full, 500 Verified, 300 Lite and Multilingual, 517 Multimodal).

News
- [11/2025] Introducing CodeClash, a new eval of LMs as goal- (not task-) oriented developers.
- [07/2025] mini-SWE-agent scores 65% on SWE-bench Verified in 100 lines of Python code.
- [05/2025] SWE-smith is out: train your own models for software engineering agents.
- [03/2025] SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite.
- [10/2024] Introducing SWE-bench Multimodal.
- [08/2024] SWE-bench x OpenAI = SWE-bench Verified.
- [06/2024] Docker-ized SWE-bench for easier evaluation.
- [03/2024] Check out SWE-agent (12.47% on SWE-bench).
- [03/2024] Released SWE-bench Lite.

Acknowledgements: Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
Resource ID: 433a37bad4e66a78 | Stable ID: sid_oCT67b0fy6