Longterm Wiki

SWE-bench Official Leaderboards

web
swebench.com/

SWE-bench is a key industry benchmark for tracking AI coding agent capabilities; useful for understanding the pace of progress in autonomous software engineering, which has implications for AI-assisted research and recursive self-improvement risks.

Metadata

Importance: 62/100 · tool page · tool

Summary

SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python repositories. It offers multiple dataset variants (Lite, Verified, Multimodal) and standardized metrics to compare coding agents. It has become a widely used standard for assessing the practical software engineering capabilities of LLM-based agents.

Key Points

  • Evaluates AI models on real GitHub issues requiring code changes, testing whether agents can autonomously resolve software bugs and feature requests (a task-loading sketch follows this list).
  • Offers multiple dataset variants including SWE-bench Lite (300 tasks), Verified (human-validated), and Multimodal versions for broader coverage.
  • Serves as the de facto leaderboard for comparing frontier AI coding agents, tracking rapid capability progress in autonomous software engineering.
  • Uses pass@1 resolution rate as primary metric, measuring the percentage of issues fully resolved by the agent without human intervention.
  • Relevant to AI safety as high performance signals growing agentic autonomy and tool-use capabilities with real-world consequences.
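
For a concrete sense of what a task instance looks like, the datasets are published on the Hugging Face Hub. The following minimal sketch (assuming the `datasets` library, the `princeton-nlp/SWE-bench_Verified` dataset ID, and field names as commonly documented for SWE-bench releases) loads the Verified split and prints a few fields:

```python
# Minimal sketch: inspect a SWE-bench Verified task instance.
# Assumes the Hugging Face `datasets` library and the publicly hosted
# "princeton-nlp/SWE-bench_Verified" dataset; field names may differ
# between dataset versions.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))  # 500 human-validated instances

task = verified[0]
print(task["repo"])               # open-source Python repository the issue comes from
print(task["instance_id"])        # ID used to key an agent's predicted patch
print(task["problem_statement"])  # the GitHub issue text the agent must resolve
# The instance's FAIL_TO_PASS / PASS_TO_PASS test lists determine whether a
# generated patch counts as resolved when run in the evaluation harness.
```

A run counts an instance as resolved only if the agent's patch makes the issue's failing tests pass without breaking the tests that already passed.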

Review

SWE-bench is a rigorous benchmarking framework for evaluating AI models on software engineering tasks. By offering multiple leaderboard variants (Bash Only, Verified, Lite, and Multimodal), the platform gives nuanced insight into AI agents' problem-solving abilities across different contexts and constraints. Its significance lies in its systematic approach to measuring performance: a single % Resolved metric applied across datasets ranging from 300 to 2,294 instances. Backing from major institutions such as OpenAI, AWS, and Anthropic underscores its importance to AI software development, and ongoing work, including the recent CodeClash and SWE-smith announcements, points to a dynamic, rapidly evolving evaluation ecosystem for AI coding agents.

Cited by 6 pages

Cached Content Preview

HTTP 200 · Fetched Apr 24, 2026 · 3 KB
SWE-bench Leaderboards

[Interactive leaderboard interface: tabs for the Verified, Multilingual, Lite, Full, and Multimodal splits; filters by agent (mini-SWE-agent versions, all OSS agents, all agents), model (all / open source only / proprietary only), open scaffold, and tags; chart views of resolution rate by repository, language, cost, model release date, and step count; and controls for comparing selected models.]
 SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs with mini-SWE-agent or view all agents [ Post ].

 SWE-bench Multilingual features 300 tasks across 9 programming languages [ Post ].

 SWE-bench Lite is a subset curated for less costly evaluation [ Post ].

 SWE-bench Multimodal features issues with visual elements [ Post ].

 

 Each entry reports the % Resolved metric, the percentage of instances solved (out of 2294 Full, 500 Verified, 300 Lite & Multilingual, 517 Multimodal).
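
As a back-of-the-envelope illustration of that metric (the split sizes are from the description above; the resolved count is hypothetical, not a real leaderboard entry), the score is simply the number of resolved instances divided by the split size:

```python
# % Resolved = resolved instances / total instances in the split * 100.
# Split sizes are taken from the leaderboard description; the resolved
# count below is hypothetical, for illustration only.
SPLIT_SIZES = {
    "Full": 2294,
    "Verified": 500,
    "Lite": 300,
    "Multilingual": 300,
    "Multimodal": 517,
}

def percent_resolved(split: str, num_resolved: int) -> float:
    """Return the % Resolved score for a run on the given split."""
    return 100.0 * num_resolved / SPLIT_SIZES[split]

# A hypothetical run resolving 325 of the 500 Verified instances:
print(f"{percent_resolved('Verified', 325):.1f}% resolved")  # -> 65.0% resolved
```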

 
 
 
 
 
 

 
 
 News

 
 [11/2025] Introducing CodeClash, our new eval of LMs as goal (not task) oriented developers! [ Link ]

 [07/2025] mini-SWE-agent scores 65% on SWE-bench Verified in 100 lines of Python code. [ Link ]

 [05/2025] SWE-smith is out! Train your own models for software engineering agents. [ Link ]

 [03/2025] SWE-agent 1.0 is the open source SOTA on SWE-bench Lite! [ Link ]

 [10/2024] Introducing SWE-bench Multimodal ! [ Link ]

 [08/2024] SWE-bench x OpenAI = SWE-bench Verified [ Report ]

 [06/2024] Dockerized SWE-bench for easier evaluation [ Report ]

 [03/2024] Check out SWE-agent (12.47% on SWE-bench) [ Link ]

 [03/2024] Released SWE-bench Lite [ Report ]

 
 
 

 
 
 Acknowledgements

 
 We thank the following institutions for their generous support:
 Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
Resource ID: 433a37bad4e66a78 | Stable ID: sid_oCT67b0fy6