SWE-bench Verified

Coding

A curated, human-verified subset of SWE-bench task instances for evaluating AI systems on real-world software engineering tasks drawn from GitHub issues.
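SWE-bench scores an instance as resolved only when the model's patch makes the issue's originally failing tests pass without breaking tests that already passed. A minimal sketch of that rule (the function and field names here are illustrative, not the official harness API):

```python
def is_resolved(fail_to_pass, pass_to_pass):
    """An instance is resolved iff every FAIL_TO_PASS test now passes
    and every PASS_TO_PASS test still passes.
    Each argument maps a test id to True (passed) or False (failed)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

def score(instances):
    """Benchmark score: percentage of instances resolved."""
    resolved = sum(
        is_resolved(inst["fail_to_pass"], inst["pass_to_pass"])
        for inst in instances
    )
    return 100.0 * resolved / len(instances)
```

The two-sided check matters: a patch that fixes the reported bug but regresses existing behavior does not count as resolved.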

Models Tested: 12
Best Score: 80.9%
Median Score: 68.6%
Scoring: percentage
Introduced: 2024-08
Maintainer: OpenAI / Princeton NLP

Leaderboard (12 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | Claude Opus 4.5 | Anthropic | 80.9% |
| 🥈 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| 🥉 | Claude Sonnet 4.5 | Anthropic | 77.2% |
| 4 | Claude Sonnet 4 | Anthropic | 72.7% |
| 5 | Claude Opus 4 | Anthropic | 72.5% |
| 6 | o3 | OpenAI | 69.1% |
| 7 | o4-mini | OpenAI | 68.1% |
| 8 | Gemini 2.5 Pro | Google DeepMind | 63.8% |
| 9 | GPT-4.1 | OpenAI | 54.6% |
| 10 | o3-mini | OpenAI | 49.3% |
| 11 | DeepSeek R1 | DeepSeek | 49.2% |
| 12 | o1 | OpenAI | 48.9% |
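The summary statistics above follow directly from the leaderboard: with an even number of entries, the median is the mean of the two middle scores (here the 6th and 7th). A quick check:

```python
import statistics

# Scores from the leaderboard above, in rank order.
scores = [80.9, 79.6, 77.2, 72.7, 72.5, 69.1,
          68.1, 63.8, 54.6, 49.3, 49.2, 48.9]

best = max(scores)                  # 80.9 (Best Score)
median = statistics.median(scores)  # mean of 69.1 and 68.1 -> 68.6 (Median Score)
```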