
SWE-bench Verified

Coding

A curated subset of SWE-bench with human-verified task instances for evaluating AI systems on real-world software engineering tasks from GitHub issues.
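A model's score is the percentage of task instances it resolves, where an instance counts as resolved when the model's patch makes the issue's failing tests pass without breaking the existing ones. A minimal sketch of that aggregation, using illustrative instance IDs and made-up pass/fail results rather than real evaluation output:

```python
# Hedged sketch: scoring is the fraction of instances resolved.
# The instance IDs and boolean outcomes below are illustrative only.

def resolve_rate(results: dict[str, bool]) -> float:
    """Percentage of instances whose patch resolved the issue."""
    return 100.0 * sum(results.values()) / len(results)

results = {
    "django__django-11099": True,    # patch applied, tests pass
    "sympy__sympy-13480": False,     # tests still fail
    "astropy__astropy-12907": True,
}
print(f"{resolve_rate(results):.1f}%")  # 66.7%
```

On the real benchmark the denominator is the full set of verified instances, so a single resolved task moves the headline score by only a fraction of a percentage point.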

Models tested: 21
Best score: 80.9%
Median score: 68.1%
Scoring: percentage
Introduced: 2024-08
Maintainer: OpenAI / Princeton NLP

Leaderboard (21 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 1 | Claude Opus 4.5 | Anthropic | 80.9% |
| 2 | Claude Opus 4.6 | Anthropic | 80.8% |
| 3 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| 4 | Claude Sonnet 4.5 | Anthropic | 77.2% |
| 5 | Claude Opus 4.1 | Anthropic | 74.5% |
| 6 | Claude Haiku 4.5 | Anthropic | 73.3% |
| 7 | Claude Sonnet 4 | Anthropic | 72.7% |
| 8 | Claude Opus 4 | Anthropic | 72.5% |
| 9 | Claude 3.7 Sonnet | Anthropic | 70.3% |
| 10 | o3 | OpenAI | 69.1% |
| 11 | o4-mini | OpenAI | 68.1% |
| 12 | Gemini 2.5 Pro | Google DeepMind | 63.8% |
| 13 | Gemini 2.5 Flash | Google DeepMind | 60.4% |
| 14 | GPT-4.1 | OpenAI | 54.6% |
| 15 | Grok-3 | xAI | 53.2% |
| 16 | o3-mini | OpenAI | 49.3% |
| 17 | DeepSeek R1 | DeepSeek | 49.2% |
| 18 | Claude 3.5 Sonnet | Anthropic | 49% |
| 19 | o1 | OpenAI | 48.9% |
| 20 | DeepSeek V3 | DeepSeek | 42% |
| 21 | Claude 3.5 Haiku | Anthropic | 40.6% |
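The summary statistics at the top of the page follow directly from the leaderboard: the best score is the maximum entry, and the median of 21 scores is the 11th value when sorted. A minimal check (treating the 49% and 42% entries as 49.0 and 42.0):

```python
from statistics import median

# Scores (%) copied from the 21-model leaderboard above.
scores = [80.9, 80.8, 79.6, 77.2, 74.5, 73.3, 72.7, 72.5,
          70.3, 69.1, 68.1, 63.8, 60.4, 54.6, 53.2, 49.3,
          49.2, 49.0, 48.9, 42.0, 40.6]

best = max(scores)    # 80.9 (Claude Opus 4.5)
med = median(scores)  # 68.1 (the 11th of 21 sorted values: o4-mini)
print(f"Models: {len(scores)}, Best: {best}%, Median: {med}%")
```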