HumanEval
Category: Coding
A benchmark of 164 hand-written Python programming problems, each with unit tests, evaluating code generation from docstrings.
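Each task gives the model a function signature and docstring; the generated body is then run against the task's unit tests. The sketch below shows the task format (the `running_max` problem and its `check` function are illustrative, not a verbatim benchmark item):

```python
# Illustrative task in the HumanEval style: the model sees only the
# signature and docstring and must generate the function body.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1].
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    result: list[int] = []
    current: int | None = None
    for x in numbers:
        current = x if current is None else max(current, x)
        result.append(current)
    return result

# Each task ships its unit tests as a check() function; a completion
# passes only if every assertion holds.
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([7]) == [7]
    assert candidate([]) == []

check(running_max)
```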
Models tested: 10
Best score: 92%
Median score: 86.05%
Scoring: pass@1
Introduced: 2021-07
Maintainer: OpenAI
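Scores are pass@1: the fraction of the 164 problems solved by a single sampled completion. The HumanEval paper (Chen et al., 2021) gives an unbiased estimator for the more general pass@k when n samples are drawn per problem and c of them pass; a minimal stdlib-only sketch (the function name `pass_at_k` is ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), the probability that at least one of
    k samples drawn from n (of which c pass) is correct."""
    if n - c < k:  # every size-k draw must contain a passing sample
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# pass@1 reduces to the empirical pass rate c/n:
assert abs(pass_at_k(10, 4, 1) - 0.4) < 1e-12
```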
Leaderboard (10 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude 3.5 Sonnet | Anthropic | 92% |
| 🥈 | Mistral Large 2 | Mistral AI | 92% |
| 🥉 | GPT-4o | OpenAI | 90.2% |
| 4 | Llama 3.1 | Meta AI | 89% |
| 5 | GPT-4o mini | OpenAI | 87.2% |
| 6 | Claude 3 Opus | Anthropic | 84.9% |
| 7 | DeepSeek V3 | DeepSeek | 82.6% |
| 8 | Llama 3 | Meta AI | 81.7% |
| 9 | Gemini 1.0 Ultra | Google DeepMind | 74.4% |
| 10 | GPT-4 | OpenAI | 67% |