Longterm Wiki

HumanEval

Coding

A benchmark of 164 hand-written Python programming problems, each with unit tests, that evaluates code generation from function docstrings.
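Each problem pairs a function signature and docstring with hidden unit tests; the model must complete the function body, and the tests judge functional correctness. A minimal sketch of the format (modeled on the style of the suite's problems, not copied verbatim from it):

```python
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer
    to each other than the given threshold."""
    # Candidate completion generated by the model:
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests exercise the candidate for functional correctness:
def check(candidate):
    assert candidate([1.0, 2.0, 3.0], 0.5) is False
    assert candidate([1.0, 2.8, 3.0], 0.3) is True

check(has_close_elements)
```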

Models Tested: 10
Best Score: 92%
Median Score: 86.05%
Scoring: pass@1
Introduced: 2021-07
Maintainer: OpenAI
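pass@1 is the probability that a single generated sample passes all of a problem's unit tests, averaged over problems. The HumanEval paper gives an unbiased estimator for pass@k when n samples are drawn per problem and c of them pass, sketched here:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n = samples per problem,
    c = correct samples, k = evaluation budget.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than the budget: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1, this reduces to the fraction of samples that pass:
print(pass_at_k(10, 5, 1))  # 0.5
```

Averaging this quantity across all 164 problems yields the benchmark score.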

Leaderboard (10 models)

| # | Model | Developer | Score |
|---|-------|-----------|-------|
| 🥇 | Claude 3.5 Sonnet | Anthropic | 92% |
| 🥈 | Mistral Large 2 | Mistral AI | 92% |
| 🥉 | GPT-4o | OpenAI | 90.2% |
| 4 | Llama 3.1 | Meta AI (FAIR) | 89% |
| 5 | GPT-4o mini | OpenAI | 87.2% |
| 6 | Claude 3 Opus | Anthropic | 84.9% |
| 7 | DeepSeek V3 | DeepSeek | 82.6% |
| 8 | Llama 3 | Meta AI (FAIR) | 81.7% |
| 9 | Gemini 1.0 Ultra | Google DeepMind | 74.4% |
| 10 | GPT-4 | OpenAI | 67% |
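The summary statistics above follow directly from the score column; a quick check (the median is 86.05 up to floating-point rounding):

```python
from statistics import median

# Scores from the leaderboard, best to worst:
scores = [92, 92, 90.2, 89, 87.2, 84.9, 82.6, 81.7, 74.4, 67]

best = max(scores)    # 92
mid = median(scores)  # mean of the two middle scores: (84.9 + 87.2) / 2

print(best, round(mid, 2))  # 92 86.05
```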