benchmark
HumanEval
Metadata
| Source Table | benchmarks |
| Source ID | vxX2rorgxU |
| Description | A benchmark of 164 hand-written Python programming problems with unit tests, evaluating code generation from docstrings. |
| Wiki ID | humaneval |
| Children | — |
| Created | Mar 14, 2026, 12:43 AM |
| Updated | Mar 24, 2026, 11:24 PM |
| Synced | Mar 24, 2026, 11:24 PM |
Record Data
| id | vxX2rorgxU |
| slug | humaneval |
| name | HumanEval |
| category | coding |
| description | A benchmark of 164 hand-written Python programming problems with unit tests, evaluating code generation from docstrings. |
| website | — |
| scoringMethod | pass_at_1 |
| higherIsBetter | Yes |
| introducedDate | 2021-07 |
| maintainer | OpenAI |
| source | arxiv.org/abs/2107.03374 |
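The `pass_at_1` scoring method above is the k=1 case of the unbiased pass@k estimator defined in the linked paper (arXiv:2107.03374): generate n samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch of that estimator (the function name and example numbers here are illustrative, not from the record):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (arXiv:2107.03374).

    n: total samples generated for a problem
    c: samples that pass all of the problem's unit tests
    k: sample budget being scored
    """
    if n - c < k:
        # Too few incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 - C(n-c, k) / C(n, k): probability a random size-k subset
    # contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the plain pass rate c / n.
print(pass_at_k(10, 3, 1))
```

The benchmark-wide score is this quantity averaged over the 164 problems; with `higherIsBetter = Yes`, larger averages indicate stronger code generation.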