HumanEval

Category: Coding

A benchmark of 164 hand-written Python programming problems, each supplying a function signature, a docstring, and unit tests; models are evaluated on generating a function body from the docstring that passes the tests.

Models tested: 0
Scoring: pass@1 (see the estimator sketch below)
Introduced: 2021-07
Maintainer: OpenAI
No model scores recorded for this benchmark yet.