Longterm Wiki

HumanEval: Hand-Written Evaluation Set for Code Generation

web

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: GitHub

HumanEval is widely used to benchmark code generation capabilities of LLMs; relevant to AI safety discussions around capability measurement, evaluation robustness, and tracking AI progress in software engineering tasks.

Metadata

Importance: 62/100 · Type: dataset

Summary

HumanEval is OpenAI's open-source benchmark dataset for evaluating the functional correctness of code generated by language models. It consists of 164 hand-crafted Python programming problems with unit tests, used to measure how well AI systems can synthesize code from docstrings. It was introduced alongside the Codex paper and has become a standard benchmark in the field.

Key Points

  • Contains 164 original Python programming problems with function signatures, docstrings, and unit tests for automated evaluation
  • Measures functional correctness of generated code using a pass@k metric rather than syntactic similarity (see the estimator sketch after this list)
  • Introduced with OpenAI's Codex model and has become an industry-standard benchmark for code generation capability
  • Open-source release enables reproducible comparisons across different code-generating AI models
  • Represents a capability evaluation tool relevant to tracking AI progress in code synthesis tasks
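
The pass@k metric reports the probability that at least one of k sampled completions for a problem passes its unit tests. Below is a minimal sketch of the unbiased per-problem estimator described in the Codex paper, where n completions are drawn and c of them pass; the function name and NumPy usage here are illustrative, not the repository's API.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased, numerically stable estimate of 1 - C(n - c, k) / C(n, k):
    # the probability that a random size-k subset of the n samples
    # contains at least one of the c passing samples.
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one task, 37 of which pass the tests.
print(pass_at_k(200, 37, 1))    # 0.185 (equals c/n when k=1)
print(pass_at_k(200, 37, 10))   # higher, since any of 10 tries may pass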

Cited by 1 page

Page | Type | Quality
Minimal Scaffolding | Capability | 52.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 7 KB
GitHub - openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"
Repository files (default branch: master, 7 commits): data/, human_eval/, LICENSE, README.md, requirements.txt, setup.py

 HumanEval: Hand-Written Evaluation Set

 
 This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code".

 Installation

 
 Make sure to use python 3.7 or later:

 $ conda create -n codex python=3.7
$ conda activate codex
 
 Check out and install this repository:

 $ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
 
 Usage

 
 This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

 After following the above instructions to enable execution, generate samples
and save them in the following JSON Lines (jsonl) format, where each sample is
formatted into a single line like so:

 {"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
 
 We provide example_problem.jsonl and example_solutions.jsonl under data 
to illustrate the format and help with debugging.

 Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.

 from human_eval.data import write_jsonl, read_problems

 # Load the 164 HumanEval problems, keyed by task_id.
 problems = read_problems()

 # Draw multiple completions per task so that pass@k can be estimated.
 num_samples_per_task = 200
 samples = [
     dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
     for task_id in problems
     for _ in range(num_samples_per_task)
 ]
 # Write one JSON object per line to samples.jsonl for the evaluation harness.
 write_jsonl("samples.jsonl", samples)
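
 The generate_one_completion function is the piece you supply. As an assumed example only (the model choice and the transformers dependency below are not part of this repository), it could wrap a locally hosted causal language model:

 from transformers import AutoModelForCausalLM, AutoTokenizer  # assumed extra dependency

 tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model for illustration
 model = AutoModelForCausalLM.from_pretrained("gpt2")

 def generate_one_completion(prompt: str) -> str:
     # Sample a continuation of the prompt and return only the new text,
     # since the harness expects the completion without the prompt.
     inputs = tokenizer(prompt, return_tensors="pt")
     outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
     text = tokenizer.decode(outputs[0], skip_special_tokens=True)
     # Assumes the decoded text begins with the prompt verbatim; a fuller
     # implementation would also truncate at stop sequences such as "\ndef".
     return text[len(prompt):]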
 
 

... (truncated, 7 KB total)
Resource ID: 9edbbd4ae30cd1f8 | Stable ID: sid_sttG9Qpsti