HumanEval: Hand-Written Evaluation Set for Code Generation
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: GitHub
HumanEval is widely used to benchmark code generation capabilities of LLMs; relevant to AI safety discussions around capability measurement, evaluation robustness, and tracking AI progress in software engineering tasks.
Metadata
Summary
HumanEval is OpenAI's open-source benchmark dataset for evaluating the functional correctness of code generated by language models. It consists of 164 hand-crafted Python programming problems with unit tests, used to measure how well AI systems can synthesize code from docstrings. It was introduced alongside the Codex paper and has become a standard benchmark in the field.
Key Points
- Contains 164 original Python programming problems with function signatures, docstrings, and unit tests for automated evaluation
- Measures functional correctness of generated code using the pass@k metric rather than syntactic similarity (see the sketch after this list)
- Introduced with OpenAI's Codex model and has become an industry-standard benchmark for code generation capability
- Open-source release enables reproducible comparisons across different code-generating AI models
- Represents a capability evaluation tool relevant to tracking AI progress in code synthesis tasks
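The pass@k metric is normally computed with the unbiased estimator introduced in the Codex paper rather than by literally drawing k samples. A minimal sketch of that estimator, with illustrative numbers only in the usage lines:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    1 - C(n - c, k) / C(n, k), where n samples were generated for a task
    and c of them passed the task's unit tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples for one task, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # roughly 0.88; rises toward 1 as k grows
```

The benchmark-level score is the average of this per-task estimate over all 164 problems.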
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Minimal Scaffolding | Capability | 52.0 |
Cached Content Preview
GitHub - openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"
HumanEval: Hand-Written Evaluation Set
This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code".
Installation
Make sure to use Python 3.7 or later:

```
$ conda create -n codex python=3.7
$ conda activate codex
```
Check out and install this repository:

```
$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
```
Usage
This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
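For intuition about what the harness executes once the disclaimer has been acknowledged, here is a minimal sketch, not the repository's execution.py and offering none of the isolation a real sandbox provides, that runs one completion against a problem's unit tests in a subprocess with a timeout. The field names ("prompt", "test", "entry_point") follow the HumanEval problem format; everything else is illustrative only.

```python
# Illustrative sketch only; not the repository's execution logic and not a sandbox.
import subprocess
import sys
import tempfile


def run_candidate(problem: dict, completion: str, timeout: float = 3.0) -> bool:
    """Return True if the completed function passes the problem's unit tests."""
    program = (
        problem["prompt"]        # function signature + docstring
        + completion             # model-generated body (without the prompt)
        + "\n"
        + problem["test"]        # unit tests defining check(candidate)
        + f"\ncheck({problem['entry_point']})\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], timeout=timeout,
                              capture_output=True)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```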
After following the above instructions to enable execution, generate samples and save them in JSON Lines (jsonl) format, where each sample is formatted into a single line like so:

```
{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
```
We provide example_problem.jsonl and example_solutions.jsonl under data
to illustrate the format and help with debugging.
Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.
```python
from human_eval.data import write_jsonl, read_problems

# Load the HumanEval problems, keyed by task_id.
problems = read_problems()

# Generate multiple completions per task so pass@k can be estimated.
num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
```
... (truncated, 7 KB total)