GitHub - openai/prm800k: 800,000 step-level correctness labels on LLM solutions to MATH problems · GitHub
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: GitHub
This dataset is the empirical foundation for OpenAI's work on process supervision vs. outcome supervision; closely related to debates about scalable oversight, recursive reward modeling, and catching reasoning errors in capable AI systems.
Metadata
Summary
PRM800K is a dataset released by OpenAI containing 800,000 step-level human correctness labels on large language model solutions to MATH competition problems. It supports training and evaluating process reward models (PRMs), which provide feedback on individual reasoning steps rather than final answers. This dataset underpins research into process supervision as a method for improving LLM reasoning reliability and safety.
Key Points
- Contains 800,000 step-level correctness labels on LLM-generated solutions to MATH benchmark problems, enabling granular supervision of reasoning chains.
- Supports training Process Reward Models (PRMs) that score each step of a solution, rather than only the final outcome.
- Process supervision is proposed as a safer and more effective alternative to outcome supervision for catching subtle reasoning errors.
- Released alongside OpenAI's research showing PRMs outperform outcome reward models on competitive math problem-solving benchmarks.
- Relevant to AI safety as step-level feedback can help detect and reduce deceptive or flawed reasoning in LLMs.
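To make the step-level idea concrete, here is a minimal sketch of how per-step scores might be combined into a single solution score and used to pick among candidate solutions. It assumes the aggregation is the product of per-step correctness probabilities (the probability that every step is correct, treating steps as independent). The function names and all numbers are hypothetical and are not taken from the dataset.

```python
from math import prod
from typing import Sequence


def solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return prod(step_probs)


def best_of_n(candidates: dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate with the highest solution score."""
    return max(candidates, key=lambda name: solution_score(candidates[name]))


# Made-up per-step probabilities for two hypothetical solutions.
candidates = {
    "solution_a": [0.99, 0.98, 0.40, 0.97],  # one doubtful step
    "solution_b": [0.95, 0.94, 0.93, 0.92],  # uniformly plausible steps
}

if __name__ == "__main__":
    for name, probs in candidates.items():
        print(name, round(solution_score(probs), 4))
    print("selected:", best_of_n(candidates))
```

A single low-probability step pulls the whole product down, which is the sense in which step-level feedback can flag a flawed chain even when the other steps look fine.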
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Process Supervision | Approach | 65.0 |
| Scalable Oversight | Research Area | 68.0 |
Cached Content Preview
PRM800K: A Process Supervision Dataset
[Blog Post] [Paper]
This repository accompanies the paper Let's Verify Step by Step and presents the PRM800K dataset introduced there. PRM800K is a process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. More information on PRM800K and the project can be found in the paper.
We are releasing the raw labels as well as the instructions we gave labelers during phase 1 and phase 2 of the project. Example labels can be seen in the image below.
Data
The data/ folder contains our labels formatted as newline-delimited JSON, one JSON object per line. The data has been uploaded with Git LFS, which you'll need to install in order to properly clone the repository.
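A minimal sketch of reading one of these files might look like the following. It assumes the files on disk hold plain JSON on each line, without the explanatory comments used in this README's annotated example. The helper name `iter_samples` and the path in the usage comment are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_samples(path: Path) -> Iterator[dict]:
    """Yield one parsed solution sample per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report where parsing failed instead of a bare decode error.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err


# Usage (hypothetical path, substitute a real file from the data/ folder):
# for sample in iter_samples(Path("data/some_file.jsonl")):
#     print(sample["labeler"], sample["timestamp"])
```

If the repository was cloned without Git LFS, the files are small text pointers rather than JSON, so a `ValueError` on the first line is a hint that the data was never fetched.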
Each line represents one full solution sample and can contain many step-level labels. Here is one annotated line:
{
  // UUID representing a particular labeler.
  "labeler": "340d89bc-f5b7-45e9-b272-909ba68ee363",
  // The timestamp this trajectory was submitted.
  "timestamp": "2023-01-22T04:34:27.052924",
  // In phase 2, we split our data collection into generations, using our best
  // PRM so far to pick which solutions to score in the next generation.
  // In phase 1, this value should always be null.
  "generation": 9,
  // In each generation, we reserve some solutions for quality control. We serve
  // these solutions to every labeler, and check that they agree with our
  // gold labels.
  "is_quality_control_question": false,
  // generation -1 was reserved for a set of 30 questions we served every
  // labeler in order to screen for base task performance.
  "is_initial_screening_question": false,
  // Metadata
... (truncated, 12 KB total)
eccb4758de07641b | Stable ID: sid_xcnS5dcMHy
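The fields in the annotated line above are enough to separate screening and quality-control samples from ordinary labels. The sketch below uses only those fields. Treating every remaining non-null `generation` as phase 2 is an assumption drawn from the comments, and the function names and example records are hypothetical.

```python
from collections import Counter
from typing import Iterable


def sample_kind(sample: dict) -> str:
    """Classify a sample using only the fields shown in the annotated line."""
    if sample.get("is_initial_screening_question"):
        return "initial_screening"  # served to every labeler as a screen
    if sample.get("is_quality_control_question"):
        return "quality_control"  # checked against gold labels
    if sample.get("generation") is None:
        return "phase_1"  # generation is null in phase 1
    return "phase_2"  # assumption: any other generation value is phase 2


def count_kinds(samples: Iterable[dict]) -> Counter:
    """Count how many samples fall into each kind."""
    return Counter(sample_kind(s) for s in samples)


# Hypothetical records containing only the fields described above.
examples = [
    {"generation": 9, "is_quality_control_question": False,
     "is_initial_screening_question": False},
    {"generation": None, "is_quality_control_question": False,
     "is_initial_screening_question": False},
    {"generation": -1, "is_quality_control_question": False,
     "is_initial_screening_question": True},
    {"generation": 3, "is_quality_control_question": True,
     "is_initial_screening_question": False},
]

if __name__ == "__main__":
    print(count_kinds(examples))
```

Filtering on the two boolean flags first keeps samples that every labeler saw from being counted alongside the ordinary step-level labels.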