BIG-Bench evaluation suite
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: GitHub
BIG-Bench is widely cited in AI safety research for evaluating emergent and unpredictable capabilities in large language models, making it relevant to capability forecasting and AI risk assessment.
Metadata
Summary
BIG-Bench is a collaborative benchmark consisting of 204+ diverse tasks designed to probe large language model capabilities beyond existing benchmarks. It focuses on tasks believed to be difficult for current models, covering reasoning, knowledge, and common sense, and includes analysis of scaling behavior and emergent capabilities. More than 400 researchers across 130+ institutions contributed to the benchmark.
Key Points
- Contains 204+ tasks spanning diverse domains including language, mathematics, logic, social reasoning, and specialized knowledge
- Designed to identify tasks where LLM performance is unpredictable or emergent as model scale increases
- Collaborative open-source project with contributions from hundreds of researchers worldwide
- Includes the BIG-Bench Hard (BBH) subset of 23 tasks on which models underperform the average human rater
- Key resource for studying capability elicitation, scaling laws, and identifying frontier model limitations
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Emergent Capabilities | Risk | 61.0 |
Cached Content Preview
GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
google/BIG-bench (public repository): 3.2k stars, 617 forks, 5,893 commits
Top-level contents: .github/workflows, bigbench, bleurt, docs, notebooks, scripts, .gitattributes, .gitignore, .pre-commit-config.yaml, LICENSE, MANIFEST.in, README.md, keywords.md, requirements.txt, setup.cfg, setup.py
BIG-bench 🪑
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative
benchmark intended to probe large language models and extrapolate their future
capabilities.
The more than 200 tasks included in BIG-bench are summarized by keyword here, and by task name here. A paper introducing the benchmark, including evaluation results on large language models, is currently under review and is available as a preprint.
The benchmark organizers can be contacted at bigbench@googlegroups.com.
Table of contents
BIG-bench Lite leaderboard
Quick start
Installation
How do I create a task?
Creating a programmatic task
Submitting a model evaluation
Frequently asked questions
[Image: Alan Turing sitting on a bench]
For more details about the benchmark, see our detailed instructions.
BIG-bench Lite leaderboard
BIG-bench Lite (BBL) is a small subset of 24 diverse JSON tasks from BIG-bench.
It is designed to provide a canonical measure of model performance, while being far
cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench.
A leaderboard of current model performance on BBL is shown below.
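To make "JSON task" concrete: each JSON task is defined by a task.json file containing metadata and a list of input/target examples. Below is a minimal sketch written as a small Python script; the field names (name, description, keywords, metrics, examples) are assumed from the schema described in the repository docs, and the task content itself is purely hypothetical.

```python
# Minimal sketch of a BIG-bench JSON task definition.
# Schema assumed from the repo docs; the task content is hypothetical.
import json

task = {
    "name": "toy_arithmetic",  # hypothetical task name for illustration
    "description": "Answer simple arithmetic questions in English.",
    "keywords": ["arithmetic", "zero-shot"],
    "metrics": ["exact_str_match"],  # grade by exact match against "target"
    "examples": [
        {"input": "What is 2 plus 2?", "target": "4"},
        {"input": "What is 7 minus 3?", "target": "4"},
    ],
}

# Write the definition to task.json, the filename JSON tasks use in the repo.
with open("task.json", "w") as f:
    json.dump(task, f, indent=2)
```

Multiple-choice tasks replace the free-form target with a target_scores mapping from each candidate answer to its score, again per the documented schema.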
To add new model results to the full BIG-bench leaderboard, to the BBL leaderboard, and to individual task performance plots, open a PR which includes the score files generated when you evaluate your model.
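Generating those score files presupposes a model wrapper the evaluation harness can call. As a rough sketch of the shape of that interface, assuming the generate_text and cond_log_prob methods described in the repository's model API docs (signatures simplified here, and the EchoModel logic is purely illustrative):

```python
# Illustrative stand-in for the assumed BIG-bench model interface.
# A real submission would wrap an actual language model; this class just
# echoes prompts and assigns uniform probability to candidate targets.
import math
from typing import List, Union


class EchoModel:
    def generate_text(self, inputs: Union[str, List[str]]) -> Union[str, List[str]]:
        # Produce a text continuation for each prompt (here: echo the prompt).
        if isinstance(inputs, str):
            return inputs
        return list(inputs)

    def cond_log_prob(self, inputs: str, targets: List[str]) -> List[float]:
        # Return log P(target | input) for each candidate target
        # (here: a uniform distribution over the candidates).
        return [math.log(1.0 / len(targets))] * len(targets)
```

Generative tasks exercise the text-generation method and multiple-choice tasks the conditional log-probability method, so any model exposing both can, in principle, be scored.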
... (truncated, 16 KB total)