Longterm Wiki

BIG-Bench evaluation suite

web

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: GitHub

BIG-Bench is widely cited in AI safety research for evaluating emergent and unpredictable capabilities in large language models, making it relevant to capability forecasting and AI risk assessment.

Metadata

Importance: 72/100
Tags: tool page, dataset

Summary

BIG-Bench is a collaborative benchmark of 204+ diverse tasks designed to probe large language model capabilities beyond existing benchmarks. It focuses on tasks believed to be difficult for current models, covering reasoning, knowledge, and common sense, and includes analysis of scaling behavior and emergent capabilities. More than 400 researchers across 130+ institutions contributed to the benchmark.

Key Points

  • Contains 204+ tasks spanning diverse domains including language, mathematics, logic, social reasoning, and specialized knowledge
  • Designed to identify tasks where LLM performance is unpredictable or emergent as model scale increases
  • Collaborative open-source project with contributions from hundreds of researchers worldwide
  • Includes BIG-Bench Hard (BBH) subset of 23 tasks where models underperform average human raters
  • Key resource for studying capability elicitation, scaling laws, and identifying frontier model limitations
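BIG-bench JSON tasks pair inputs with target answers and are scored by metrics such as exact string match. The sketch below is a minimal, hypothetical scorer: the field names follow the documented JSON task schema, but the task content and the toy "model" are invented for illustration.

```python
import json

# Hypothetical minimal example of a BIG-bench-style JSON task.
# Field names ("examples", "input", "target", "metrics") follow the
# documented schema; the task content is invented for illustration.
task_json = """
{
  "name": "toy_arithmetic",
  "description": "Answer simple addition questions.",
  "keywords": ["arithmetic", "zero-shot"],
  "metrics": ["exact_str_match"],
  "examples": [
    {"input": "What is 2 + 2?", "target": "4"},
    {"input": "What is 3 + 5?", "target": "8"}
  ]
}
"""

def exact_match_score(task: dict, model) -> float:
    """Fraction of examples where the model's output equals the target."""
    examples = task["examples"]
    hits = sum(model(ex["input"]).strip() == ex["target"] for ex in examples)
    return hits / len(examples)

def toy_model(prompt: str) -> str:
    """Trivial stand-in model that answers the addition questions above."""
    nums = [int(tok) for tok in prompt.replace("?", "").split() if tok.isdigit()]
    return str(sum(nums))

task = json.loads(task_json)
print(exact_match_score(task, toy_model))  # → 1.0
```

Exact string match is only one of several metrics BIG-bench supports; multiple-choice tasks instead score log-likelihoods over candidate answers.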

Cited by 1 page

| Page | Type | Quality |
|---|---|---|
| Emergent Capabilities | Risk | 61.0 |

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 16 KB
GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models · GitHub 

google/BIG-bench (Public) · 3.2k stars · 617 forks
 main branch · 5,893 commits

 Repository files: .github/workflows/, bigbench/, bleurt/, docs/, notebooks/, scripts/, .gitattributes, .gitignore, .pre-commit-config.yaml, LICENSE, MANIFEST.in, README.md, keywords.md, requirements.txt, setup.cfg, setup.py

 BIG-bench 🪑

 
 The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative 
benchmark intended to probe large language models and extrapolate their future
capabilities.
The more than 200 tasks included in BIG-bench are summarized by keyword here, and by task name here. A paper introducing the benchmark, including evaluation results on large language models, is currently under review, and is available as a preprint.

 The benchmark organizers can be contacted at bigbench@googlegroups.com.

 Table of contents

  • BIG-bench Lite leaderboard
  • Quick start
  • Installation
  • How do I create a task?
  • Creating a programmatic task
  • Submitting a model evaluation
  • Frequently asked questions

 [Image: Alan Turing sitting on a bench]

 For more details about the benchmark, see our detailed instructions.

 BIG-bench Lite leaderboard

 
 BIG-bench Lite (BBL) is a small subset of 24 diverse JSON tasks from BIG-bench.
It is designed to provide a canonical measure of model performance, while being far
cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench.
A leaderboard of current model performance on BBL is shown below.
To add new model results to the full BIG-bench leaderboard, to the BBL leaderboard, and to individual task performance plots, open a PR which includes the score files generated when you e

... (truncated, 16 KB total)
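As a rough illustration of the kind of aggregation a leaderboard like BBL performs, the sketch below takes the equal-weight mean of per-task normalized scores. The task names echo real BIG-bench task names, but the score values and the equal-weight assumption are illustrative, not the leaderboard's actual methodology.

```python
# Hypothetical per-task normalized scores (0-100 scale) for one model
# on a handful of BBL-style tasks; the values are invented.
task_scores = {
    "bbq_lite_json": 55.0,
    "logical_deduction": 42.5,
    "strange_stories": 61.0,
    "conceptual_combinations": 38.5,
}

# Equal-weight mean across tasks, so a task the model fails on
# visibly drags down the headline number.
aggregate = sum(task_scores.values()) / len(task_scores)
print(round(aggregate, 2))  # → 49.25
```

Equal weighting keeps the aggregate cheap to interpret: each of BBL's 24 tasks contributes the same share regardless of how many examples it contains.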
Resource ID: cbf6b1d02f9255db | Stable ID: sid_CMsiD9Vsy0