benchmark

BBH

Metadata

Source Table	`benchmarks`
Source ID	`jp1Xu4jbIy`
Description	BIG-Bench Hard: a curated subset of 23 challenging BIG-Bench tasks on which prior language model evaluations did not outperform the average human rater. Tests multi-step reasoning.
Wiki ID	bbh
Children	—
Created	Mar 14, 2026, 12:43 AM
Updated	Mar 24, 2026, 11:24 PM
Synced	Mar 24, 2026, 11:24 PM

Record Data

`id`	jp1Xu4jbIy
`slug`	bbh
`name`	BBH
`category`	reasoning
`description`	BIG-Bench Hard: a curated subset of 23 challenging BIG-Bench tasks on which prior language model evaluations did not outperform the average human rater. Tests multi-step reasoning.
`website`	—
`scoringMethod`	accuracy
`higherIsBetter`	Yes
`introducedDate`	2022-10
`maintainer`	Google / Stanford
`source`	arxiv.org/abs/2210.09261
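
Scoring sketch

The record gives `scoringMethod` as accuracy with higher scores being better, but does not say how accuracy is computed or aggregated. The sketch below is one plausible reading, not the benchmark's official harness: it assumes case-insensitive exact match between a model's final answer and the target, and an unweighted mean over per-task accuracies. The function names (`normalize`, `per_task_accuracy`, `macro_average`) and the task labels `task_a` and `task_b` are hypothetical placeholders.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    # Assumption: answers are compared after trimming whitespace and lowercasing.
    return answer.strip().lower()


def per_task_accuracy(examples):
    """Return {task: accuracy} from (task, prediction, target) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in examples:
        total[task] += 1
        if normalize(prediction) == normalize(target):
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(task_scores):
    # Assumption: every task counts equally, regardless of its example count.
    return sum(task_scores.values()) / len(task_scores)


# Hypothetical toy data; real BBH task names and examples are not reproduced here.
examples = [
    ("task_a", "(A)", "(A)"),
    ("task_a", "(B)", "(C)"),
    ("task_b", "True", "true"),
]
scores = per_task_accuracy(examples)
print(scores)
print(macro_average(scores))
```

A macro average keeps a task with many examples from dominating the headline number. A micro average (total correct over total examples) is an equally reasonable choice if the consuming leaderboard defines the score that way.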
