Longterm Wiki
benchmark

BBH

Metadata

Source Table: benchmarks
Source ID: jp1Xu4jbIy
Description: BIG-Bench Hard: a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average human raters. Tests multi-step reasoning.
Wiki ID: bbh
Children:
Created: Mar 14, 2026, 12:43 AM
Updated: Mar 24, 2026, 11:24 PM
Synced: Mar 24, 2026, 11:24 PM

Record Data

id: jp1Xu4jbIy
slug: bbh
name: BBH
category: reasoning
description: BIG-Bench Hard: a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average human raters. Tests multi-step reasoning.
website:
scoringMethod: accuracy
higherIsBetter: Yes
introducedDate: 2022-10
maintainer: Google / Stanford
source: arxiv.org/abs/2210.09261