benchmark
BBH
Metadata
| Field | Value |
| --- | --- |
| Source Table | benchmarks |
| Source ID | jp1Xu4jbIy |
| Description | BIG-Bench Hard — a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average human raters. Tests multi-step reasoning. |
| Wiki ID | bbh |
| Children | — |
| Created | Mar 14, 2026, 12:43 AM |
| Updated | Mar 24, 2026, 11:24 PM |
| Synced | Mar 24, 2026, 11:24 PM |
Record Data
| Field | Value |
| --- | --- |
| id | jp1Xu4jbIy |
| slug | bbh |
| name | BBH |
| category | reasoning |
| description | BIG-Bench Hard — a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average human raters. Tests multi-step reasoning. |
| website | — |
| scoringMethod | accuracy |
| higherIsBetter | Yes |
| introducedDate | 2022-10 |
| maintainer | Google / Stanford |
| source | arxiv.org/abs/2210.09261 |
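The record's scoringMethod is accuracy with higherIsBetter set. A minimal sketch of how such a score is typically computed for BBH-style tasks, assuming exact-match comparison of the model's final answer against the gold target; the function name and the example answers below are hypothetical, not taken from the benchmark's official harness:

```python
def bbh_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that exactly match the gold target
    (after trimming whitespace). Higher is better."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must be the same length")
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)

# Illustrative answers in BBH's multiple-choice letter format.
preds = ["(A)", "(C)", "(B)", "(A)"]
golds = ["(A)", "(B)", "(B)", "(A)"]
print(bbh_accuracy(preds, golds))  # 0.75
```

Scores are usually reported per task and then averaged across the 23 tasks; the sketch above covers only the per-task exact-match step.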