HellaSwag

benchmark

Metadata

Source Table	`benchmarks`
Source ID	`nD2CFoyeBf`
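Description	A commonsense natural language inference benchmark that tests whether a model can pick the most plausible continuation of a scenario from four candidate endings. The incorrect endings are machine-generated and selected by adversarial filtering, so they tend to fool language models while remaining easy for humans to reject.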
Wiki ID	hellaswag
Children	—
Created	Mar 14, 2026, 12:43 AM
Updated	Mar 24, 2026, 11:24 PM
Synced	Mar 24, 2026, 11:24 PM

Record Data

`id`	nD2CFoyeBf
`slug`	hellaswag
`name`	HellaSwag
`category`	reasoning
`subCategory`	—
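`description`	A commonsense natural language inference benchmark that tests whether a model can pick the most plausible continuation of a scenario from four candidate endings. The incorrect endings are machine-generated and selected by adversarial filtering, so they tend to fool language models while remaining easy for humans to reject.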
`website`	—
`scoringMethod`	accuracy
`higherIsBetter`	Yes
`introducedDate`	2019-05
`maintainer`	AI2 / University of Washington
`source`	arxiv.org/abs/1905.07830
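
As a rough illustration of the `scoringMethod` value above, the sketch below computes accuracy for a multiple-choice task of this kind, where each item has several candidate endings and one gold index. The function names `pick_ending` and `accuracy` and the per-ending score lists are hypothetical and chosen only for this example; how a model actually produces the scores (for instance, log-likelihood of each ending) is outside the record and is assumed here.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending.

    Ties resolve to the lowest index, which is an arbitrary choice
    made for this sketch.
    """
    if not scores:
        raise ValueError("each item needs at least one candidate score")
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of items whose top-scoring ending matches the gold index."""
    if len(all_scores) != len(gold):
        raise ValueError("scores and gold labels must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(pick_ending(s) == g for s, g in zip(all_scores, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Three made-up items with four candidate endings each (hypothetical scores).
    example_scores = [
        [0.1, 0.7, 0.1, 0.1],
        [0.4, 0.3, 0.2, 0.1],
        [0.0, 0.1, 0.2, 0.9],
    ]
    example_gold = [1, 2, 3]
    print(f"accuracy = {accuracy(example_scores, example_gold):.3f}")
```

Because `higherIsBetter` is Yes, a larger value from a function like this would rank a model higher; with four endings per item, random guessing would be expected to land near 0.25.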
