benchmark
HellaSwag
Metadata
| Source Table | benchmarks |
| Source ID | nD2CFoyeBf |
| Description | A commonsense natural language inference benchmark that tests whether a model can pick the most plausible continuation of a scenario. Incorrect endings are machine-generated and selected through adversarial filtering so that they are hard for language models to reject. |
| Wiki ID | hellaswag |
| Children | — |
| Created | Mar 14, 2026, 12:43 AM |
| Updated | Mar 24, 2026, 11:24 PM |
| Synced | Mar 24, 2026, 11:24 PM |
Record Data
| id | nD2CFoyeBf |
| slug | hellaswag |
| name | HellaSwag |
| category | reasoning |
| description | A commonsense natural language inference benchmark that tests whether a model can pick the most plausible continuation of a scenario. Incorrect endings are machine-generated and selected through adversarial filtering so that they are hard for language models to reject. |
| website | — |
| scoringMethod | accuracy |
| higherIsBetter | Yes |
| introducedDate | 2019-05 |
| maintainer | AI2 / University of Washington |
| source | arxiv.org/abs/1905.07830 |
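The record gives accuracy as the scoring method, with higher being better. As a minimal sketch of what that could mean for a multiple-choice benchmark of this kind, the snippet below assumes each item comes with one score per candidate ending (for example a model log-likelihood) and a gold index. The function names `pick_ending` and `accuracy` and the example scores are hypothetical and not taken from the record or from any official evaluation harness.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of items whose top-scoring ending matches the gold label."""
    if len(all_scores) != len(gold):
        raise ValueError("need exactly one gold label per item")
    if not gold:
        raise ValueError("cannot compute accuracy over zero items")
    correct = sum(pick_ending(s) == g for s, g in zip(all_scores, gold))
    return correct / len(gold)


# Hypothetical scores for three items with four candidate endings each.
example_scores = [
    [-4.1, -2.3, -5.0, -3.9],
    [-1.2, -3.3, -2.8, -4.0],
    [-2.5, -2.9, -1.1, -3.0],
]
example_gold = [1, 0, 3]  # illustrative gold labels, not real benchmark data
example_accuracy = accuracy(example_scores, example_gold)
```

How the per-ending scores are produced (raw versus length-normalized likelihoods, for instance) is left open here; that choice can change the reported accuracy, so it is worth stating alongside any score.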