Longterm Wiki

WILDS: A Benchmark of in-the-Wild Distribution Shifts

web
wilds.stanford.edu/

WILDS is a widely-used benchmark in ML robustness research; relevant to AI safety discussions about whether models maintain reliable performance when deployed in conditions differing from their training distribution.

Metadata

Importance: 62/100 · tool page · dataset

Summary

WILDS is a benchmark suite from Stanford designed to evaluate machine learning model robustness to real-world distribution shifts, covering scenarios where training and test data differ due to geographic, temporal, or demographic factors. It provides curated datasets across diverse domains (medical imaging, wildlife detection, text classification, etc.) to measure model generalization under distribution shift. The benchmark aims to close the gap between academic ML performance metrics and real-world deployment reliability.
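
The datasets are accessed through the project's Python package (`pip install wilds`). The sketch below follows the quickstart interface described in the WILDS documentation; the dataset choice and batch size are illustrative, and the full download is large, so treat this as a hedged sketch rather than a verified script:

```python
# Illustrative WILDS quickstart; assumes `pip install wilds` (plus PyTorch and
# torchvision) and enough disk space for the dataset download.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

# Download and load a WILDS dataset (iWildCam: wildlife camera-trap images).
dataset = get_dataset(dataset="iwildcam", download=True)

# The official train split; metadata encodes each example's domain (camera).
train_data = dataset.get_subset("train", transform=transforms.ToTensor())

# A standard (non-grouped) PyTorch data loader over the training split.
train_loader = get_train_loader("standard", train_data, batch_size=16)

for x, y, metadata in train_loader:
    ...  # train a model here
```

Because every dataset exposes the same `get_dataset`/`get_subset` interface, swapping in a different domain (e.g. text classification instead of imagery) changes only the dataset name and transform.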

Key Points

  • Provides 10 datasets spanning diverse domains where distribution shift is a documented, real-world problem rather than a synthetic one.
  • Distinguishes between subpopulation shifts (same domain, different groups) and domain shifts (different environments/conditions).
  • Enables standardized evaluation of how well models generalize to out-of-distribution data, critical for safe deployment.
  • Includes unlabeled data splits to support and benchmark unsupervised and semi-supervised domain adaptation methods.
  • Directly relevant to AI safety concerns about model reliability and performance degradation in deployment settings.
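
In the subpopulation-shift setting, WILDS-style evaluation typically scores models by their worst-group performance rather than their average. A minimal, self-contained sketch of worst-group accuracy (a generic illustration, not the WILDS package's own evaluator; all names and data here are hypothetical):

```python
# Worst-group accuracy: the minimum accuracy over subpopulations (groups).
# Generic sketch for illustration, not the WILDS package's evaluation code.
from collections import defaultdict

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst_accuracy, per_group_accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group

# Toy example: average accuracy is 4/6, but group "b" drags the
# worst-group accuracy down to 1/3.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
```

A model can look strong on average while failing badly on one subpopulation, which is exactly the gap between benchmark metrics and deployment reliability that WILDS targets.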

Cited by 1 page

Page | Type | Quality
--- | --- | ---
AI Distributional Shift | Risk | 91.0

Cached Content Preview

HTTP 200 · Fetched Apr 10, 2026 · 1 KB
 WILDS

 A benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping. 

 The v2.0 update adds unlabeled data to 8 datasets. The labeled data and evaluation metrics are exactly the same, so all previous results are directly comparable. Read our release notes to find out more!

 WILDS paper · Unlabeled data paper (v2) · Github

 Get Started

 Learn how to install and use our Python package, which provides a simple and standardized interface for all WILDS datasets.

 Read more

 Datasets

 WILDS consists of 10 datasets across a diverse range of data modalities, applications, and types of distribution shifts. Explore the datasets here.

 Explore datasets

 Leaderboard

 We track the state-of-the-art on each dataset. View and submit your results here.

 View leaderboard

 Updates

 WILDS is under active development. View our updates here.

 View updates

 Team

 Contact us if you have any questions, feedback, or suggestions for WILDS, or if you are interested in contributing!

 Read more
Resource ID: f7c48e789ade0eeb | Stable ID: sid_Cb1HEQJQHC