Longterm Wiki

WILDS: A Benchmark of in-the-Wild Distribution Shifts

web
wilds.stanford.edu/

WILDS is a widely-used benchmark in ML robustness research; relevant to AI safety discussions about whether models maintain reliable performance when deployed in conditions differing from their training distribution.

Metadata

Importance: 62/100 · tool page · dataset

Summary

WILDS is a benchmark suite from Stanford designed to evaluate machine learning model robustness to real-world distribution shifts, covering scenarios where training and test data differ due to geographic, temporal, or demographic factors. It provides curated datasets across diverse domains (medical imaging, wildlife detection, text classification, etc.) to measure model generalization under distribution shift. The benchmark aims to close the gap between academic ML performance metrics and real-world deployment reliability.
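
The datasets are accessed through the project's Python package (`pip install wilds`). The sketch below follows the quickstart interface described in the WILDS documentation; the dataset choice and batch size are illustrative, and the full download is large, so treat this as a hedged sketch rather than a verified script:

```python
# Illustrative WILDS quickstart; assumes `pip install wilds` (plus PyTorch and
# torchvision) and enough disk space for the dataset download.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

# Download and load a WILDS dataset (iWildCam: wildlife camera-trap images).
dataset = get_dataset(dataset="iwildcam", download=True)

# The official train split; metadata encodes each example's domain (camera).
train_data = dataset.get_subset("train", transform=transforms.ToTensor())

# A standard (non-grouped) PyTorch data loader over the training split.
train_loader = get_train_loader("standard", train_data, batch_size=16)

for x, y, metadata in train_loader:
    ...  # train a model here
```

Because every dataset exposes the same `get_dataset`/`get_subset` interface, swapping in a different domain (e.g. text classification instead of imagery) changes only the dataset name and transform.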

Key Points

  • Provides 10 datasets spanning diverse domains where distribution shift is a documented, real-world problem rather than a synthetic one.
  • Distinguishes between subpopulation shifts (same domain, different groups) and domain shifts (different environments/conditions).
  • Enables standardized evaluation of how well models generalize to out-of-distribution data, critical for safe deployment.
  • Includes unlabeled data splits to support and benchmark unsupervised and semi-supervised domain adaptation methods.
  • Directly relevant to AI safety concerns about model reliability and performance degradation in deployment settings.
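
In the subpopulation-shift setting, WILDS-style evaluation typically scores models by their worst-group performance rather than their average. A minimal, self-contained sketch of worst-group accuracy (a generic illustration, not the WILDS package's own evaluator; all names and data here are hypothetical):

```python
# Worst-group accuracy: the minimum accuracy over subpopulations (groups).
# Generic sketch for illustration, not the WILDS package's evaluation code.
from collections import defaultdict

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst_accuracy, per_group_accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group

# Toy example: average accuracy is 4/6, but group "b" drags the
# worst-group accuracy down to 1/3.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
```

A model can look strong on average while failing badly on one subpopulation, which is exactly the gap between benchmark metrics and deployment reliability that WILDS targets.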

Cited by 1 page

Page | Type | Quality
--- | --- | ---
AI Distributional Shift | Risk | 91.0

Cached Content Preview

HTTP 200 · Fetched Apr 10, 2026 · 1 KB
 WILDS

 A benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping. 

 The v2.0 update adds unlabeled data to 8 datasets. The labeled data and evaluation metrics are exactly the same, so all previous results are directly comparable. Read our release notes to find out more!

 WILDS paper · Unlabeled data paper (v2) · Github

 Get Started

 Learn how to install and use our Python package, which provides a simple and standardized interface for all WILDS datasets.

 Read more

 Datasets

 WILDS consists of 10 datasets across a diverse range of data modalities, applications, and types of distribution shifts. Explore the datasets here.

 Explore datasets

 Leaderboard

 We track the state-of-the-art on each dataset. View and submit your results here.

 View leaderboard

 Updates

 WILDS is under active development. View our updates here.

 View updates

 Team

 Contact us if you have any questions, feedback, or suggestions for WILDS, or if you are interested in contributing!

 Read more
Resource ID: f7c48e789ade0eeb | Stable ID: sid_Cb1HEQJQHC