Underspecification in Machine Learning
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A highly influential paper from Google Brain demonstrating that ML reliability problems often stem from underspecification rather than overfitting, directly relevant to AI safety concerns about model predictability and deployment robustness.
Paper Details
Metadata
Abstract
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
Summary
This paper introduces 'underspecification' as a fundamental problem in ML pipelines: many different models with equivalent training performance can behave very differently in deployment. The authors demonstrate that standard ML training procedures return a set of equally good models, with no mechanism to prefer more robust or reliable ones, leading to unpredictable real-world failures.
Key Points
- ML pipelines are underspecified when training performance cannot distinguish between models that behave differently under deployment conditions.
- Underspecification is widespread across domains including medical imaging, NLP, and genomics, causing silent failures in real-world settings.
- Models that perform equivalently on held-out test sets can differ dramatically on stress tests, distribution shifts, and fairness metrics.
- Standard practices like cross-validation and hyperparameter tuning do not resolve underspecification and may obscure it.
- Addressing underspecification requires domain-specific inductive biases, stress testing, and explicit robustness criteria during training.
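The core failure mode described above can be illustrated with a minimal toy sketch (not from the paper; all names and the data-generating setup are invented for illustration): two predictors are indistinguishable in the training domain because a spurious proxy feature is perfectly correlated with the stable one there, yet they diverge sharply once that correlation breaks at deployment.

```python
import random

random.seed(0)

def make_data(n, corr):
    """Generate (x1, x2, y) triples. The label y depends only on x1
    (the stable feature); x2 agrees with x1 with probability `corr`
    (a spurious proxy)."""
    data = []
    for _ in range(n):
        x1 = random.choice([0, 1])
        x2 = x1 if random.random() < corr else 1 - x1
        data.append((x1, x2, x1))
    return data

# Two equally simple predictors the training pipeline cannot tell apart.
def predict_stable(x1, x2):
    return x1  # relies on the feature that actually determines y

def predict_spurious(x1, x2):
    return x2  # relies on the proxy

def accuracy(predict, data):
    return sum(predict(x1, x2) == y for x1, x2, y in data) / len(data)

train = make_data(5000, corr=1.0)   # proxy perfectly aligned in training
deploy = make_data(5000, corr=0.5)  # alignment breaks at deployment

# Identical held-out performance in the training domain...
print(accuracy(predict_stable, train), accuracy(predict_spurious, train))
# ...but very different behavior under distribution shift.
print(accuracy(predict_stable, deploy), accuracy(predict_spurious, deploy))
```

Both predictors score 1.0 on the training-domain data, so held-out accuracy alone gives the pipeline no reason to prefer one; only a stress test on shifted data (where the spurious predictor drops to roughly chance) reveals the difference.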
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Long-Timelines Technical Worldview | Concept | 91.0 |
Cached Content Preview
[2011.03395] Underspecification Presents Challenges for Credibility in Modern Machine Learning
Alexander D'Amour (alexdamour@google.com)
Katherine Heller† (kheller@google.com)
Dan Moldovan† (mdan@google.com)
Ben Adlam (adlam@google.com)
Babak Alipanahi (babaka@google.com)
Alex Beutel (alexbeutel@google.com)
Christina Chen (christinium@google.com)
Jonathan Deaton (jdeaton@google.com)
Jacob Eisenstein (jeisenstein@google.com)
Matthew D. Hoffman (mhoffman@google.com)
Farhad Hormozdiari (fhormoz@google.com)
Neil Houlsby (neilhoulsby@google.com)
Shaobo Hou (shaobohou@google.com)
Ghassen Jerfel (ghassen@google.com)
Alan Karthikesalingam (alankarthi@google.com)
Mario Lucic (lucic@google.com)
Yian Ma (yianma@ucsd.edu)
Cory McLean (cym@google.com)
Diana Mincu (dmincu@google.com)
Akinori Mitani (amitani@google.com)
Andrea Montanari (montanari@stanford.edu)
Zachary Nado (znado@google.com)
Vivek Natarajan (natviv@google.com)
Christopher Nielson (christopher.nielson@va.gov)
Thomas F. Osborne† (thomas.osborne@va.gov)
Rajiv Raman (drrrn@snmail.org)
Kim Ramasamy (kim@aravind.org)
Rory Sayres (sayres@google.com)
Jessica Schrouff (schrouff@google.com)
Martin Seneviratne (martsen@google.com)
Shannon Sequeira (shnnn@google.com)
Harini Suresh (hsuresh@mit.edu)
Victor Veitch (victorveitch@google.com)
Max Vladymyrov (mxv@google.com)
Xuezhi Wang (xuezhiw@google.com)
Kellie Webster (websterk@google.com)
Steve Yadlowsky (yadlowsky@google.com)
Taedong Yun (tedyun@google.com)
Xiaohua Zhai (xzhai@google.com)
D. Sculley (dsculley@google.com)
† These authors contributed equally to this work. This paper represents the views of the authors, and not of the VA.
... (truncated, 98 KB total)