
Underspecification in Machine Learning

paper

Authors

Alexander D'Amour·Katherine Heller·Dan Moldovan·Ben Adlam·Babak Alipanahi·Alex Beutel·Christina Chen·Jonathan Deaton·Jacob Eisenstein·Matthew D. Hoffman·Farhad Hormozdiari·Neil Houlsby·Shaobo Hou·Ghassen Jerfel·Alan Karthikesalingam·Mario Lucic·Yian Ma·Cory McLean·Diana Mincu·Akinori Mitani·Andrea Montanari·Zachary Nado·Vivek Natarajan·Christopher Nielson·Thomas F. Osborne·Rajiv Raman·Kim Ramasamy·Rory Sayres·Jessica Schrouff·Martin Seneviratne·Shannon Sequeira·Harini Suresh·Victor Veitch·Max Vladymyrov·Xuezhi Wang·Kellie Webster·Steve Yadlowsky·Taedong Yun·Xiaohua Zhai·D. Sculley

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A highly influential paper from Google Brain demonstrating that ML reliability problems often stem from underspecification rather than overfitting, directly relevant to AI safety concerns about model predictability and deployment robustness.

Paper Details

Citations: 815 (44 influential)
Year: 2020

Metadata

Importance: 78/100 · arXiv preprint · primary source

Abstract

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

Summary

This paper introduces 'underspecification' as a fundamental problem in ML pipelines: many different models with equivalent training performance can behave very differently in deployment. The authors show that standard training procedures effectively choose arbitrarily among a set of equally good models, with no mechanism to prefer more robust or reliable ones, which leads to unpredictable real-world failures.
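The core setup can be sketched in a few lines of scikit-learn. The synthetic data, feature names, and seed-varying loop below are illustrative assumptions, not an experiment from the paper; they simply mimic its observation that re-running one pipeline with different seeds yields predictors that tie on held-out data but diverge once a spurious "shortcut" feature stops tracking the label.

```python
# Hypothetical demo, not code from the paper: identical pipeline, different
# random seeds -> near-identical held-out accuracy in the training domain,
# divergent accuracy in a deployment domain where the shortcut decouples.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000

# Training domain: x_causal drives the label; x_shortcut merely co-occurs with it.
x_causal = rng.normal(size=n)
y = (x_causal + 0.4 * rng.normal(size=n) > 0).astype(int)
x_shortcut = y + 0.1 * rng.normal(size=n)
X_train, X_heldout, y_train, y_heldout = train_test_split(
    np.column_stack([x_causal, x_shortcut]), y, random_state=0)

# Deployment domain: the shortcut feature no longer tracks the label.
x_causal_dep = rng.normal(size=n)
y_dep = (x_causal_dep + 0.4 * rng.normal(size=n) > 0).astype(int)
X_dep = np.column_stack([x_causal_dep, rng.normal(size=n)])

# "Equivalent" predictors: only the random seed changes between runs.
for seed in range(5):
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=seed).fit(X_train, y_train)
    print(f"seed {seed}: held-out acc {model.score(X_heldout, y_heldout):.3f}, "
          f"deployment acc {model.score(X_dep, y_dep):.3f}")
```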

Key Points

  • ML pipelines are underspecified when training performance cannot distinguish between models that behave differently under deployment conditions.
  • Underspecification is widespread across domains including medical imaging, NLP, and genomics, causing silent failures in real-world settings.
  • Models that perform equivalently on held-out test sets can differ dramatically on stress tests, distribution shifts, and fairness metrics.
  • Standard practices like cross-validation and hyperparameter tuning do not resolve underspecification and may obscure it.
  • Addressing underspecification requires domain-specific inductive biases, stress testing, and explicit robustness criteria during training (a selection sketch follows this list).
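A hedged sketch of the last point: the function name, tie tolerance, and worst-case criterion below are illustrative assumptions rather than the paper's procedure. The idea is simply that, among candidate models that tie on in-domain validation accuracy, an explicit stress-test criterion can break the tie instead of leaving the choice to an arbitrary seed.

```python
# Illustrative only: a simple robustness criterion layered on top of an
# underspecified pipeline. Among candidates that tie on in-domain validation
# accuracy, keep the one with the highest worst-case accuracy across the
# supplied stress-test sets.
import numpy as np

def select_by_stress_tests(candidates, X_val, y_val, stress_sets, tie_tol=0.005):
    """candidates: fitted models exposing .score(X, y).
    stress_sets: list of (X, y) pairs of shifted or perturbed evaluation data."""
    val_acc = [m.score(X_val, y_val) for m in candidates]
    best = max(val_acc)
    tied = [m for m, acc in zip(candidates, val_acc) if acc >= best - tie_tol]
    worst_case = [min(m.score(Xs, ys) for Xs, ys in stress_sets) for m in tied]
    return tied[int(np.argmax(worst_case))]
```

Applied to the seed-varied models in the earlier sketch, this would choose among the equally accurate candidates using shifted-domain behavior rather than the seed.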

Cited by 1 page

Page | Type | Quality
Long-Timelines Technical Worldview | Concept | 91.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 98 KB
[2011.03395] Underspecification Presents Challenges for Credibility in Modern Machine Learning
Alexander D'Amour (alexdamour@google.com), Katherine Heller (kheller@google.com), Dan Moldovan (mdan@google.com), Ben Adlam (adlam@google.com), Babak Alipanahi (babaka@google.com), Alex Beutel (alexbeutel@google.com), Christina Chen (christinium@google.com), Jonathan Deaton (jdeaton@google.com), Jacob Eisenstein (jeisenstein@google.com), Matthew D. Hoffman (mhoffman@google.com), Farhad Hormozdiari (fhormoz@google.com), Neil Houlsby (neilhoulsby@google.com), Shaobo Hou (shaobohou@google.com), Ghassen Jerfel (ghassen@google.com), Alan Karthikesalingam (alankarthi@google.com), Mario Lucic (lucic@google.com), Yian Ma (yianma@ucsd.edu), Cory McLean (cym@google.com), Diana Mincu (dmincu@google.com), Akinori Mitani (amitani@google.com), Andrea Montanari (montanari@stanford.edu), Zachary Nado (znado@google.com), Vivek Natarajan (natviv@google.com), Christopher Nielson (christopher.nielson@va.gov), Thomas F. Osborne (thomas.osborne@va.gov), Rajiv Raman (drrrn@snmail.org), Kim Ramasamy (kim@aravind.org), Rory Sayres (sayres@google.com), Jessica Schrouff (schrouff@google.com), Martin Seneviratne (martsen@google.com), Shannon Sequeira (shnnn@google.com), Harini Suresh (hsuresh@mit.edu), Victor Veitch (victorveitch@google.com), Max Vladymyrov (mxv@google.com), Xuezhi Wang (xuezhiw@google.com), Kellie Webster (websterk@google.com), Steve Yadlowsky (yadlowsky@google.com), Taedong Yun (tedyun@google.com), Xiaohua Zhai (xzhai@google.com), D. Sculley (dsculley@google.com)

These authors contributed equally to this work. This paper represents the views of the authors, and not of the VA.
 

 
 Abstract

 ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior 

... (truncated, 98 KB total)
Resource ID: 099b1261e607bc66 | Stable ID: sid_7x4leu0X5k