Risks from Learned Optimization

paper

2019·arXiv·arxiv.org/abs/1906.01820

Authors

Evan Hubinger·Chris van Merwijk·Vladimir Mikulik·Joar Skalse·Scott Garrabrant

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Foundational paper introducing mesa-optimization, analyzing risks when learned models become optimizers themselves, directly addressing transparency and safety concerns in advanced ML systems.

Paper Details

Citations

20 influential

Year

2019

Methodology

survey

arXiv:1906.01820 · DOI:10.48550/arXiv.1906.01820 · Semantic Scholar

Metadata

arxiv preprint · primary source

Abstract

We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be - how will it differ from the loss function it was trained under - and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.

Summary

This paper introduces the concept of mesa-optimization, where a learned model (such as a neural network) functions as an optimizer itself. The authors analyze two critical safety concerns: (1) identifying when and why learned models become optimizers, and (2) understanding how a mesa-optimizer's objective function may diverge from its training loss and how to ensure alignment. The paper provides a comprehensive framework for understanding these phenomena and outlines important directions for future research in AI safety and transparency.

Cited by 17 pages

Page	Type	Quality
AI Accident Risk Cruxes	Crux	67.0
Deep Learning Revolution Era	Historical	44.0
AI Compounding Risks Analysis Model	Analysis	60.0
Deceptive Alignment Decomposition Model	Analysis	62.0
Goal Misgeneralization Probability Model	Analysis	61.0
Instrumental Convergence Framework	Analysis	60.0
Mesa-Optimization Risk Analysis	Analysis	61.0
Multipolar Trap Dynamics Model	Analysis	61.0
Evan Hubinger	Person	43.0
Scheming & Deception Detection	Approach	91.0
Sleeper Agent Detection	Approach	66.0
Deceptive Alignment	Risk	75.0
Instrumental Convergence	Risk	64.0
Mesa-Optimization	Risk	63.0
Sharp Left Turn	Risk	69.0
Treacherous Turn	Risk	67.0
AI Doomer Worldview	Concept	38.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB

[1906.01820] Risks from Learned Optimization in Advanced Machine Learning Systems

Risks from Learned Optimization in Advanced Machine Learning Systems

 
 
Evan Hubinger*, Chris van Merwijk*, Vladimir Mikulik*, Joar Skalse*, and Scott Garrabrant

* Alphabetical order. Equal contribution.

With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this paper.

(June 11, 2019)

 
 Abstract

 We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.

 

 
 
 1 Introduction

 
 We introduce the concept of mesa-optimization as well as many relevant terms and concepts related to it, such as what makes a system an optimizer, the difference between base and mesa- optimizers, and our two key problems: unintended optimization and inner alignment.

 
 
 In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this paper, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through p

... (truncated, 98 KB total)
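
The introduction above defines an optimizer as a system that searches a space of candidates (outputs, policies, plans, or similar) for elements that score high under an explicitly represented objective function. A minimal Python sketch of that definition follows. The names `search_optimizer` and `toy_objective` are hypothetical and do not come from the paper, and exhaustive search is assumed here only as the simplest stand-in for whatever search procedure a real system might use.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def search_optimizer(search_space: Iterable[T], objective: Callable[[T], float]) -> T:
    """Return the candidate that scores highest under an explicit objective.

    Illustrative assumption: the search space is small enough to enumerate.
    """
    return max(search_space, key=objective)


def toy_objective(plan: int) -> float:
    """Hypothetical objective: prefer integer 'plans' close to 7."""
    return -abs(plan - 7)


# Search over candidate plans 0..19 and keep the highest-scoring one.
best_plan = search_optimizer(range(20), toy_objective)
print(best_plan)
```

In the paper's terms, a learning algorithm plays this role over model parameters (the base optimizer), while a mesa-optimizer is a learned model that itself carries out a search of this kind against its own internally represented objective.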

Resource ID: c4858d4ef280d8e6 | Stable ID: sid_fgn7RycVXH