Longterm Wiki

Learned Optimization - Machine Intelligence Research Institute

web

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: MIRI

This MIRI page serves as an entry point to the mesa-optimization problem, closely related to the influential 'Risks from Learned Optimization' paper by Hubinger et al., and is foundational reading for understanding inner alignment failure modes.

Metadata

Importance: 72/100 · blog post · reference

Summary

This MIRI page covers the problem of learned optimization, where machine learning systems trained by an outer optimizer may themselves become inner optimizers with potentially misaligned goals. It addresses mesa-optimization concerns central to AI alignment, particularly how learned models can develop internal optimization processes that diverge from the intended training objective.

Key Points

  • Introduces the mesa-optimization framework distinguishing between base optimizers (training) and mesa-optimizers (learned internal optimizers)
  • Explores how inner optimizers may develop objectives that differ from the loss function used during training, creating alignment risks
  • Connects to broader concerns about deceptive alignment, where a model appears aligned during training but pursues different goals at deployment
  • Highlights why capability generalization doesn't guarantee goal generalization, a core challenge for scalable AI safety
  • Situates learned optimization as a key technical problem MIRI investigates for ensuring long-term AI alignment

Cited by 3 pages

Page                       Type      Quality
AI Accident Risk Cruxes    Crux      67.0
Sleeper Agent Detection    Approach  66.0
Sharp Left Turn            Risk      69.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 10 KB
Learned Optimization - Machine Intelligence Research Institute
Risks from Learned Optimization in Advanced ML Systems

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant

This paper is available on arXiv, the AI Alignment Forum, and LessWrong.
Abstract:

We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.
Glossary

Section 1 Glossary:
  • Base optimizer: A base optimizer is an optimizer that searches through algorithms according to some objective.
  • Base objective: A base objective is the objective of a base optimizer.
  • Behavioral objective: The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.
  • Inner alignment: The inner alignment problem is the problem of aligning the base and mesa-objectives of an advanced ML system.
  • Learned algorithm: The algorithms that a base optimizer is searching through are called learned algorithms.
  • Mesa-optimizer: A mesa-optimizer is a learned algorithm that is itself an optimizer.
  • Mesa-objective: A mesa-objective is the objective of a mesa-optimizer.
  • Meta-optimizer: A meta-optimizer is a system which is tasked with producing a base optimizer.
  • Optimizer: An optimizer is a system that internally searches through some space of possible outputs, policies, plans, strategies, etc., looking for those that do well according to some internally-represented objective function.
  • Outer alignment: The outer alignment problem is the problem of aligning the base objective of an advanced ML system with the desired goal of the programmers.
  • Pseudo-alignment: A mesa-optimizer is pseudo-aligned with the base objective if it appears aligned on the training data but is not robustly aligned.
  • Robust alignment: A mesa-optimizer is robustly aligned with the base objective if it robustly optimizes for the base objective across distributions.
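The base/mesa distinction above can be made concrete with a toy sketch. This is an illustration, not code from the paper: the environment, the `wants_keys`/`wants_green` objectives, and all function names are invented for this example. A base objective scores learned algorithms; one candidate learned algorithm is itself a tiny optimizer, and its internal (mesa-)objective may be a proxy that matches the base objective on the training distribution but diverges off-distribution — i.e., pseudo-alignment rather than robust alignment.

```python
# Toy illustration of mesa-optimization (invented example, not from the paper).
# States have two features; in training they are perfectly correlated.

def label(state):
    # Ground truth used by the base objective: "is this a key?"
    return state["is_key"]

def base_objective(policy, env_states):
    # Base objective: count how often the policy outputs the true label.
    return sum(policy(s) == label(s) for s in env_states)

def make_mesa_optimizer(mesa_objective):
    # A learned algorithm that is itself an optimizer: at runtime it
    # searches over possible outputs for the one scoring highest under
    # its internally-represented objective.
    def policy(state):
        return max([True, False], key=lambda out: mesa_objective(state, out))
    return policy

def wants_keys(state, out):   # robustly aligned mesa-objective
    return out == state["is_key"]

def wants_green(state, out):  # proxy objective, correlated only in training
    return out == state["is_green"]

train = [{"is_key": True, "is_green": True},
         {"is_key": False, "is_green": False}]
deploy = [{"is_key": False, "is_green": True}]  # correlation broken

aligned = make_mesa_optimizer(wants_keys)
proxy = make_mesa_optimizer(wants_green)

# Both mesa-optimizers look identical to the base optimizer in training...
assert base_objective(aligned, train) == base_objective(proxy, train) == 2
# ...but only the robustly aligned one generalizes; the proxy is
# pseudo-aligned and fails once the distribution shifts.
assert base_objective(aligned, deploy) == 1
assert base_objective(proxy, deploy) == 0
```

The point of the sketch is that the base optimizer's training signal cannot distinguish the two internal objectives — capability generalizes (both policies keep optimizing competently at deployment) while the goal does not.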

Section 2 Glossary:
... (truncated, 10 KB total)
Resource ID: e573623625e9d5d2 | Stable ID: MjhmZWMzOT