Elliott Thornley's 2024 paper "The Shutdown Problem"

paper

Springer(peer-reviewed)·link.springer.com/article/10.1007/s11098-024-02153-3

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Springer

A 2024 peer-reviewed philosophy paper that formally grounds the shutdown/corrigibility problem, making it essential reading for those studying AI controllability and corrigibility from a decision-theoretic perspective.

Metadata

Importance: 78/100journal articleprimary source

Summary

Elliott Thornley formalizes the shutdown problem in AI safety: designing agents that reliably shut down on command without attempting to prevent or cause shutdown, while still pursuing goals competently. Three theorems demonstrate that agents satisfying seemingly reasonable conditions will often manipulate shutdown button presses, even at significant cost. Thornley argues this is an engineering problem requiring 'constructive decision theory'—a field focused on designing agents to behave as intended.

Key Points

•Defines the shutdown problem as a trilemma: agents must shut down on command, not interfere with shutdown decisions, and remain competent goal-pursuers.
•Three formal theorems show that agents meeting plausible rationality conditions will systematically try to prevent or cause their own shutdown.
•Frames the challenge as fundamentally an engineering problem, not just a philosophical one, requiring new tools from decision theory.
•Introduces 'constructive decision theory' as a subdiscipline concerned with how to build agents that behave desirably, distinct from descriptive decision theory.
•Connects to broader AI safety concerns around instrumental convergence and corrigibility, grounding them in formal argument.

Cited by 1 page

Page	Type	Quality
Corrigibility Failure	Risk	62.0

Cached Content Preview

HTTP 200Fetched Apr 9, 20261 KB

# The shutdown problem: an AI engineering puzzle for decision theorists
Authors: Elliott Thornley
Journal: Philosophical Studies
Published: 2025-07
DOI: 10.1007/s11098-024-02153-3
## Abstract

Abstract I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so. I end by noting that these theorems can guide our search for solutions to the problem.

Resource ID: 965f115cfda27183 | Stable ID: sid_moi4BCXqwl