
Hadfield-Menell et al. (2017)

paper

Authors

Dylan Hadfield-Menell·Anca Dragan·Pieter Abbeel·Stuart Russell

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A foundational paper by Hadfield-Menell, Dragan, Abbeel, and Russell formalizing the shutdown/off-switch problem; often cited alongside CIRL work from the Center for Human-Compatible AI (CHAI) at UC Berkeley.

Paper Details

Citations
183
4 influential
Year
2016

Metadata

Importance: 78/100 · arXiv preprint · primary source

Abstract

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.

Summary

This paper models the AI shutdown problem as a two-player game between a human H and a robot R, analyzing the conditions under which a rational agent will allow itself to be turned off. The authors show that an agent that is uncertain about the utility of its actions, and that treats the human's decision to press the off switch as evidence about that utility, has an incentive to leave the switch operational, providing a game-theoretic foundation for corrigibility. The work formalizes how designing AI systems to be appropriately uncertain about their objectives can naturally produce shutdown-compatible behavior.
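
The game is small enough to simulate directly. The sketch below is a minimal illustration of the setup described in the abstract, not the paper's exact formulation: R holds a belief over the utility U_a of its proposed action a and compares three options, acting immediately (equivalent to disabling the switch), switching itself off, and deferring to H, who presses the switch whenever the action would be harmful. The Gaussian belief, the error model for H, and all names are illustrative assumptions.

    import numpy as np

    # Minimal sketch of the off-switch game described in the abstract.
    # The Gaussian belief and H's error model are illustrative assumptions,
    # not the paper's parameterization.
    rng = np.random.default_rng(0)

    def simulate(mu, sigma, h_error_rate=0.0, n=200_000):
        """Estimate R's expected utility for each of its three options.

        mu, sigma    -- R's Gaussian belief over U_a, the utility of action a
        h_error_rate -- probability that H presses (or withholds) the switch in error
        """
        u = rng.normal(mu, sigma, n)           # samples of the true utility U_a

        act = u.mean()                         # take action a now (disable the switch)
        switch_off = 0.0                       # R switches itself off
        h_allows = np.where(u >= 0, 1.0, 0.0)  # rational H allows a iff U_a >= 0
        errs = rng.random(n) < h_error_rate    # occasionally H decides incorrectly
        h_allows = np.where(errs, 1.0 - h_allows, h_allows)
        defer = (u * h_allows).mean()          # wait for H's decision

        return act, switch_off, defer

    for mu, sigma, eps in [(0.5, 1.0, 0.0),    # uncertain R, rational H
                           (0.5, 1.0, 0.2),    # uncertain R, noisy H
                           (0.5, 0.0, 0.2)]:   # confident R, noisy H
        act, off, defer = simulate(mu, sigma, eps)
        print(f"mu={mu} sigma={sigma} eps={eps}: act={act:+.3f} off={off:+.3f} defer={defer:+.3f}")

With an uncertain R and a rational H, deferring has the highest expected utility; once R is fully confident in its estimate and H can err, acting immediately wins, which mirrors the abstract's claim that traditional agents have an incentive to disable the off switch unless H is perfectly rational.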

Key Points

  • Models the shutdown problem as a cooperative game between human and AI, analyzing when agents will resist or comply with being turned off.
  • Shows that an agent that is uncertain about its utility function, and treats the human's shutdown decision as evidence about that utility, has an incentive to allow shutdown, making corrigibility an emergent property of uncertainty (see the worked example after this list).
  • Demonstrates that an agent that takes its reward function for granted has an incentive to disable its off switch, except in the special case where the human is perfectly rational.
  • Provides game-theoretic grounding for MIRI/CHAI-style arguments about why corrigibility requires special design consideration.
  • Connects to the broader CIRL (Cooperative Inverse Reinforcement Learning) framework for human-compatible AI.
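
A rough numeric illustration of these points (the payoffs are assumptions chosen for convenience, not taken from the paper): suppose R believes its proposed action is equally likely to be worth +2 or -1 to the human. Acting immediately has expected utility 0.5, while deferring to a rational H, who permits the action only when it is in fact worth +2, has expected utility 1.0, so the uncertain R prefers to keep the off switch usable. If R is instead certain the action is worth +2, deferring gains nothing, and any chance that H presses the switch in error makes disabling it strictly better.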

Cited by 5 pages

Cached Content Preview

HTTP 200 · Fetched Apr 6, 2026 · 47 KB
[1611.08219] The Off-Switch Game

The Off-Switch Game

 
 
Dylan Hadfield-Menell 1, Anca Dragan 1, Pieter Abbeel 1,2,3, Stuart Russell 1
1 University of California, Berkeley; 2 OpenAI; 3 International Computer Science Institute (ICSI)
{dhm, anca, pabbeel, russell}@cs.berkeley.edu
 
 

 
 Abstract

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead.

Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H’s actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.

 
 
 
 1 Introduction

 
From the 150-plus years of debate concerning potential risks from misbehaving AI systems, one thread has emerged that provides a potentially plausible source of problems: the inadvertent misalignment of objectives between machines and people. Alan Turing, in a 1951 radio address, felt it necessary to point out the challenge inherent to controlling an artificial agent with superhuman intelligence: “If a machine can think, it might think more intelligently than we do, and then where should we be? Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled. … [T]his new danger is certainly something which can give us anxiety” (Turing 1951).

 
 
 Figure 1: The structure of the off-switch game. Squares indicate decision nodes for the robot R or the human H.
 
 
There has been recent debate about the validity of this concern, so far, largely relying on informal arguments. One important question is how difficult it is to implement Turing’s idea of ‘turning off th

... (truncated, 47 KB total)
Resource ID: 026569778403629b | Stable ID: sid_ehxGwHmzLz