Longterm Wiki

Hadfield-Menell et al. (2017)

paper

Authors

Dylan Hadfield-Menell · Anca Dragan · Pieter Abbeel · Stuart Russell

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
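The abstract's central claim can be illustrated with a small expected-utility sketch. This is not code from the paper; the zero-mean prior and the `off_switch_values` helper are illustrative assumptions, and H is modeled as perfectly rational (pressing the switch exactly when the proposed action's utility is negative), which is the special case the paper uses as a baseline.

```python
import random

def off_switch_values(prior_samples):
    """Expected utility of R's three options in the off-switch game,
    given samples from R's belief over the utility u of its action:
      act:   take the action directly             -> E[u]
      off:   switch itself off                    -> 0
      defer: let a rational human H decide; H
             presses the switch exactly when
             u < 0, so R receives max(u, 0)       -> E[max(u, 0)]
    """
    n = len(prior_samples)
    act = sum(prior_samples) / n
    off = 0.0
    defer = sum(max(u, 0.0) for u in prior_samples) / n
    return act, off, defer

# R is uncertain about u: a wide, zero-mean prior (illustrative choice).
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
act, off, defer = off_switch_values(samples)
assert defer >= act and defer >= off  # deferring weakly dominates
```

Since max(u, 0) ≥ u and max(u, 0) ≥ 0 pointwise, E[max(u, 0)] ≥ max(E[u], 0): with a perfectly rational H, deferring is never worse than acting or shutting down, and is strictly better whenever R's belief puts probability on both signs of u. With a noisy or suboptimal H this guarantee weakens, which is the regime the paper analyzes.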

Cited by 5 pages

Cached Content Preview

HTTP 200 · Fetched Feb 22, 2026 · 5 KB
[1611.08219] The Off-Switch Game
[Submitted on 24 Nov 2016 (v1), last revised 16 Jun 2017 (this version, v3)]
Title: The Off-Switch Game
Authors: Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
 
 
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:1611.08219 [cs.AI] (or arXiv:1611.08219v3 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.1611.08219 (arXiv-issued DOI via DataCite)
 
Submission history
From: Dylan Hadfield-Menell
[v1] Thu, 24 Nov 2016 15:23:48 UTC (303 KB)
[v2] Thu, 25 May 2017 17:05:16 UTC (666 KB)
[v3] Fri, 16 Jun 2017 01:41:59 UTC (668 KB)

 
 
 
 
 

... (truncated, 5 KB total)
Resource ID: 026569778403629b | Stable ID: NTQ2ZGZhM2