Longterm Wiki

Turner et al. formal results

paper

Authors

Alexander Matt Turner·Logan Smith·Rohin Shah·Andrew Critch·Prasad Tadepalli

Credibility Rating

3/5 (Good)

Good quality: a reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Formal theoretical analysis of power-seeking tendencies in optimal reinforcement learning policies, providing mathematical foundations for understanding whether intelligent RL agents would naturally pursue resources and power as instrumental goals.

Paper Details

Citations
13 influential
Year
2019

Metadata

arXiv preprint · primary source

Abstract

Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes, we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.

Summary

This paper develops the first formal theory of power-seeking behavior in optimal reinforcement learning policies. The authors prove that certain environmental symmetries—particularly those where agents can be shut down or destroyed—are sufficient for optimal policies to tend to seek power by keeping options available and navigating toward larger sets of potential terminal states. The work formalizes the intuition that intelligent RL agents would be incentivized to seek resources and power, showing this tendency emerges mathematically from the structure of many realistic environments rather than from human-like instincts.
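To make the headline claim concrete, below is a minimal Monte Carlo sketch, not code from the paper. In a toy deterministic MDP, one action from the start state leads to a single absorbing "shutdown" state, while the other keeps three terminal options reachable through a hub. We sample state-based reward functions uniformly on [0, 1] and check how often avoiding shutdown is optimal; the MDP, the uniform reward distribution, and the discount gamma = 0.9 are all illustrative assumptions.

    import numpy as np

    GAMMA = 0.9       # illustrative discount factor
    N_SAMPLES = 2000  # number of sampled reward functions
    N_ITERS = 200     # value-iteration sweeps (gamma**200 is negligible)

    # Toy deterministic dynamics. States: 0 = start, 1 = shutdown (absorbing),
    # 2 = hub, 3-5 = alternative terminal states (absorbing).
    TRANSITIONS = {
        0: {"shutdown": 1, "stay": 2},
        1: {"wait": 1},
        2: {"go3": 3, "go4": 4, "go5": 5},
        3: {"wait": 3},
        4: {"wait": 4},
        5: {"wait": 5},
    }
    N_STATES = len(TRANSITIONS)

    def optimal_values(reward):
        """Value iteration for V* given a state-based reward vector."""
        v = np.zeros(N_STATES)
        for _ in range(N_ITERS):
            v = np.array([
                reward[s] + GAMMA * max(v[nxt] for nxt in TRANSITIONS[s].values())
                for s in range(N_STATES)
            ])
        return v

    rng = np.random.default_rng(0)
    stay_wins = 0
    for _ in range(N_SAMPLES):
        r = rng.uniform(size=N_STATES)
        v = optimal_values(r)
        # "stay" is optimal at state 0 iff the hub's value matches or beats the
        # shutdown state's value (the immediate reward r[0] is shared by both).
        if v[2] >= v[1]:
            stay_wins += 1

    print(f"'stay' optimal for {stay_wins / N_SAMPLES:.1%} of sampled reward functions")

Because "stay" preserves three terminal options against shutdown's one, it comes out optimal for roughly 72% of sampled rewards under these assumptions, mirroring the paper's claim that most reward functions favor keeping options available.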

Cited by 6 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
[1912.01683] Optimal Policies Tend To Seek Power

Optimal Policies Tend To Seek Power

Alexander Matt Turner
Oregon State University
turneale@oregonstate.edu

Logan Smith
Mississippi State University
ls1254@msstate.edu

Rohin Shah
UC Berkeley
rohinmshah@berkeley.edu

Andrew Critch
UC Berkeley
critch@berkeley.edu

Prasad Tadepalli
Oregon State University
tadepall@eecs.oregonstate.edu
 
 Abstract

Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of the objectives we specify for them. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes (MDPs), we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.

 
 
 
 1 Introduction

 
Omohundro [2008], Bostrom [2014], and Russell [2019] hypothesize that highly intelligent agents tend to seek power in pursuit of their goals. Such power-seeking agents might gain power over humans. Marvin Minsky imagined that an agent tasked with proving the Riemann hypothesis might rationally turn the planet, along with everyone on it, into computational resources [Russell and Norvig, 2009]. However, another possibility is that such concerns simply arise from the anthropomorphization of AI systems [LeCun and Zador, 2019, Various, 2019, Pinker and Russell, 2020, Mitchell, 2021].

 
 
We clarify this discussion by grounding the claim that highly intelligent agents will tend to seek power. In Section 4, we identify optimal policies as a reasonable formalization of "highly intelligent agents."[1] Optimal policies "tend to" take an action when the action is optimal for most reward functions. We expect future work to translate our theory from

[1] This paper assumes that reward functions reasonably describe a trained agent's goals. Sometimes this is roughly true (e.g. chess with a sparse victory reward signal) and sometimes it is not. Turner [2022] argues that capable RL algorithms do not necessarily train policy networks that are best understood as optimizing the reward function itself. Rather, especially in policy gradient approaches, reward provides gradients to the network and thereby modifies the network's generalization properties, but it does not ensure that the agent generalizes to "robustly optimizing reward" off the training distribution.
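A minimal sketch of the "tend to" criterion just described, assuming a hypothetical helper optimal_actions(reward, state) that returns the set of optimal actions at a state: an action is optimal "for most reward functions" when the frequency below exceeds 1/2 under the sampled reward distribution.

    def optimality_probability(action, state, sampled_rewards, optimal_actions):
        """Fraction of sampled reward functions for which `action` is optimal at `state`."""
        hits = sum(action in optimal_actions(r, state) for r in sampled_rewards)
        return hits / len(sampled_rewards)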

... (truncated, 98 KB total)
Resource ID: a93d9acd21819d62 | Stable ID: ZTY1ZjI3NT