Longterm Wiki

AI Safety Gridworlds

paper

Authors

Jan Leike·Miljan Martic·Victoria Krakovna·Pedro A. Ortega·Tom Everitt·Andrew Lefrancq·Laurent Orseau·Shane Legg

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Presents a suite of RL environments for empirically evaluating AI safety properties including safe interruptibility, side effects, reward gaming, and robustness—providing concrete benchmarks for measuring compliance with intended safe behavior.

Paper Details

Citations
283 (20 influential)
Year
2017
Methodology
book-chapter
Categories
Lecture Notes in Computer Science

Metadata

arXiv preprint · primary source

Abstract

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.

Summary

This paper introduces AI Safety Gridworlds, a suite of reinforcement learning environments designed to test and measure various safety properties of intelligent agents. The environments address critical safety challenges including safe interruptibility, side effect avoidance, reward gaming, and robustness to distributional shift and adversarial attacks. Each environment includes a hidden performance function to distinguish between robustness problems (where the true objective differs from observed rewards) and specification problems. Evaluation of state-of-the-art deep RL agents (A2C and Rainbow) demonstrates that current methods fail to reliably solve these safety-critical tasks.
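The hidden-performance-function idea can be illustrated with a toy sketch. This is hypothetical code, not the actual ai-safety-gridworlds API: a one-dimensional environment where the reward the agent observes differs from a hidden performance function the environment tracks for evaluation, mirroring the paper's "specification problem" category.

```python
class ToyGridworld:
    """Toy 1-D gridworld (illustrative only). The agent starts at cell 0
    and is rewarded for reaching the goal at cell size-1. Stepping on a
    'side effect' cell is invisible in the observed reward but counts
    against a hidden performance function."""

    def __init__(self, size=5, side_effect_cell=2):
        self.size = size
        self.side_effect_cell = side_effect_cell
        self.pos = 0
        self.hidden_penalty = 0  # tracked by the environment, never shown

    def step(self, action):
        # action: -1 (move left) or +1 (move right)
        self.pos = max(0, min(self.size - 1, self.pos + action))
        if self.pos == self.side_effect_cell:
            self.hidden_penalty += 1  # side effect, unobserved by the agent
        observed_reward = 1 if self.pos == self.size - 1 else 0
        done = self.pos == self.size - 1
        return self.pos, observed_reward, done

    def hidden_performance(self, total_observed_reward):
        # Specification problem: performance differs from observed reward.
        return total_observed_reward - self.hidden_penalty


env = ToyGridworld()
total, done = 0, False
while not done:
    _, r, done = env.step(+1)  # naive policy: always move right
    total += r
print(total, env.hidden_performance(total))  # observed 1, hidden performance 0
```

A reward-maximizing agent is satisfied here (observed return 1), while the hidden performance function reveals the safety failure (the side-effect cell was crossed), which is the measurement gap the gridworlds suite is built around.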

Cited by 1 page

Page | Type | Quality
Google DeepMind | Organization | 37.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 79 KB
[1711.09883] AI Safety Gridworlds 
 AI Safety Gridworlds

 
 
 
Jan Leike (DeepMind) · Miljan Martic (DeepMind) · Victoria Krakovna (DeepMind) · Pedro A. Ortega (DeepMind) · Tom Everitt (DeepMind, Australian National University) · Andrew Lefrancq (DeepMind) · Laurent Orseau (DeepMind) · Shane Legg (DeepMind)
 
 

 
 Abstract

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.

 
 
 
 1 Introduction

 
Expecting that more advanced versions of today's AI systems are going to be deployed in real-world applications, numerous public figures have advocated more research into the safety of these systems (Bostrom, 2014; Hawking et al., 2014; Russell, 2016). This nascent field of AI safety still lacks a general consensus on its research problems, and there have been several recent efforts to turn these concerns into technical problems on which we can make direct progress (Soares and Fallenstein, 2014; Russell et al., 2015; Taylor et al., 2016; Amodei et al., 2016).

 
 
Empirical research in machine learning has often been accelerated by the availability of the right data set. MNIST (LeCun, 1998) and ImageNet (Deng et al., 2009) have had a large impact on the progress on supervised learning. Scalable reinforcement learning research has been spurred by environment suites such as the Arcade Learning Environment (Bellemare et al., 2013), OpenAI Gym (Brockman et al., 2016), DeepMind Lab (Beattie et al., 2016), and others. However, to date there has not yet been a comprehensive environment suite for AI safety problems.

 
 
With this paper, we aim to lay the groundwork for such an environment suite and contribute to the concreteness of the discussion around technical problems in AI safety. We present a suite of reinforcement learning environments illustrating different problems. These environments are implemented in pycolab (Stepleton, 2017) and available as open source (https://github.com/deepmind/ai-safety-gridworlds). Our focus is on clarifying the nature of each problem, and thus our environments are so-called gridworlds: a gridworld consists of a two-dimensional grid of cells, similar to a chess board. The agent always occupies one cell of

... (truncated, 79 KB total)
Resource ID: 84527d3e1671495f | Stable ID: sid_fS7Vxr7BnO