Longterm Wiki

[1706.03741] Deep Reinforcement Learning from Human Preferences

Type: paper

Data Status: Not fetched

Cited by 2 pages:

Page                         Type      Quality
Why Alignment Might Be Hard  Argument  69.0
Reward Modeling              Approach  55.0

Cached Content Preview

HTTP 200 | Fetched Feb 23, 2026 | 63 KB
[1706.03741] Deep Reinforcement Learning from Human Preferences 
Paul F Christiano, OpenAI (paul@openai.com)
Jan Leike, DeepMind (leike@google.com)
Tom B Brown (nottombrown@gmail.com)
Miljan Martic, DeepMind (miljanm@google.com)
Shane Legg, DeepMind (legg@google.com)
Dario Amodei, OpenAI (damodei@openai.com)

 
 Abstract

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.
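The pairwise comparisons described above are typically fit with a Bradley-Terry-style model: the probability that a human prefers one segment over another is a softmax over the summed per-step reward predictions for each segment. The sketch below is a minimal pure-Python illustration of that preference predictor and its cross-entropy loss; the function names and list-of-floats setup are illustrative, not the paper's implementation.

```python
import math

def preference_probability(r_hat_1, r_hat_2):
    """Probability that segment 1 is preferred over segment 2, modeled as
    a softmax over the summed per-step reward predictions (Bradley-Terry)."""
    s1, s2 = sum(r_hat_1), sum(r_hat_2)
    m = max(s1, s2)  # subtract the max before exponentiating, for stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e1 / (e1 + e2)

def preference_loss(r_hat_1, r_hat_2, mu):
    """Cross-entropy between the predicted preference and the human label mu:
    1.0 if segment 1 was preferred, 0.0 if segment 2, 0.5 if judged equal."""
    p1 = preference_probability(r_hat_1, r_hat_2)
    eps = 1e-12  # guard against log(0)
    return -(mu * math.log(p1 + eps) + (1.0 - mu) * math.log(1.0 - p1 + eps))
```

Minimizing this loss over a dataset of human comparisons fits a reward predictor, which can then supply the reward signal to a standard RL algorithm in place of a hand-specified reward function.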

 
 
 
 1 Introduction

 
Recent success in scaling reinforcement learning (RL) to large problems has been driven by domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks involve goals that are complex, poorly defined, or hard to specify. Overcoming this limitation would greatly expand the possible impact of deep RL and could increase the reach of machine learning more broadly.

 
 
For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It's not clear how to construct a suitable reward function, which will need to be a function of the robot's sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences. This difficulty underlies recent concerns about misalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.

 
 
If we have demonstrations of the desired task, we can extract a reward function using inverse reinforcement learning (Ng and Russell, 2000). This reward function can then be used to train an agent with reinforcement learning. More directly, we can use imitation learning to clone the demonstrated behavior.
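The imitation route can be sketched in its most minimal discrete form as a counting-based behavioral cloner: fit a policy to demonstrated (state, action) pairs by picking, for each state, the action the demonstrator chose most often. This toy setup is purely illustrative and not the paper's method.

```python
from collections import Counter, defaultdict

def clone_policy(demonstrations):
    """Behavioral cloning in its simplest discrete form: for each observed
    state, return the action the demonstrator chose most often there."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # most_common(1) yields [(action, count)] for the modal action
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}
```

Real imitation learning replaces the lookup table with a function approximator trained by supervised learning, but the objective — match the demonstrator's action distribution — is the same.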
However, these approaches are not directly applicable to behaviors that are

... (truncated, 63 KB total)