[1706.03741] Deep Reinforcement Learning from Human Preferences
paperarxiv.org·arxiv.org/abs/1706.03741
Deep Reinforcement Learning from Human Preferences

Paul F. Christiano, OpenAI, paul@openai.com
Jan Leike, DeepMind, leike@google.com
Tom B. Brown, nottombrown@gmail.com
Miljan Martic, DeepMind, miljanm@google.com
Shane Legg, DeepMind, legg@google.com
Dario Amodei, OpenAI, damodei@openai.com
Abstract

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.
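The preference-based setup the abstract describes is formalized later in the paper as a Bradley-Terry model: the probability that a human prefers one trajectory segment over another is a softmax over the summed predicted rewards of each segment, and the reward predictor is fit by cross-entropy against the human judgments. A minimal sketch of that objective, with plain Python floats standing in for the learned reward estimates (function names are mine, not the paper's):

```python
import math

def preference_prob(rhat_seg1, rhat_seg2):
    """P[segment 1 preferred over segment 2] under a Bradley-Terry model:
    a softmax over the summed predicted rewards of each segment."""
    s1, s2 = sum(rhat_seg1), sum(rhat_seg2)
    m = max(s1, s2)  # shift exponents for numerical stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e1 / (e1 + e2)

def preference_loss(rhat_seg1, rhat_seg2, mu1):
    """Cross-entropy between the predicted preference probability and the
    human judgment mu1 in [0, 1] (mu1 = 0.5 encodes 'equally preferable')."""
    p1 = preference_prob(rhat_seg1, rhat_seg2)
    p1 = min(max(p1, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(mu1 * math.log(p1) + (1.0 - mu1) * math.log(1.0 - p1))
```

In the full method the per-step reward estimates come from a neural network trained on these comparisons, and the agent is then trained with standard RL against that learned reward; the sketch above shows only the comparison objective.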
1 Introduction

Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify. Overcoming this limitation would greatly expand the possible impact of deep RL and could increase the reach of machine learning more broadly.

For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It's not clear how to construct a suitable reward function, which will need to be a function of the robot's sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences. This difficulty underlies recent concerns about misalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.
If we have demonstrations of the desired task, we can extract a reward function using inverse reinforcement learning (Ng and Russell, 2000). This reward function can then be used to train an agent with reinforcement learning. More directly, we can use imitation learning to clone the demonstrated behavior. However, these approaches are not directly applicable to behaviors that are
... (truncated, 63 KB total)