Longterm Wiki

Scalable agent alignment via reward modeling

paper

Authors

Jan Leike·David Krueger·Tom Everitt·Miljan Martic·Vishal Maini·Shane Legg

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.

Cited by 2 pages

| Page | Type | Quality |
|---|---|---|
| Google DeepMind | Organization | 37.0 |
| AI Alignment | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
# Scalable agent alignment via reward modeling: a research direction

Jan Leike (DeepMind) · David Krueger (DeepMind, Mila) · Tom Everitt (DeepMind) · Miljan Martic (DeepMind) · Vishal Maini (DeepMind) · Shane Legg (DeepMind)

Work done during an internship at DeepMind.

###### Abstract

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the _agent alignment problem_: how do we create agents that behave in accordance with the user’s intentions? We outline a high-level research direction to solve the agent alignment problem centered around _reward modeling_: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.
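The reward-modeling recipe the abstract describes, learning a reward function from user interaction and then optimizing it, can be caricatured in a few lines. Below is a minimal, self-contained sketch, not the paper's system: all names are hypothetical, the user's preferences are simulated over a tiny integer state space, and the RL step is reduced to greedy maximization of the learned reward.

```python
# Toy sketch of the reward-modeling loop: (1) learn a reward function
# from user preference comparisons, (2) optimize the learned reward.
# Illustrative only; "RL" here is just greedy maximization.

def train_reward_model(comparisons):
    """Fit a per-state reward table from (preferred, rejected) pairs."""
    reward = {s: 0.0 for s in range(10)}
    for preferred, rejected in comparisons:
        reward[preferred] += 1.0
        reward[rejected] -= 1.0
    return reward

def optimize_policy(reward):
    """Policy-optimization step: pick the state the learned reward favors."""
    return max(reward, key=reward.get)

# Simulated user feedback: the user's implicit objective prefers larger
# states, so in every comparison the larger state is the preferred one.
comparisons = [(max(a, b), min(a, b))
               for a in range(10) for b in range(10) if a != b]

model = train_reward_model(comparisons)
best = optimize_policy(model)
print(best)  # prints 9: the state the user's comparisons implicitly favor
```

The point of the toy is the division of labor: the user never writes down a reward function, only answers comparisons, yet the agent recovers and optimizes the implicit objective. Scaling this split to complex domains is the research direction the paper outlines.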

## 1 Introduction

Games are a useful benchmark for research because progress is easily measurable. Atari games come with a score function that captures how well the agent is playing the game; board games or competitive multiplayer games such as Dota 2 and StarCraft II have a clear winner or loser at the end of the game. This helps us determine empirically which algorithmic and architectural improvements work best.

However, the ultimate goal of machine learning (ML) research is to go beyond games and improve human lives. To achieve this we need ML to assist us in real-world domains, ranging from simple tasks like ordering food or answering emails to complex tasks like software engineering or running a business. Yet performance on these and other real-world tasks is not easily measurable, since they do not come readily equipped with a reward function. Instead, the objective of the task is only indirectly available through the intentions of the human user.

This requires walking a fine line. On the one hand, we want ML to generate creative and brilliant solutions like AlphaGo’s Move 37 (Metz, [2016](https://ar5iv.labs.arxiv.org/html/1811.07871#bib.bib120 ""))—a move that no human would have recommended, yet it completely turned the game in AlphaGo’s favor. On the other hand, we want to avoid degenerate solutions that lead to undesired behavior like exploiting a bug in the environment simulator (Clark & Amodei, [2016](https://ar5iv.labs.arxiv.org/html/1811.07871#bib.bib43 ""); Lehman et al., [2018](https://ar5iv.labs.arxiv.org/html/1811.07871#bib.bib111 "")). In order to differentiate between these two outcomes, our agent needs to understand its user’s _intentions_, and robustly achieve these intentions with its behavior.
We frame this as the _agent alignment problem_:

> _How can we create agents that behave in accordance with the user’s intentions?_

... (truncated, 98 KB total)