Iterated Distillation and Amplification
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| Paul Christiano | Person | 39.0 |
Cached Content Preview
# Supervising strong learners by amplifying weak experts
Paul Christiano
OpenAI
paul@openai.com
Buck Shlegeris
bshlegeris@gmail.com
Dario Amodei
OpenAI
damodei@openai.com
Work done while at OpenAI.
###### Abstract
Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., [2017](https://ar5iv.labs.arxiv.org/html/1810.08575#bib.bib4 ""); Silver et al., [2017b](https://ar5iv.labs.arxiv.org/html/1810.08575#bib.bib22 "")), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
## 1 Introduction
If we want to train an ML system to perform a task, we need to be able to evaluate how well it is doing. Whether our training signal takes the form of labels, rewards, or something else entirely, we need some way to generate that signal.
If our goal can be evaluated automatically, such as winning a game of Go, or if we have an algorithm that can generate examples of correct behavior, then generating a training signal is trivial. In these cases we might say that there is an “algorithmic” training signal.

Unfortunately, most useful tasks don’t have an algorithmic training signal. So in current applications of machine learning, humans often provide the training signal. This can be done by having a human demonstrate the task, for example labeling an image or teleoperating a robot, or by learning a reward function from human judgments. For these classes of tasks, we could say there is a “human” training signal.
However, there are harder tasks for which we can’t compute demonstrations or rewards even with human assistance, and for which we currently have no clear method to get a meaningful training signal. Consider making economic policy decisions, advancing the scientific frontier, or managing the security of a large network of computers. Some of these tasks are “beyond human scale” – a single human can’t perform them and can’t make sense of their massive observation space well enough to judge the behavior of an agent. It may be possible for a human to judge performance in the very long run (for example, by looking at economic growth over several years), but such long-term feedback is very slow to learn from. We currently have no way to learn how to perform such tasks much better than a human.
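The abstract’s core idea – building a training signal for a hard problem by combining a model’s answers to easier subproblems, then training the model to imitate the combined answer – can be sketched in a few lines. This is only an illustrative toy, not the paper’s implementation: the names `amplify`, `distill_step`, `decompose`, and `combine` are hypothetical, and the “trained model” here is just a lookup table standing in for supervised learning.

```python
def amplify(question, model, decompose, combine):
    """One amplification step: a weak overseer answers a hard question
    by splitting it into easier subquestions and delegating those to
    the current model, then combining the subanswers."""
    subquestions = decompose(question)
    subanswers = [model(q) for q in subquestions]
    return combine(question, subanswers)


def distill_step(questions, model, decompose, combine):
    """Use amplified answers as training targets; return a new 'model'
    that imitates them (a lookup table stands in for a learned net)."""
    targets = {q: amplify(q, model, decompose, combine) for q in questions}
    return targets.get
```

Iterating `distill_step` lets the model handle questions one decomposition level deeper each round, without any external reward function – the training signal comes entirely from decomposition and recombination.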
The overall situation is depicted in Table 1, which shows six different combinations of
... (truncated, 59 KB total)