Debate as Scalable Oversight
Authors: Geoffrey Irving, Paul Christiano, Dario Amodei
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
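To make the protocol in the abstract concrete, here is a minimal sketch of a single debate game. The `agent_a`, `agent_b`, and `judge` callables are hypothetical placeholders for illustration, not interfaces from the paper's code.

```python
# Minimal sketch of one debate game, assuming hypothetical callables:
# an agent maps (question, transcript) -> statement, and the judge maps
# (question, transcript) -> index of the winning agent (0 or 1).

def play_debate(question, agent_a, agent_b, judge, max_statements=6):
    """Two agents alternate short statements; a judge then picks a winner."""
    transcript = []
    agents = (agent_a, agent_b)
    for turn in range(max_statements):
        speaker = turn % 2
        # Each agent sees the full history of the debate so far.
        statement = agents[speaker](question, transcript)
        transcript.append((speaker, statement))
    # The judge rewards whichever agent gave the most true, useful
    # information; the game is zero sum, so the other agent loses.
    return judge(question, transcript), transcript
```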
Cited by 9 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| AI Accident Risk Cruxes | Crux | 67.0 |
| Paul Christiano | Person | 39.0 |
| AI-Assisted Alignment | Approach | 63.0 |
| AI Alignment | Approach | 91.0 |
| AI Safety via Debate | Approach | 70.0 |
| Scalable Oversight | Safety Agenda | 68.0 |
| Instrumental Convergence | Risk | 64.0 |
| Optimistic Alignment Worldview | Concept | 91.0 |
Cached Content Preview
# AI safety via debate
Geoffrey Irving (corresponding author: irving@openai.com), Paul Christiano, Dario Amodei
OpenAI
###### Abstract
To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum _debate_ game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier’s accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
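Restating the complexity analogy in symbols (a hedged paraphrase, not the paper's exact formalism): direct judging corresponds to checking a single witness, while debate lets the agents play out the alternating quantifiers of a quantified Boolean formula (QBF), whose evaluation is PSPACE-complete, even though the judge still checks only a polynomial-size transcript.

```latex
% Direct judging: a polynomial-time judge M accepts a single witness,
% which captures NP:
\[
  x \in L \iff \exists w \;\, M(x, w) = 1 .
\]
% Debate: the two agents alternately choose the moves w_1, w_2, ...,
% playing the roles of the alternating quantifiers of a QBF, so optimal
% play can decide PSPACE questions:
\[
  x \in L \iff \exists w_1 \,\forall w_2 \,\exists w_3 \cdots Q\, w_n
  \;\, M(x, w_1, \ldots, w_n) = 1 .
\]
```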
## 1 Introduction
Learning to align an agent’s actions with the values and preferences of humans is a key challenge in ensuring that advanced AI systems remain safe \[Russell et al., [2016](https://ar5iv.labs.arxiv.org/html/1805.00899#bib.bib1 "")\]. Subtle problems in alignment can lead to unexpected and potentially unsafe behavior \[Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/1805.00899#bib.bib2 "")\], and we expect this problem to get worse as systems become more capable. Alignment is a training-time problem: it is difficult to retroactively fix the behavior and incentives of trained unaligned agents. Alignment likely requires interaction with humans during training, but care is required in choosing the precise form of the interaction as supervising the agent may itself be a challenging cognitive task.
For some tasks it is harder to bring behavior in line with human goals than for others. In simple cases, humans can directly demonstrate the behavior—this is the case of supervised learning or imitation learning, for example classifying an image or using a robotic gripper to pick up a block. For these tasks alignment with human preferences can in principle be achieved by imitating the human, and is implicit in existing ML approaches (although issues of bias in the training data still arise, see e.g. Mitchell and Shadlen \[ [2018](https://ar5iv.labs.arxiv.org/html/1805.0
... (truncated, 98 KB total)
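To make the abstract's MNIST experiment concrete, here is a hedged sketch: a judge pretrained to classify digits from a few visible pixels, and two debaters who alternately reveal pixels, one arguing for the true label and one for a lie. The `sparse_judge` signature and the greedy pixel-selection strategy are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a pixel-revealing debate on MNIST. Assumes a hypothetical
# judge, sparse_judge(masked_image, mask) -> probability vector over 10
# digits, pretrained on images with only a few visible pixels.
import numpy as np

def pixel_debate(image, true_label, liar_label, sparse_judge, n_pixels=6):
    """The honest debater argues for true_label, the liar for liar_label;
    they alternate revealing one pixel each until n_pixels are visible."""
    mask = np.zeros_like(image, dtype=bool)  # revealed pixels so far
    claims = (true_label, liar_label)
    for turn in range(n_pixels):
        claim = claims[turn % 2]
        # Greedy stand-in for the paper's agents: reveal whichever hidden
        # pixel most increases the judge's probability of this debater's
        # claimed label.
        best, best_score = None, -np.inf
        for idx in zip(*np.nonzero(~mask)):
            trial = mask.copy()
            trial[idx] = True
            score = sparse_judge(image * trial, trial)[claim]
            if score > best_score:
                best, best_score = idx, score
        mask[best] = True
    probs = sparse_judge(image * mask, mask)
    # Verdict: whichever claimed label the judge now finds more probable.
    return true_label if probs[true_label] >= probs[liar_label] else liar_label
```

Under a protocol of this shape, the abstract reports the 6-pixel judge improving from 59.4% accuracy (direct sparse classification) to 88.9% (debate), and the 4-pixel judge from 48.2% to 85.2%.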