Debate as Scalable Oversight
Authors: Geoffrey Irving, Paul Christiano, Dario Amodei
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
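To make the protocol in the abstract concrete, here is a minimal sketch of a single debate game. The `agent_a`, `agent_b`, and `judge` callables are hypothetical placeholders for illustration, not interfaces from the paper's code.

```python
# Minimal sketch of one debate game, assuming hypothetical callables:
# an agent maps (question, transcript) -> statement, and the judge maps
# (question, transcript) -> index of the winning agent (0 or 1).

def play_debate(question, agent_a, agent_b, judge, max_statements=6):
    """Two agents alternate short statements; a judge then picks a winner."""
    transcript = []
    agents = (agent_a, agent_b)
    for turn in range(max_statements):
        speaker = turn % 2
        # Each agent sees the full history of the debate so far.
        statement = agents[speaker](question, transcript)
        transcript.append((speaker, statement))
    # The judge rewards whichever agent gave the most true, useful
    # information; the game is zero sum, so the other agent loses.
    return judge(question, transcript), transcript
```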
Cited by 9 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| AI Accident Risk Cruxes | Crux | 67.0 |
| Paul Christiano | Person | 39.0 |
| AI-Assisted Alignment | Approach | 63.0 |
| AI Alignment | Approach | 91.0 |
| AI Safety via Debate | Approach | 70.0 |
| Scalable Oversight | Safety Agenda | 68.0 |
| Instrumental Convergence | Risk | 64.0 |
| Optimistic Alignment Worldview | Concept | 91.0 |
Cached Content Preview
# AI safety via debate
Geoffrey Irving (corresponding author: irving@openai.com), Paul Christiano, Dario Amodei
OpenAI
###### Abstract
To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum _debate_ game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier’s accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
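Restating the complexity analogy in symbols (a hedged paraphrase, not the paper's exact formalism): direct judging corresponds to checking a single witness, while debate lets the agents play out the alternating quantifiers of a quantified Boolean formula (QBF), whose evaluation is PSPACE-complete, even though the judge still checks only a polynomial-size transcript.

```latex
% Direct judging: a polynomial-time judge M accepts a single witness,
% which captures NP:
\[
  x \in L \iff \exists w \;\, M(x, w) = 1 .
\]
% Debate: the two agents alternately choose the moves w_1, w_2, ...,
% playing the roles of the alternating quantifiers of a QBF, so optimal
% play can decide PSPACE questions:
\[
  x \in L \iff \exists w_1 \,\forall w_2 \,\exists w_3 \cdots Q\, w_n
  \;\, M(x, w_1, \ldots, w_n) = 1 .
\]
```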
## 1 Introduction
Learning to align an agent’s actions with the values and preferences of humans is a key challenge in ensuring that advanced AI systems remain safe \[Russell et al., [2016](https://ar5iv.labs.arxiv.org/html/1805.00899#bib.bib1 "")\]. Subtle problems in alignment can lead to unexpected and potentially unsafe behavior \[Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/1805.00899#bib.bib2 "")\], and we expect this problem to get worse as systems become more capable. Alignment is a training-time problem: it is difficult to retroactively fix the behavior and incentives of trained unaligned agents. Alignment likely requires interaction with humans during training, but care is required in choosing the precise form of the interaction as supervising the agent may itself be a challenging cognitive task.
For some tasks it is harder to bring behavior in line with human goals than for others. In simple cases, humans can directly demonstrate the behavior—this is the case of supervised learning or imitation learning, for example classifying an image or using a robotic gripper to pick up a block. For these tasks alignment with human preferences can in principle be achieved by imitating the human, and is implicit in existing ML approaches (although issues of bias in the training data still arise, see e.g. Mitchell and Shadlen \[ [2018](https://ar5iv.labs.arxiv.org/html/1805.0
... (truncated, 98 KB total)
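To make the abstract's MNIST experiment concrete, here is a hedged sketch: a judge pretrained to classify digits from a few visible pixels, and two debaters who alternately reveal pixels, one arguing for the true label and one for a lie. The `sparse_judge` signature and the greedy pixel-selection strategy are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a pixel-revealing debate on MNIST. Assumes a hypothetical
# judge, sparse_judge(masked_image, mask) -> probability vector over 10
# digits, pretrained on images with only a few visible pixels.
import numpy as np

def pixel_debate(image, true_label, liar_label, sparse_judge, n_pixels=6):
    """The honest debater argues for true_label, the liar for liar_label;
    they alternate revealing one pixel each until n_pixels are visible."""
    mask = np.zeros_like(image, dtype=bool)  # revealed pixels so far
    claims = (true_label, liar_label)
    for turn in range(n_pixels):
        claim = claims[turn % 2]
        # Greedy stand-in for the paper's agents: reveal whichever hidden
        # pixel most increases the judge's probability of this debater's
        # claimed label.
        best, best_score = None, -np.inf
        for idx in zip(*np.nonzero(~mask)):
            trial = mask.copy()
            trial[idx] = True
            score = sparse_judge(image * trial, trial)[claim]
            if score > best_score:
                best, best_score = idx, score
        mask[best] = True
    probs = sparse_judge(image * mask, mask)
    # Verdict: whichever claimed label the judge now finds more probable.
    return true_label if probs[true_label] >= probs[liar_label] else liar_label
```

Under a protocol of this shape, the abstract reports the 6-pixel judge improving from 59.4% accuracy (direct sparse classification) to 88.9% (debate), and the 4-pixel judge from 48.2% to 85.2%.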