Training Language Models to Follow Instructions with Human Feedback

paper

2022·arXiv·arxiv.org/abs/2203.02155

Authors

Long Ouyang·Jeff Wu·Xu Jiang·Diogo Almeida·Carroll L. Wainwright·Pamela Mishkin·Chong Zhang·Sandhini Agarwal·Katarina Slama·Alex Ray·John Schulman·Jacob Hilton·Fraser Kelton·Luke Miller·Maddie Simens·Amanda Askell·Peter Welinder·Paul Christiano·Jan Leike·Ryan Lowe

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This is the seminal InstructGPT paper from OpenAI that popularized RLHF as the dominant alignment training paradigm; it directly underpins ChatGPT and is essential reading for anyone studying LLM alignment techniques.

Paper Details

Citations

19,177

2092 influential

Year

2022

arXiv:2203.02155 · DOI:10.52202/068431-2011 · Semantic Scholar

Metadata

Importance: 95/100 · arXiv preprint · primary source

Abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Summary

This paper introduces InstructGPT, a family of models produced by aligning GPT-3 with human intent using Reinforcement Learning from Human Feedback (RLHF). By fine-tuning GPT-3 first on labeler demonstrations and then on human rankings of model outputs, the authors show that human evaluators prefer outputs from a 1.3B-parameter aligned model to those from the 175B-parameter unaligned GPT-3. The work establishes RLHF as a foundational technique for making LLMs safer and more helpful.
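The step from "rankings of model outputs" to a trainable signal can be illustrated with a pairwise logistic (Bradley-Terry style) loss, a common way to fit a reward model to preference data. This is a minimal sketch in plain Python, not the authors' code: the function names and the scalar rewards are hypothetical, and the preview on this page does not show the paper's exact formulation.

```python
import math
from typing import Sequence


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    output well above the rejected one.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_loss(rewards_best_to_worst: Sequence[float]) -> float:
    """Average the pairwise loss over every (better, worse) pair in one ranking.

    Assumption: the labeler's ranking is given as reward-model scores
    ordered from the most preferred output to the least preferred one.
    """
    n = len(rewards_best_to_worst)
    if n < 2:
        raise ValueError("a ranking needs at least two outputs")
    losses = [
        pairwise_preference_loss(rewards_best_to_worst[i], rewards_best_to_worst[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(losses) / len(losses)
```

A reward model whose scores agree with the labeler's ordering gets a lower loss than one whose scores are reversed, which is what gradient descent on this quantity would exploit.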

Key Points

• Introduces a three-stage RLHF pipeline: supervised fine-tuning on demonstrations, reward model training on human preference rankings, then PPO optimization against the reward model
• Human raters prefer 1.3B InstructGPT outputs to 175B GPT-3 outputs, showing that alignment fine-tuning can matter more than parameter count for perceived output quality
• Aligned models show reductions in toxic outputs and improved truthfulness compared to base GPT-3
• Identifies an 'alignment tax' (performance regressions on public NLP datasets) and shows it can be reduced by mixing pre-training data into RLHF fine-tuning
• Foundational empirical paper whose RLHF recipe underpins ChatGPT and influenced Claude and most modern aligned LLMs
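The last pipeline stage and the 'alignment tax' mitigation can be sketched together as one scalar objective: a reward-model score with a penalty for drifting from the supervised (SFT) policy, plus a weighted pre-training log-likelihood term. The sketch below is a simplification under stated assumptions: the names `RLSample`, `beta`, and `gamma` are illustrative, the numbers are made up, and a real PPO implementation adds a clipped surrogate loss and value function that are omitted here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RLSample:
    """One sampled response with the quantities the objective needs (hypothetical)."""
    reward: float        # reward-model score for the response
    logp_policy: float   # log-probability of the response under the policy being trained
    logp_sft: float      # log-probability of the same response under the SFT model


def kl_penalized_reward(sample: RLSample, beta: float) -> float:
    """Reward minus beta times the per-sample log-ratio between policy and SFT model."""
    return sample.reward - beta * (sample.logp_policy - sample.logp_sft)


def ppo_ptx_objective(
    rl_samples: Sequence[RLSample],
    pretrain_logps: Sequence[float],
    beta: float,
    gamma: float,
) -> float:
    """Quantity to maximize: mean penalized reward + gamma * mean pre-training log-likelihood.

    With gamma == 0 this reduces to the variant without the pre-training mix.
    """
    rl_term = sum(kl_penalized_reward(s, beta) for s in rl_samples) / len(rl_samples)
    ptx_term = sum(pretrain_logps) / len(pretrain_logps)
    return rl_term + gamma * ptx_term


if __name__ == "__main__":
    samples = [RLSample(reward=1.0, logp_policy=-2.0, logp_sft=-2.5)]
    print(ppo_ptx_objective(samples, [-3.0, -1.0], beta=0.1, gamma=0.5))
```

The pre-training term rewards the policy for still assigning high likelihood to ordinary pre-training text, which is the intuition behind reducing regressions on public NLP datasets.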

Cited by 8 pages

Page	Type	Quality
Dense Transformers	Concept	58.0
AI Safety Defense in Depth Model	Analysis	69.0
AI Safety Intervention Effectiveness Matrix	Analysis	73.0
OpenAI	Organization	62.0
AI Alignment	Approach	91.0
Reward Modeling	Approach	55.0
RLHF	Research Area	63.0
Optimistic Alignment Worldview	Concept	91.0

Cached Content Preview

HTTP 200 · Fetched Apr 24, 2026 · 98 KB

 
 Training language models to follow instructions
 with human feedback

 
 
Long Ouyang · Jeff Wu∗ · Xu Jiang∗ · Diogo Almeida∗ · Carroll L. Wainwright∗ · Pamela Mishkin∗ · Chong Zhang · Sandhini Agarwal · Katarina Slama · Alex Ray · John Schulman · Jacob Hilton · Fraser Kelton · Luke Miller · Maddie Simens · Amanda Askell · Peter Welinder · Paul Christiano∗† · Jan Leike∗ · Ryan Lowe∗
OpenAI
∗ Primary authors. This was a joint project of the OpenAI Alignment team. RL and JL are the team leads. Corresponding author: lowe@openai.com.
† Work done while at OpenAI. Current affiliations: AA: Anthropic; PC: Alignment Research Center.
 

 
 Abstract

 Making language models bigger does not inherently make them better at following a user’s intent.
For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user.
In other words, these models are not aligned with their users.
In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.
Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning.
We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback.
We call the resulting models InstructGPT.
In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.
Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

 
 
 Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model.
Our InstructGPT models (PPO-ptx) as well as its variant trained without pretraining mix (PPO) significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals. 
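The quantity plotted in Figure 1, how often one model's outputs are preferred over the baseline's, together with a 95% confidence interval, can be computed roughly as follows. This sketch assumes a simple normal-approximation (Wald) interval and uses made-up counts; the preview does not say which interval method the paper uses, so the function name and numbers are illustrative only.

```python
import math
from typing import Tuple


def win_rate_with_ci(wins: int, comparisons: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (win_rate, lower, upper) using a normal-approximation interval.

    z = 1.96 corresponds to a two-sided 95% interval. Bounds are clipped to [0, 1].
    """
    if comparisons <= 0:
        raise ValueError("need at least one comparison")
    if not 0 <= wins <= comparisons:
        raise ValueError("wins must be between 0 and comparisons")
    rate = wins / comparisons
    half_width = z * math.sqrt(rate * (1.0 - rate) / comparisons)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


if __name__ == "__main__":
    # Hypothetical counts: the candidate model was preferred in 50 of 100 comparisons.
    rate, lower, upper = win_rate_with_ci(wins=50, comparisons=100)
    print(f"win rate {rate:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

With few comparisons or win rates near 0 or 1, the normal approximation becomes unreliable and an interval such as Wilson's is usually a better choice.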
 
 
 
 1 Introduction

 
 Large language models (LMs) can be “prompted” to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021;

... (truncated, 98 KB total)

Resource ID: 1098fc60be7ca2b0 | Stable ID: sid_PAyv52HUgJ