Brown et al. (2020)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Brown et al.'s GPT-3 paper demonstrates few-shot learning in large language models, making it foundational to understanding AI capabilities, alignment challenges, and the emergence of unexpected behaviors at scale, all topics central to AI safety research.
Paper Details
Metadata
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Summary
Brown et al. (2020) introduce GPT-3, a 175-billion-parameter autoregressive language model that demonstrates strong few-shot learning without task-specific fine-tuning. By scaling model size 10x beyond previous non-sparse models, GPT-3 achieves competitive performance on diverse NLP tasks, including translation, question answering, reasoning, and arithmetic, through text-based prompting alone. The paper shows that scale enables task-agnostic performance approaching human-like few-shot learning, while also identifying limitations and societal concerns, including the model's ability to generate news articles that human evaluators struggle to distinguish from human-written ones.
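The few-shot setting the paper describes can be made concrete: the task description and K solved demonstrations are concatenated into a single text prompt, and the frozen model simply continues the text, with no gradient updates. Below is a minimal sketch in Python; the `complete` function is a hypothetical stand-in for any autoregressive language model's text-completion API, not code from the paper.

```python
# Minimal sketch of few-shot ("in-context") prompting as described in the
# abstract: the task and K demonstrations are supplied purely as text, and
# the model is queried without any gradient updates or fine-tuning.

def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, K solved examples, and the query."""
    lines = [task_description, ""]
    for source, target in demonstrations:   # the K "shots"
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")                      # the model continues from here
    return "\n".join(lines)

# Example: the word-unscrambling task mentioned in the abstract (K = 2).
prompt = build_few_shot_prompt(
    "Unscramble the letters to form an English word.",
    demonstrations=[("lpepa", "apple"), ("nolem", "lemon")],
    query="sheuo",
)
# answer = complete(prompt)  # hypothetical LM call; expected answer: "house"
print(prompt)
```

The same pattern covers the zero-shot and one-shot settings the paper evaluates: simply vary the number of demonstrations between zero and K.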
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| State-Space Models / Mamba | Capability | 54.0 |
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| OpenAI | Organization | 62.0 |
| AI-Driven Concentration of Power | Risk | 65.0 |
| Emergent Capabilities | Risk | 61.0 |
1 FactBase fact citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| OpenAI | Model Parameters | 175 billion | Jun 2020 |
Cached Content Preview
Language Models are Few-Shot Learners
Tom B. Brown*, Benjamin Mann*, Nick Ryder*, Melanie Subbiah*, Jared Kaplan†, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
OpenAI
*Equal contribution. †Johns Hopkins University, OpenAI.
Author contributions listed at end of paper.
(2020)
1 Introduction
Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [82, 102] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [24, 81, 100] (though still applied to task-specific architectures),
... (truncated, 98 KB total)
Stable ID: sid_8bUZn8tsWV