Longterm Wiki

Brown et al. (2020)

paper

Authors

Tom B. Brown·Benjamin Mann·Nick Ryder·Melanie Subbiah·Jared Kaplan·Prafulla Dhariwal·Arvind Neelakantan·Pranav Shyam·Girish Sastry·Amanda Askell·Sandhini Agarwal·Ariel Herbert-Voss·Gretchen Krueger·Tom Henighan·Rewon Child·Aditya Ramesh·Daniel M. Ziegler·Jeffrey Wu·Clemens Winter·Christopher Hesse·Mark Chen·Eric Sigler·Mateusz Litwin·Scott Gray·Benjamin Chess·Jack Clark·Christopher Berner·Sam McCandlish·Alec Radford·Ilya Sutskever·Dario Amodei

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
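The "few-shot setting" described in the abstract means the task is specified purely as text: a handful of solved demonstrations are concatenated with a new query, and the model continues the pattern with no gradient updates. A minimal sketch of how such a prompt is assembled, using the paper's word-unscrambling task as the example — the helper name and prompt format here are illustrative, not taken from any released code:

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, K solved examples, and a new query
    into a single text prompt for in-context learning."""
    lines = [instruction, ""]
    for scrambled, word in demonstrations:
        lines.append(f"Scrambled: {scrambled}")
        lines.append(f"Word: {word}")
        lines.append("")
    lines.append(f"Scrambled: {query}")
    lines.append("Word:")  # the model is expected to continue from here
    return "\n".join(lines)

demos = [("pplea", "apple"), ("nanaba", "banana")]
prompt = build_few_shot_prompt(
    "Unscramble the letters into a word.", demos, "rgaep"
)
print(prompt)
```

The zero-, one-, and few-shot conditions evaluated in the paper differ only in how many demonstration pairs are included before the final query.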

Cited by 5 pages

Cached Content Preview

HTTP 200 · Fetched Feb 22, 2026 · 7 KB
[2005.14165] Language Models are Few-Shot Learners 
 Computer Science > Computation and Language

arXiv:2005.14165 (cs)

[Submitted on 28 May 2020 (v1), last revised 22 Jul 2020 (this version, v4)]
Title: Language Models are Few-Shot Learners

... (truncated, 7 KB total)
Resource ID: 2cab3ea10b8b7ae2 | Stable ID: MGZkNjA2Mj