Brown et al. (2020)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Brown et al.'s GPT-3 paper demonstrates few-shot learning in large language models, making it foundational to understanding AI capabilities, alignment challenges, and the emergence of unexpected behaviors at scale, all topics central to AI safety research.
Paper Details
Metadata
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Summary
Brown et al. (2020) introduce GPT-3, a 175-billion-parameter autoregressive language model that demonstrates strong few-shot learning without task-specific fine-tuning. By scaling model size 10x beyond previous non-sparse models, GPT-3 achieves competitive performance on diverse NLP tasks, including translation, question answering, reasoning, and arithmetic, through text-based prompting alone. The paper shows that scale enables task-agnostic performance approaching human-like few-shot learning, while also identifying limitations and societal concerns, including the model's ability to generate news articles that human evaluators struggle to distinguish from human-written ones.
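The few-shot setting the paper describes can be made concrete: the task description and K solved demonstrations are concatenated into a single text prompt, and the frozen model simply continues the text, with no gradient updates. Below is a minimal sketch in Python; the `complete` function is a hypothetical stand-in for any autoregressive language model's text-completion API, not code from the paper.

```python
# Minimal sketch of few-shot ("in-context") prompting as described in the
# abstract: the task and K demonstrations are supplied purely as text, and
# the model is queried without any gradient updates or fine-tuning.

def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, K solved examples, and the query."""
    lines = [task_description, ""]
    for source, target in demonstrations:   # the K "shots"
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")                      # the model continues from here
    return "\n".join(lines)

# Example: the word-unscrambling task mentioned in the abstract (K = 2).
prompt = build_few_shot_prompt(
    "Unscramble the letters to form an English word.",
    demonstrations=[("lpepa", "apple"), ("nolem", "lemon")],
    query="sheuo",
)
# answer = complete(prompt)  # hypothetical LM call; expected answer: "house"
print(prompt)
```

The same pattern covers the zero-shot and one-shot settings the paper evaluates: simply vary the number of demonstrations between zero and K.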
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| State-Space Models / Mamba | Capability | 54.0 |
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| OpenAI | Organization | 62.0 |
| AI-Driven Concentration of Power | Risk | 65.0 |
| Emergent Capabilities | Risk | 61.0 |
1 FactBase fact citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| OpenAI | Model Parameters | 175 billion | Jun 2020 |
Cached Content Preview
Language Models are Few-Shot Learners
Tom B. Brown*, Benjamin Mann*, Nick Ryder*, Melanie Subbiah*, Jared Kaplan†, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
OpenAI
*Equal contribution. †Johns Hopkins University, OpenAI.
Author contributions listed at end of paper.
(2020)
1 Introduction
Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [82, 102] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [24, 81, 100] (though still applied to task-specific architectures),
... (truncated, 98 KB total)
Stable ID: sid_8bUZn8tsWV