Longterm Wiki

Chain-of-thought analysis

paper

Authors

Jason Wei·Xuezhi Wang·Dale Schuurmans·Maarten Bosma·Brian Ichter·Fei Xia·Ed Chi·Quoc Le·Denny Zhou

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Foundational research demonstrating that chain-of-thought prompting significantly improves large language model reasoning capabilities, which is relevant to understanding AI capabilities, limitations, and potential safety implications of advanced reasoning in LLMs.

Paper Details

Citations
0
1205 influential
Year
2022
Methodology
peer-reviewed
Categories
Computation and Language

Metadata

arXiv preprint · primary source

Abstract

We explore how generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Summary

This paper demonstrates that chain-of-thought (CoT) prompting—providing intermediate reasoning steps as examples—significantly enhances large language models' complex reasoning capabilities. By prompting models with just a few CoT demonstrations, the authors show substantial performance improvements across arithmetic, commonsense, and symbolic reasoning tasks. Notably, a 540B-parameter model with eight CoT exemplars achieves state-of-the-art results on GSM8K math word problems, outperforming finetuned GPT-3 with a verifier, suggesting that reasoning abilities emerge naturally in sufficiently large models through this simple prompting technique.
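The prompting technique the summary describes can be sketched in a few lines: each few-shot exemplar pairs a question with a worked rationale that ends in the final answer, and the test question is appended with an empty answer slot for the model to complete. The sketch below assumes a generic text-completion interface and omits the actual model call; the exemplar and test question are the tennis-balls and cafeteria examples shown in Figure 1 of the paper.

```python
# Minimal sketch of chain-of-thought (CoT) prompt construction.
# No model is called here; the returned string would be sent to an
# LLM completion endpoint of your choice.

def build_cot_prompt(exemplars, question):
    """Join few-shot (question, rationale-with-answer) pairs,
    then append the test question with an open answer slot."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Exemplar from Figure 1 of the paper: the rationale spells out the
# intermediate arithmetic before stating the final answer.
exemplars = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
     "6 tennis balls. 5 + 6 = 11. The answer is 11."),
]

prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

With standard prompting the exemplar answer would be just "The answer is 11."; the paper's finding is that including the worked rationale in the exemplars elicits analogous step-by-step reasoning on the test question.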

Cited by 5 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
[2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
 Chain-of-Thought Prompting Elicits Reasoning 
 in Large Language Models

 
 
 
Jason Wei · Xuezhi Wang · Dale Schuurmans · Maarten Bosma
Brian Ichter · Fei Xia · Ed H. Chi · Quoc V. Le · Denny Zhou
Google Research, Brain Team
{jasonwei,dennyzhou}@google.com
 
 
 

 
 Abstract

We explore how generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of large language models to perform complex reasoning.
In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.

 Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
The empirical gains can be striking.
For instance, prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

 
 
 Figure 1: 
Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks.
Chain-of-thought reasoning processes are highlighted.
 
 
 
[Bar chart: GSM8K math word problems, solve rate (%). Finetuned GPT-3 175B: 33; Prior best: 55; PaLM 540B, standard prompting: 18; PaLM 540B, chain-of-thought prompting: 57.]
 Figure 2: 
PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems.
Finetuned GPT-3 and prior best are from Cobbe et al. ( 2021 ) .
 
 
 
 
 1 Introduction

 
 The NLP landscape has recently been revolutionized by language models (Peters et al., 2018 ; Devlin et al., 2019 ; Brown et al., 2020 , inter alia ) .
Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020 ; Brown et al., 2020 , inter alia ) .
However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., 2021 ) .

 
 
 This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer.
Prior work has given models the ability to generate natural language intermediate steps by training from scratch (Ling et al., 2017 ) or finetuning a pretrained model (Cobbe et al., 2021 ) , in addition to neuro-symbolic methods that

... (truncated, 98 KB total)