OpenAI's GPT-4

paper

2023·arXiv·arxiv.org/abs/2303.08774

Authors

OpenAI·Josh Achiam·Steven Adler·Sandhini Agarwal·Lama Ahmad·Ilge Akkaya·Florencia Leoni Aleman·Diogo Almeida·Janko Altenschmidt·Sam Altman·Shyamal Anadkat·Red Avila·Igor Babuschkin·Suchir Balaji·Valerie Balcom·Paul Baltescu·Haiming Bao·Mohammad Bavarian·Jeff Belgum·Irwan Bello·Jake Berdine·Gabriel Bernadett-Shapiro·Christopher Berner·Lenny Bogdonoff·Oleg Boiko·Madelaine Boyd·Anna-Luisa Brakman·Greg Brockman·Tim Brooks·Miles Brundage·Kevin Button·Trevor Cai·Rosie Campbell·Andrew Cann·Brittany Carey·Chelsea Carlson·Rory Carmichael·Brooke Chan·Che Chang·Fotis Chantzis·Derek Chen·Sully Chen·Ruby Chen·Jason Chen·Mark Chen·Ben Chess·Chester Cho·Casey Chu·Hyung Won Chung·Dave Cummings·Jeremiah Currier·Yunxing Dai·Cory Decareaux·Thomas Degry·Noah Deutsch·Damien Deville·Arka Dhar·David Dohan·Steve Dowling·Sheila Dunning·Adrien Ecoffet·Atty Eleti·Tyna Eloundou·David Farhi·Liam Fedus·Niko Felix·Simón Posada Fishman·Juston Forte·Isabella Fulford·Leo Gao·Elie Georges·Christian Gibson·Vik Goel·Tarun Gogineni·Gabriel Goh·Rapha Gontijo-Lopes·Jonathan Gordon·Morgan Grafstein·Scott Gray·Ryan Greene·Joshua Gross·Shixiang Shane Gu·Yufei Guo·Chris Hallacy·Jesse Han·Jeff Harris·Yuchen He·Mike Heaton·Johannes Heidecke·Chris Hesse·Alan Hickey·Wade Hickey·Peter Hoeschele·Brandon Houghton·Kenny Hsu·Shengli Hu·Xin Hu·Joost Huizinga·Shantanu Jain·Shawn Jain·Joanne Jang·Angela Jiang·Roger Jiang·Haozhun Jin·Denny Jin·Shino Jomoto·Billie Jonn·Heewoo Jun·Tomer Kaftan·Łukasz Kaiser·Ali Kamali·Ingmar Kanitscheider·Nitish Shirish Keskar·Tabarak Khan·Logan Kilpatrick·Jong Wook Kim·Christina Kim·Yongjik Kim·Jan Hendrik Kirchner·Jamie Kiros·Matt Knight·Daniel Kokotajlo·Łukasz Kondraciuk·Andrew Kondrich·Aris Konstantinidis·Kyle Kosic·Gretchen Krueger·Vishal Kuo·Michael Lampe·Ikai Lan·Teddy Lee·Jan Leike·Jade Leung·Daniel Levy·Chak Ming Li·Rachel Lim·Molly Lin·Stephanie Lin·Mateusz Litwin·Theresa Lopez·Ryan Lowe·Patricia Lue·Anna Makanju·Kim Malfacini·Sam Manning·Todor Markov·Yaniv Markovski·Bianca Martin·Katie Mayer·Andrew Mayne·Bob McGrew·Scott Mayer McKinney·Christine McLeavey·Paul McMillan·Jake McNeil·David Medina·Aalok Mehta·Jacob Menick·Luke Metz·Andrey Mishchenko·Pamela Mishkin·Vinnie Monaco·Evan Morikawa·Daniel Mossing·Tong Mu·Mira Murati·Oleg Murk·David Mély·Ashvin Nair·Reiichiro Nakano·Rajeev Nayak·Arvind Neelakantan·Richard Ngo·Hyeonwoo Noh·Long Ouyang·Cullen O'Keefe·Jakub Pachocki·Alex Paino·Joe Palermo·Ashley Pantuliano·Giambattista Parascandolo·Joel Parish·Emy Parparita·Alex Passos·Mikhail Pavlov·Andrew Peng·Adam Perelman·Filipe de Avila Belbute Peres·Michael Petrov·Henrique Ponde de Oliveira Pinto·Michael (Rai) Pokorny·Michelle Pokrass·Vitchyr H. Pong·Tolly Powell·Alethea Power·Boris Power·Elizabeth Proehl·Raul Puri·Alec Radford·Jack Rae·Aditya Ramesh·Cameron Raymond·Francis Real·Kendra Rimbach·Carl Ross·Bob Rotsted·Henri Roussez·Nick Ryder·Mario Saltarelli·Ted Sanders·Shibani Santurkar·Girish Sastry·Heather Schmidt·David Schnurr·John Schulman·Daniel Selsam·Kyla Sheppard·Toki Sherbakov·Jessica Shieh·Sarah Shoker·Pranav Shyam·Szymon Sidor·Eric Sigler·Maddie Simens·Jordan Sitkin·Katarina Slama·Ian Sohl·Benjamin Sokolowsky·Yang Song·Natalie Staudacher·Felipe Petroski Such·Natalie Summers·Ilya Sutskever·Jie Tang·Nikolas Tezak·Madeleine B. Thompson·Phil Tillet·Amin Tootoonchian·Elizabeth Tseng·Preston Tuggle·Nick Turley·Jerry Tworek·Juan Felipe Cerón Uribe·Andrea Vallone·Arun Vijayvergiya·Chelsea Voss·Carroll Wainwright·Justin Jay Wang·Alvin Wang·Ben Wang·Jonathan Ward·Jason Wei·CJ Weinmann·Akila Welihinda·Peter Welinder·Jiayi Weng·Lilian Weng·Matt Wiethoff·Dave Willner·Clemens Winter·Samuel Wolrich·Hannah Wong·Lauren Workman·Sherwin Wu·Jeff Wu·Michael Wu·Kai Xiao·Tao Xu·Sarah Yoo·Kevin Yu·Qiming Yuan·Wojciech Zaremba·Rowan Zellers·Chong Zhang·Marvin Zhang·Shengjia Zhao·Tianhao Zheng·Juntang Zhuang·William Zhuk·Barret Zoph

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

OpenAI's technical report on GPT-4, a multimodal large language model that reaches human-level performance on various professional and academic benchmarks. Relevant for understanding the capabilities, limitations, and alignment approaches of advanced AI systems.

Paper Details

Citations

2916 influential

Year

2023

arXiv:2303.08774 DOI:10.2139/ssrn.5205937 Semantic Scholar

Metadata

arxiv preprint·primary source

Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

Summary

OpenAI presents GPT-4, a large-scale multimodal model that accepts image and text inputs and produces text outputs. The model shows human-level performance on various professional and academic benchmarks, including a simulated bar exam score around the top 10% of test takers, though it remains less capable than humans in many real-world scenarios. GPT-4 is a Transformer-based model pre-trained to predict the next token, and its post-training alignment process improves measures of factuality and adherence to desired behavior. The project also developed infrastructure and optimization methods that behave predictably across scales, which allowed some aspects of GPT-4's performance to be predicted from models trained with no more than 1/1,000th of its compute.
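
The idea of predicting a large model's performance from much smaller runs can be illustrated with a minimal sketch. It assumes a pure power law relating loss to training compute, fitted by least squares in log-log space. Neither this functional form nor the numbers come from the text on this page: `fit_power_law`, `predict`, and the synthetic data are hypothetical and chosen only for illustration.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b in log-log space.

    Assumption for illustration: a pure power law with no irreducible
    loss term. Returns the fitted (a, b).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept mapped back from log space
    return a, b


def predict(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic data (NOT from the report): losses generated from a known power law.
TRUE_A, TRUE_B = 5.0, -0.05
# Compute is expressed as a fraction of the target run; the largest small run
# uses 1/1,000th of the target compute.
small_runs = [1e-6, 1e-5, 1e-4, 1e-3]
losses = [TRUE_A * c ** TRUE_B for c in small_runs]

a, b = fit_power_law(small_runs, losses)
predicted_full = predict(a, b, 1.0)  # extrapolate to the full-compute run
print(f"fitted a={a:.3f}, b={b:.3f}, predicted full-run loss={predicted_full:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters. Real training curves are noisy, so an extrapolation of this kind would carry uncertainty that the sketch does not model.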

Cited by 6 pages

Page	Type	Quality
Dense Transformers	Concept	58.0
Power-Seeking Emergence Conditions Model	Analysis	63.0
OpenAI	Organization	62.0
Sam Altman	Person	40.0
AI-Driven Concentration of Power	Risk	65.0
Deceptive Alignment	Risk	75.0

Cached Content Preview

HTTP 200·Fetched May 30, 2026·98 KB

GPT-4 Technical Report

 
 
OpenAI

Please cite this work as “OpenAI (2023)”. Full authorship contribution statements appear at the end of the document. Correspondence regarding this technical report can be sent to gpt4-report@openai.com.
 

 
Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.

 
 
 
 1 Introduction

 
This technical report presents GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs. Such models are an important area of study as they have the potential to be used in a wide range of applications, such as dialogue systems, text summarization, and machine translation. As such, they have been the subject of substantial interest and progress in recent years [1-34].

 
 
One of the main goals of developing such models is to improve their ability to understand and generate natural language text, particularly in more complex and nuanced scenarios. To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well and often outscores the vast majority of human test takers. For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers. This contrasts with GPT-3.5, which scores in the bottom 10%.

 
 
On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering). On the MMLU benchmark [35, 36], an English-language suite of multiple-choice questions covering 57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4 surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these model capability results, as well as model safety improvements and results, in more detail in lat

... (truncated, 98 KB total)

Resource ID: 29a0882390ee7063 | Stable ID: sid_1pgfnunKly