OpenAI's GPT-4
paperAuthors
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
OpenAI's technical report on GPT-4 detailing a multimodal large language model with human-level performance on benchmarks, relevant for understanding capabilities, limitations, and alignment approaches of advanced AI systems.
Paper Details
Metadata
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
Summary
OpenAI presents GPT-4, a large-scale multimodal model capable of processing both image and text inputs to generate text outputs. The model demonstrates human-level performance on professional and academic benchmarks, including achieving top 10% scores on simulated bar exams. Built on Transformer architecture with post-training alignment to improve factuality and behavioral adherence, GPT-4 represents advances in scaling infrastructure and predictive methods that enable performance estimation from models using 1/1000th of its computational resources.
Cited by 6 pages
| Page | Type | Quality |
|---|---|---|
| Dense Transformers | Concept | 58.0 |
| Power-Seeking Emergence Conditions Model | Analysis | 63.0 |
| OpenAI | Organization | 62.0 |
| Sam Altman | Person | 40.0 |
| AI-Driven Concentration of Power | Risk | 65.0 |
| Deceptive Alignment | Risk | 75.0 |
Cached Content Preview
GPT-4 Technical Report
OpenAI
Please cite this work as “OpenAI (2023)". Full authorship contribution statements appear at the end of the document. Correspondence regarding this technical report can be sent to gpt4-report@openai.com
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
GPT-4 is a Transformer-based model pre-trained to predict the next token in a document.
The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior.
A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based
on models trained with no more than 1/1,000th the compute of GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs. Such models are an important area of study as they have the potential to be used in a wide range of applications, such as dialogue systems, text summarization, and machine translation. As such, they have been the subject of substantial interest and progress in recent years [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 ] .
One of the main goals of developing such models is to improve their ability to understand and generate natural language text, particularly in more complex and nuanced scenarios.
To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well and often outscores the vast majority of human test takers. For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers. This contrasts with GPT-3.5, which scores in the bottom 10%.
On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering). On the MMLU benchmark [ 35 , 36 ] , an English-language suite of multiple-choice questions covering 57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4 surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these model capability results, as well as model safety improvements and results, in more detail in lat
... (truncated, 98 KB total)29a0882390ee7063 | Stable ID: sid_1pgfnunKly