
Cited by 8 pages

| Page | Type | Quality |
|---|---|---|
| Alignment Research Center | Organization | 57.0 |
| METR | Organization | 66.0 |
| OpenAI | Organization | 62.0 |
| AI Governance and Policy | Crux | 66.0 |
| Red Teaming | Approach | 65.0 |
| Sandboxing / Containment | Approach | 91.0 |
| Emergent Capabilities | Risk | 61.0 |
| AI Proliferation | Risk | 60.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 27, 2026 · 98 KB
# GPT-4 System Card

OpenAI

**Abstract**

Large language models (LLMs) are being deployed in many domains of our lives ranging from browsing, to voice assistants, to coding assistance tools, and have potential for vast societal impacts.\[1, 2, 3, 4, 5, 6, 7\] This system card analyzes GPT-4, the latest LLM in the GPT family of models.\[8, 9, 10\] First, we highlight safety challenges presented by the model’s limitations (e.g., producing convincing text that is subtly false) and capabilities (e.g., increased adeptness at providing illicit advice, performance in dual-use capabilities, and risky emergent behaviors). Second, we give a high-level overview of the safety processes OpenAI adopted to prepare GPT-4 for deployment. This spans our work across measurements, model-level changes, product- and system-level interventions (such as monitoring and policies), and external expert engagement. Finally, we demonstrate that while our mitigations and processes alter GPT-4’s behavior and prevent certain kinds of misuses, they are limited and remain brittle in some cases. This points to the need for anticipatory planning and governance.\[11\]

**Content Warning: This document contains content that some may find disturbing or offensive,** including content that is sexual, hateful, or violent in nature.

## 1 Introduction

Large language models, also known as LLMs, have become an increasingly prevalent part of our day-to-day lives, with their use extending to a wide range of domains including web browsing, voice assistants, and coding assistance tools.\[1, 2, 3, 4\] These models have the potential to significantly impact society in numerous ways.\[5, 6, 7\] This system card analyzes GPT-4, the latest large language model in the GPT family of models.\[8, 9, 10\] Since it finished training in August of 2022, we have been evaluating, adversarially testing, and iteratively improving the model and the system-level mitigations around it. Our mitigations and processes alter GPT-4’s behavior and prevent certain kinds of misuses, though they have limitations, pointing to the need for anticipatory planning and governance\[11\] and further safety research. Our approach to deployment balances minimizing risk from deployment, enabling positive use cases, and learning from deployment.

GPT models are often trained in two stages. First, they are trained, using a large dataset of text from the Internet, to predict the next word. The models are then fine-tuned with additional data, using an algorithm called reinforcement learning from human feedback (RLHF), to produce outputs that are preferred by human labelers.\[10, 12, 13\] Training language models on large text datasets has given rise to capabilities such as few-shot learning\[10\] and the ability to carry out a wide range of natural language tasks spanning different domains, including question answering, arithmetic, and classification. Fine-tuning has made these models more controllable and useful.
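The two-stage recipe described above — next-word prediction followed by preference-based fine-tuning — can be illustrated with a deliberately tiny sketch. The corpus, the bigram model, and the `preference` scores below are illustrative assumptions for exposition, not OpenAI's actual training method or data:

```python
from collections import Counter, defaultdict

# Stage 1 (pretraining sketch): fit a bigram next-word model by counting
# which word follows which in a toy corpus.
corpus = "the model predicts the next word and the model learns".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often during 'training'."""
    return counts[word].most_common(1)[0][0]

# Stage 2 (RLHF-flavored sketch): reweight candidate continuations by a
# stand-in "human preference" score instead of raw corpus frequency.
preference = {"model": 0.1, "next": 1.0}  # hypothetical labeler scores

def predict_preferred(word):
    candidates = counts[word]
    return max(candidates, key=lambda w: candidates[w] * preference.get(w, 0.5))

print(predict_next("the"))       # prints "model" (most frequent in corpus)
print(predict_preferred("the"))  # prints "next" (preference overrides frequency)
```

The point of the sketch is the contrast: the pretrained predictor reproduces corpus statistics, while the preference-weighted predictor can produce a different continuation for the same prompt — a crude analogue of how RLHF shifts model behavior toward labeler-preferred outputs.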

### 1.1 Overview of findings and mitigations

... (truncated, 98 KB total)