[2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
DeepSeek-R1 demonstrates that large-scale reinforcement learning alone can elicit strong reasoning capabilities in LLMs, matching OpenAI-o1 performance. This is significant for AI safety: it shows rapid capability jumps via RL and raises questions about emergent behaviors and alignment when reasoning capabilities self-evolve without supervised data.
Metadata
Summary
DeepSeek-R1 introduces reasoning models trained via large-scale reinforcement learning, with DeepSeek-R1-Zero achieving strong reasoning without supervised fine-tuning. DeepSeek-R1 adds cold-start data and multi-stage training to address readability and language-mixing issues, achieving performance comparable to OpenAI-o1. The authors open-source the models and six distilled variants ranging from 1.5B to 70B parameters.
Key Points
- DeepSeek-R1-Zero, trained purely via RL (no SFT), naturally develops powerful reasoning behaviors, improving AIME 2024 pass@1 from 15.6% to 71.0% (see the pass@1 sketch after this list).
- DeepSeek-R1 uses cold-start data and multi-stage RL training to fix readability and language-mixing issues while matching OpenAI-o1-1217 on reasoning benchmarks.
- Six distilled dense models (1.5B–70B) based on Qwen and Llama are open-sourced, democratizing access to strong reasoning models.
- The work demonstrates inference-time scaling via chain-of-thought length, with emergent reasoning behaviors arising spontaneously from the RL process.
- Pure RL without supervised data can drive significant capability jumps, raising important questions about emergent behaviors and alignment in self-evolving models.
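A note on the headline metric: the paper evaluates with sampled generations and reports pass@1 as the average correctness over k sampled responses per problem. A minimal sketch of that computation (the k = 16 sample data below is illustrative, not from the paper):

```python
def pass_at_1(correct_flags):
    """pass@1 = (1/k) * sum(p_i), where p_i is 1 if the i-th of k
    sampled responses is correct and 0 otherwise."""
    return sum(correct_flags) / len(correct_flags)

# Illustrative only: 16 sampled answers to one AIME problem, 11 correct.
print(pass_at_1([1] * 11 + [0] * 5))  # 0.6875
```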
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Preference Optimization Methods | Approach | 62.0 |
Cached Content Preview
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.
However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance,
we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL.
DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks.
To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
Figure 1: Benchmark performance of DeepSeek-R1.
1 Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources compared to pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community.
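As an illustration of what "inference-time scaling" means operationally, the sketch below measures accuracy as a function of the reasoning-token budget; `generate` and `is_correct` are hypothetical placeholders for a model call and a rule-based answer checker, not APIs from the paper:

```python
def generate(prompt: str, max_thinking_tokens: int) -> str:
    """Placeholder: call a reasoning model with a chain-of-thought budget."""
    raise NotImplementedError

def is_correct(answer: str, reference: str) -> bool:
    """Placeholder: rule-based check of the final answer."""
    raise NotImplementedError

def scaling_curve(problems, budgets=(512, 2048, 8192, 32768)):
    """Accuracy vs. thinking budget: the x-axis of the scaling curve is
    chain-of-thought length, not model size or training compute."""
    curve = {}
    for budget in budgets:
        scores = [
            is_correct(generate(q, max_thinking_tokens=budget), ref)
            for q, ref in problems
        ]
        curve[budget] = sum(scores) / len(scores)
    return curve
```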
Several prior works have explored various approaches, including process-based reward models (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024).
However, none of these methods has achieved general reasoning performance comparable to OpenAI’s o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL).
Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process.
Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework
... (truncated, 59 KB total)
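The cached preview breaks off at GRPO. For orientation: GRPO (Group Relative Policy Optimization, Shao et al., 2024) drops the learned critic of PPO and instead samples a group of outputs per prompt, normalizing each output's reward by the group's mean and standard deviation to form advantages. A minimal sketch of that advantage step (reward values are illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r),
    computed within one group of sampled outputs for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One group of G = 4 sampled answers, scored by a rule-based reward
# (1.0 = correct, 0.0 = wrong).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight a PPO-style clipped policy-gradient objective; normalizing within the group is what removes the need for a separate value model.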