[2412.19437] DeepSeek-V3 Technical Report
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to researchers tracking frontier open-source model capabilities and compute governance: DeepSeek-V3 reports near-closed-source benchmark performance from only 2.788M H800 GPU hours of training, which bears on capability forecasting and training-cost assumptions.
Metadata
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Summary
This technical report introduces DeepSeek-V3, a 671B-parameter Mixture-of-Experts (MoE) language model with 37B parameters activated per token. It combines Multi-head Latent Attention (MLA) and DeepSeekMoE, both validated in DeepSeek-V2, with an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. Pre-trained on 14.8 trillion tokens and post-trained with Supervised Fine-Tuning and Reinforcement Learning, the model outperforms other open-source models and approaches leading closed-source models, while its full training required only 2.788M H800 GPU hours with no irrecoverable loss spikes or rollbacks.
Key Points
- 671B-parameter MoE model with only 37B parameters activated per token, enabling efficient inference at large total scale.
- Adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, both previously validated in DeepSeek-V2.
- Pioneers an auxiliary-loss-free strategy for load balancing and uses a multi-token prediction training objective for stronger performance.
- Pre-trained on 14.8 trillion tokens, then refined with Supervised Fine-Tuning and Reinforcement Learning; performance is comparable to leading closed-source models.
- Full training required only 2.788M H800 GPU hours, and the training run was stable: no irrecoverable loss spikes and no rollbacks.
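The auxiliary-loss-free load balancing mentioned above can be sketched as bias-adjusted top-k routing: each expert's affinity score receives a bias that is nudged up when the expert is under-loaded and down when over-loaded, and the bias influences only expert selection, not the gating weights. The tensor shapes, step size, and update rule below are illustrative assumptions, not the report's exact implementation:

```python
import numpy as np

def route_tokens(affinity, bias, top_k=8):
    """Select top_k experts per token from bias-adjusted affinity scores.
    The bias only steers routing; gating weights would still come from
    the raw affinities (not shown here)."""
    return np.argsort(-(affinity + bias), axis=1)[:, :top_k]

def update_bias(bias, chosen, n_experts, step=0.01):
    """Sign-based feedback: lower the bias of over-loaded experts,
    raise it for under-loaded ones (illustrative update rule)."""
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    return bias - step * np.sign(load - load.mean())

rng = np.random.default_rng(0)
n_tokens, n_experts = 1024, 64
# A fixed per-expert "popularity" skews routing so balancing has work to do.
popularity = rng.normal(scale=0.3, size=n_experts)

bias = np.zeros(n_experts)
loads = []
for _ in range(200):
    affinity = rng.normal(size=(n_tokens, n_experts)) + popularity
    chosen = route_tokens(affinity, bias)
    loads.append(np.bincount(chosen.ravel(), minlength=n_experts))
    bias = update_bias(bias, chosen, n_experts)
```

Over the iterations the bias compensates for the popularity skew and the spread of per-expert loads shrinks, without adding any auxiliary loss term to the training objective.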
2 FactBase facts citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| DeepSeek | Model Parameters | 671 billion | Dec 2024 |
| DeepSeek | Description | DeepSeek-V3 training cost approximately $5.58M using 2.788M H800 GPU hours over ~2 months on 2,048 H800 GPUs | Dec 2024 |
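The cost and duration figures in the table above are mutually consistent under the report's stated rental assumption of $2 per H800 GPU hour; this snippet only checks the arithmetic:

```python
# Sanity-check the training-cost arithmetic from the FactBase row.
gpu_hours = 2.788e6          # total H800 GPU hours for full training
price_per_gpu_hour = 2.00    # USD; rental price assumed in the report
num_gpus = 2048              # cluster size from the table

cost = gpu_hours * price_per_gpu_hour
wall_clock_days = gpu_hours / num_gpus / 24

print(f"estimated cost: ${cost / 1e6:.3f}M")       # ≈ $5.576M
print(f"wall-clock time: {wall_clock_days:.0f} days")  # ≈ 57 days, i.e. ~2 months
```

2.788M GPU hours at $2/hour gives $5.576M, matching the ~$5.58M figure, and spreading those hours over 2,048 GPUs gives roughly two months of wall-clock time.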
Cached Content Preview
DeepSeek-V3 Technical Report
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.
To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.
Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities.
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
In addition, its training process is remarkably stable.
Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Figure 1:
Benchmark performance of DeepSeek-V3 and its counterparts.
1 Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI).
Beyond closed-source models, open-source models, including DeepSeek series (DeepSeek-AI, 2024b, c; Guo et al., 2024; DeepSeek-AI, 2024a), LLaMA series (Touvron et al., 2023a, b; AI@Meta, 2024a, b), Qwen series (Qwen, 2023, 2024a, 2024b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts.
To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.
With a forward-looking perspective, we consistently strive for strong model performance and economical costs.
Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training.
These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference.
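The sparse-activation arithmetic behind the 671B-total / 37B-active figures above: only about 5.5% of parameters participate in any one token's forward pass, which is what makes inference economical relative to a dense model of the same total size. The 2-FLOPs-per-active-parameter estimate below is a standard rule of thumb, not a figure from the report:

```python
total_params = 671e9      # DeepSeek-V3 total parameter count
active_params = 37e9      # parameters activated per token

active_fraction = active_params / total_params
# Rule of thumb: ~2 FLOPs per active parameter per token for a forward pass.
flops_per_token = 2 * active_params

print(f"active fraction: {active_fraction:.1%}")        # ~5.5%
print(f"forward FLOPs per token: {flops_per_token:.2e}")  # ~7.4e10
```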
Beyond the basic architecture, we implement two additional strategies to further enhance th
... (truncated, 98 KB total)