[2412.19437] DeepSeek-V3 Technical Report
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to researchers tracking frontier open-source model capabilities and compute governance: DeepSeek-V3 reports near-closed-source benchmark performance from only 2.788M H800 GPU hours of training, which bears on capability forecasting and training-cost assumptions.
Metadata
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Summary
This technical report introduces DeepSeek-V3, a 671B-parameter Mixture-of-Experts (MoE) language model with 37B parameters activated per token. It combines Multi-head Latent Attention (MLA) and DeepSeekMoE, both validated in DeepSeek-V2, with an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. Pre-trained on 14.8 trillion tokens and post-trained with Supervised Fine-Tuning and Reinforcement Learning, the model outperforms other open-source models and approaches leading closed-source models, while its full training required only 2.788M H800 GPU hours with no irrecoverable loss spikes or rollbacks.
Key Points
- 671B-parameter MoE model with only 37B parameters activated per token, enabling efficient inference at large total scale.
- Adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, both previously validated in DeepSeek-V2.
- Pioneers an auxiliary-loss-free strategy for load balancing and uses a multi-token prediction training objective for stronger performance.
- Pre-trained on 14.8 trillion tokens, then refined with Supervised Fine-Tuning and Reinforcement Learning; performance is comparable to leading closed-source models.
- Full training required only 2.788M H800 GPU hours, and the training run was stable: no irrecoverable loss spikes and no rollbacks.
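The auxiliary-loss-free load balancing mentioned above can be sketched as bias-adjusted top-k routing: each expert's affinity score receives a bias that is nudged up when the expert is under-loaded and down when over-loaded, and the bias influences only expert selection, not the gating weights. The tensor shapes, step size, and update rule below are illustrative assumptions, not the report's exact implementation:

```python
import numpy as np

def route_tokens(affinity, bias, top_k=8):
    """Select top_k experts per token from bias-adjusted affinity scores.
    The bias only steers routing; gating weights would still come from
    the raw affinities (not shown here)."""
    return np.argsort(-(affinity + bias), axis=1)[:, :top_k]

def update_bias(bias, chosen, n_experts, step=0.01):
    """Sign-based feedback: lower the bias of over-loaded experts,
    raise it for under-loaded ones (illustrative update rule)."""
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    return bias - step * np.sign(load - load.mean())

rng = np.random.default_rng(0)
n_tokens, n_experts = 1024, 64
# A fixed per-expert "popularity" skews routing so balancing has work to do.
popularity = rng.normal(scale=0.3, size=n_experts)

bias = np.zeros(n_experts)
loads = []
for _ in range(200):
    affinity = rng.normal(size=(n_tokens, n_experts)) + popularity
    chosen = route_tokens(affinity, bias)
    loads.append(np.bincount(chosen.ravel(), minlength=n_experts))
    bias = update_bias(bias, chosen, n_experts)
```

Over the iterations the bias compensates for the popularity skew and the spread of per-expert loads shrinks, without adding any auxiliary loss term to the training objective.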
2 FactBase facts citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| DeepSeek | Model Parameters | 671 billion | Dec 2024 |
| DeepSeek | Description | DeepSeek-V3 training cost approximately $5.58M using 2.788M H800 GPU hours over ~2 months on 2,048 H800 GPUs | Dec 2024 |
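The cost and duration figures in the table above are mutually consistent under the report's stated rental assumption of $2 per H800 GPU hour; this snippet only checks the arithmetic:

```python
# Sanity-check the training-cost arithmetic from the FactBase row.
gpu_hours = 2.788e6          # total H800 GPU hours for full training
price_per_gpu_hour = 2.00    # USD; rental price assumed in the report
num_gpus = 2048              # cluster size from the table

cost = gpu_hours * price_per_gpu_hour
wall_clock_days = gpu_hours / num_gpus / 24

print(f"estimated cost: ${cost / 1e6:.3f}M")       # ≈ $5.576M
print(f"wall-clock time: {wall_clock_days:.0f} days")  # ≈ 57 days, i.e. ~2 months
```

2.788M GPU hours at $2/hour gives $5.576M, matching the ~$5.58M figure, and spreading those hours over 2,048 GPUs gives roughly two months of wall-clock time.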
Cached Content Preview
DeepSeek-V3 Technical Report
DeepSeek-AI
research@deepseek.com
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.
To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.
Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities.
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
In addition, its training process is remarkably stable.
Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Figure 1:
Benchmark performance of DeepSeek-V3 and its counterparts.
1 Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI).
Beyond closed-source models, open-source models, including DeepSeek series (DeepSeek-AI, 2024b, c; Guo et al., 2024; DeepSeek-AI, 2024a), LLaMA series (Touvron et al., 2023a, b; AI@Meta, 2024a, b), Qwen series (Qwen, 2023, 2024a, 2024b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts.
To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.
With a forward-looking perspective, we consistently strive for strong model performance and economical costs.
Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training.
These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference.
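The sparse-activation arithmetic behind the 671B-total / 37B-active figures above: only about 5.5% of parameters participate in any one token's forward pass, which is what makes inference economical relative to a dense model of the same total size. The 2-FLOPs-per-active-parameter estimate below is a standard rule of thumb, not a figure from the report:

```python
total_params = 671e9      # DeepSeek-V3 total parameter count
active_params = 37e9      # parameters activated per token

active_fraction = active_params / total_params
# Rule of thumb: ~2 FLOPs per active parameter per token for a forward pass.
flops_per_token = 2 * active_params

print(f"active fraction: {active_fraction:.1%}")        # ~5.5%
print(f"forward FLOPs per token: {flops_per_token:.2e}")  # ~7.4e10
```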
Beyond the basic architecture, we implement two additional strategies to further enhance th
... (truncated, 98 KB total)