
DeepSeek V4 Paper Full Version Released – FP4 QAT Details

DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

DeepSeek has released the full version of their DeepSeek V4 paper. The previous preview was 58 pages long; this new version contains a lot more technical depth.

Key Points

FP4 Quantization-Aware Training: They run FP4 QAT directly in the late stage of training. The MoE expert weights, which are the main consumer of GPU memory, are quantized to FP4, and the QK path in the CSA indexer uses FP4 activations. The result is a 2x speedup on the QK selector with 99.7% recall preserved, and inference runs directly on the FP4 weights.
Efficiency: The efficiency table is striking. Taking V3.2 as the baseline, V4-Pro and V4-Flash cut FLOPs to 27% and 10% of that baseline, respectively.
Training Stability: They document two fixes for the loss spike problem in their trillion-parameter MoE model. The first is anticipatory routing, which deliberately desynchronizes the main model updates from the router updates. The second is SwiGLU clamping, which imposes hard limits on both the linear path and the gate path of SwiGLU to suppress extreme values.
Generative Reward Model: Instead of using separate reward models for RLHF, they train a single model on scored data. That model learns to judge its own outputs and attach reasoning to each judgment, which reduces the amount of human labeling required.
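
For anyone who wants the SwiGLU clamping point above in concrete terms, here is a minimal Python sketch of how I read it. The function name `clamped_swiglu`, the symmetric clamp on both paths, and the `limit=10.0` default are my own assumptions for illustration. The summary above does not give the paper's actual thresholds, or say whether the gate is clamped on one side or both.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clamped_swiglu(gate, linear, limit=10.0):
    """SwiGLU with hard limits on both inputs.

    gate, linear: pre-activations of the gate path and the linear path.
    limit: hypothetical clamp threshold (assumption, not from the paper).
    """
    out = []
    for g, u in zip(gate, linear):
        g = clamp(g, -limit, limit)  # bound the gate path
        u = clamp(u, -limit, limit)  # bound the linear path
        out.append(silu(g) * u)
    return out


# Extreme pre-activations are bounded; ordinary ones pass through unchanged.
print(clamped_swiglu([1000.0, 0.5], [1000.0, 2.0]))
```

With the clamp in place, no output element can exceed `limit * limit` in magnitude however large the pre-activations get, which is the sense in which extreme values are suppressed.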

The Headline: FP4 QAT and Minimal Quality Degradation

The headline for me is the FP4 Quantization-Aware Training (QAT). If it generalizes, it could significantly change how training and inference costs are structured. The effect would be most noticeable in multi-agent setups, where a single task might spawn 5-10 model calls.
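
As a rough illustration of what FP4 QAT does in the forward pass, here is a pure-Python sketch of quantize-then-dequantize ("fake quantization") for one block of weights. Assumptions on my part: the E2M1 FP4 format (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), one shared scale per block, and round-to-nearest. The name `fake_quantize_fp4` is hypothetical, and the paper's actual format, block size, and scaling scheme may differ. In real QAT the backward pass would typically treat the rounding as identity (a straight-through estimator), which is not shown here.

```python
import math

# Non-negative magnitudes representable in E2M1 FP4 (assumed format).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(block):
    """Quantize-dequantize one block of weights with a shared scale.

    The scale maps the largest magnitude in the block onto 6.0, the
    largest FP4 value; every weight is then rounded to the nearest
    grid point and scaled back to its original range.
    """
    max_abs = max((abs(w) for w in block), default=0.0)
    if max_abs == 0.0:
        return list(block)  # all-zero block: nothing to quantize
    scale = max_abs / FP4_E2M1_GRID[-1]
    out = []
    for w in block:
        mag = abs(w) / scale
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, w))
    return out


weights = [6.0, 2.4, -0.2, 0.0]
print(fake_quantize_fp4(weights))
```

Training against these rounded weights is what would let inference run on the FP4 weights directly, since the model has already adapted to the rounding error.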

Human Evaluation Results

The human evaluation results look promising. On Chinese writing tasks, DeepSeek V4-Pro achieved a win rate of 62.7% against Gemini 3.1 pro, and 77.5% on writing quality specifically. In white-collar task evaluations covering more than 30 advanced tasks across 13 industries, DeepSeek V4-Pro-Max showed a non-loss rate of 63%, significantly outperforming the baseline model (Opus 4.6 max). In the coding agent evaluations, 52% of users felt that V4-Pro was ready to be their default coding model.

Paper Link in Comments

