
DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in‑depth overview of the DeepSeek LLM series (V1‑V3) and the R1 models, covering their architectures, scaling‑law experiments, data pipelines, and training strategies (including MoE, MLA, FP8 mixed precision, multi‑step learning‑rate scheduling, and reinforcement learning), along with extensive evaluation results and knowledge‑distillation techniques.

DataFunTalk

This article provides a comprehensive review of the DeepSeek large language model (LLM) series, including DeepSeek‑67B (V1), DeepSeek‑V2, DeepSeek‑V3, and the DeepSeek‑R1 family.

Model Versions

V1 (67B) uses a dense LLaMA‑2‑style architecture pre‑trained on 2 trillion bilingual tokens, and outperforms LLaMA‑2 70B after SFT and DPO fine‑tuning.

V2 (236B) adopts a Mixture‑of‑Experts (MoE) design with 160 routing experts, introduces Multi‑Head Latent Attention (MLA) and DeepSeekMoE, and reduces training cost by 42.5% while expanding the context window to 128K tokens.

V3 (671B) expands MoE to 256 routing experts, adds auxiliary‑loss‑free load balancing, multi‑token prediction (MTP), and FP8 mixed‑precision training, achieving state‑of‑the‑art results on benchmarks comparable to GPT‑4o and Claude‑3.5‑Sonnet.

The R1 series focuses on reasoning capability: R1‑Zero is trained purely with reinforcement learning and rule‑based rewards, without any supervised data, while R1 adds a cold‑start SFT phase before RL and a second stage that aligns helpfulness and safety.

Key Architectural Innovations

Grouped‑Query Attention (GQA) replaces multi‑head attention to reduce KV‑cache memory.
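The memory saving is straightforward to quantify: the KV cache scales with the number of key/value heads, so sharing each KV head across a group of query heads shrinks the cache proportionally. A minimal sketch with illustrative shapes (the layer/head counts below are assumptions, not DeepSeek's published config):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # K and V each store layers * kv_heads * head_dim values per cached token,
    # hence the leading factor of 2; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative dense-model shapes (assumed for the sketch):
layers, heads, head_dim = 80, 64, 128
mha = kv_cache_bytes(layers, kv_heads=heads, head_dim=head_dim, seq_len=4096, batch=1)
gqa = kv_cache_bytes(layers, kv_heads=8, head_dim=head_dim, seq_len=4096, batch=1)
print(f"MHA cache: {mha / 2**30:.2f} GiB, GQA (8 KV heads): {gqa / 2**30:.2f} GiB")
# With these toy shapes: 10.00 GiB for MHA vs 1.25 GiB for 8 KV-head GQA
```

With 64 query heads grouped onto 8 KV heads, the cache shrinks 8×; the query-side compute is unchanged.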

MLA compresses attention keys and values into a low‑rank latent vector, cutting the KV cache by 93.3% and boosting maximum generation throughput by up to 5.76×.
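The core of MLA can be sketched in a few lines of numpy: only a small latent vector per token is cached, and per-head keys/values are reconstructed by up-projection when attention is computed. All sizes below are toy assumptions, the weights are random, and the real design additionally carries a decoupled RoPE key path that this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, d_latent = 1024, 8, 128, 64  # toy sizes (assumed)

W_dkv = rng.normal(size=(d_model, d_latent)) * 0.02           # down-projection
W_uk = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02  # up-projection to K
W_uv = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02  # up-projection to V

h = rng.normal(size=(16, d_model))   # hidden states of 16 already-seen tokens
c_kv = h @ W_dkv                     # the ONLY tensor that goes into the cache
k = (c_kv @ W_uk).reshape(16, n_heads, head_dim)  # reconstructed at attention time
v = (c_kv @ W_uv).reshape(16, n_heads, head_dim)

plain_cache = 2 * n_heads * head_dim  # per-token floats for full K and V
mla_cache = d_latent                  # per-token floats for the latent
print(f"per-token cache: {plain_cache} -> {mla_cache} floats "
      f"({1 - mla_cache / plain_cache:.1%} smaller)")
```

Even with these toy dimensions the per-token cache drops from 2048 to 64 floats; the reported 93.3% figure comes from DeepSeek-V2's actual dimensions.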

DeepSeekMoE separates shared and routing experts, improving parameter efficiency.
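The shared/routed split can be sketched as follows: shared experts process every token unconditionally, while a gate picks a sparse top-k subset of routed experts. Expert and gate shapes here are toy assumptions (single linear maps standing in for FFN experts), and the real models use many more experts:

```python
import numpy as np

def make_expert(d, rng):
    """A toy expert: one random linear map (stands in for an FFN)."""
    W = rng.normal(size=(d, d)) * 0.1
    return lambda x: x @ W

def moe_layer(x, shared, routed, gate_w, top_k=2):
    """DeepSeekMoE-style layer sketch: shared experts always fire;
    a softmax-gated top-k subset of routed experts is added on top."""
    scores = x @ gate_w                   # affinity with each routed expert
    top = np.argsort(scores)[-top_k:]     # indices of the top-k routed experts
    g = np.exp(scores[top])
    g /= g.sum()                          # renormalised gate weights
    out = sum(e(x) for e in shared)       # shared experts: ungated
    for w, i in zip(g, top):
        out = out + w * routed[i](x)      # sparse routed contribution
    return out

rng = np.random.default_rng(0)
d = 8
shared = [make_expert(d, rng) for _ in range(2)]  # 2 shared experts (toy)
routed = [make_expert(d, rng) for _ in range(8)]  # 8 routed here; V3 uses 256
gate_w = rng.normal(size=(d, 8))
y = moe_layer(rng.normal(size=d), shared, routed, gate_w)
```

The design intuition: shared experts absorb common knowledge so that routed experts can specialise, which improves parameter efficiency at a fixed activated-parameter budget.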

A multi‑step learning‑rate scheduler replaces cosine decay, so checkpoints from the first 80% of training can be reused across continual‑training and scaling experiments.
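A minimal sketch of such a schedule: linear warmup, then a constant peak rate that drops in two discrete steps late in training. The decay points (80%/90%) and factors (31.6%/10% of peak) follow the description in the DeepSeek LLM report; the warmup length and concrete numbers here are illustrative:

```python
def multi_step_lr(step, total_steps, max_lr, warmup=2000):
    """Multi-step schedule sketch: linear warmup, then piecewise-constant
    decay at 80% and 90% of training. Decay factors (0.316, 0.1) are taken
    from the DeepSeek LLM description; warmup length is an assumption."""
    if step < warmup:
        return max_lr * step / warmup      # linear warmup
    if step < 0.8 * total_steps:
        return max_lr                      # long constant peak phase
    if step < 0.9 * total_steps:
        return max_lr * 0.316              # first step-down
    return max_lr * 0.1                    # final step-down

total = 100_000
print(multi_step_lr(50_000, total, 3e-4))  # still in the peak phase
```

Because the learning rate is flat for the first 80% of steps, a checkpoint taken there is valid as a starting point for runs with different total budgets, which is what makes reuse across scaling experiments possible.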

FP8 mixed‑precision training and DualPipe pipeline reduce compute cost to 2.788 M H800 GPU‑hours.

Training and Evaluation

Pre‑training on up to 14.8 trillion high‑quality tokens, with multilingual data and extended context via YaRN (4K → 32K → 128K).

Supervised fine‑tuning (SFT) on 1.5 M multi‑domain examples, followed by reinforcement learning (GRPO) with rule‑based and model‑based reward models.
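GRPO's distinguishing feature is that it needs no learned value network: for each prompt it samples a group of completions and normalises each reward against the group's mean and standard deviation to get advantages. A minimal sketch of that advantage computation (the reward values are illustrative, and details such as the exact std estimator are assumptions):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalise each sampled
    completion's reward by the group mean and std (no value network).
    Population std is an assumption of this sketch."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# 4 completions sampled for one prompt; binary rule-based rewards
# (e.g. answer correct / incorrect), values illustrative:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

These per-completion advantages then weight a clipped policy-gradient objective, in the same way PPO uses critic-based advantages.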

Extensive benchmark suite covering knowledge (MMLU, GPQA), technical ability (MATH, LiveCodeBench), and open‑ended evaluation (AlpacaEval 2.0), where DeepSeek‑V3 matches or exceeds top open‑source and closed‑source models.

Distillation Experiments

Knowledge distillation from DeepSeek‑R1 to smaller open‑source models (Qwen, LLaMA) using ~800 K reasoning samples improves small‑model performance dramatically.
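The distillation recipe is plain SFT on teacher outputs: each R1-generated reasoning trace is packed into a chat-style training example for the student. A sketch of one such record (the field names and the `<think>` tag convention here are assumptions for illustration, not a documented format):

```python
import json

def make_sft_record(question, reasoning, answer):
    """Pack one teacher-generated reasoning trace into a chat-style SFT
    example (schema and <think> delimiter are assumptions of this sketch)."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant",
             "content": f"<think>\n{reasoning}\n</think>\n{answer}"},
        ]
    }

rec = make_sft_record("What is 12 * 7?", "12 * 7 = 84.", "84")
print(json.dumps(rec, indent=2))
```

Fine-tuning Qwen and LLaMA students on ~800 K such records transfers much of the teacher's reasoning behaviour without any RL on the student side.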

Open‑source project Open‑R1 demonstrates the pipeline for generating high‑quality reasoning data and applying it to various model sizes.

The article concludes with a discussion of remaining challenges in R1, such as general capability gaps, language mixing, prompt sensitivity, and software‑engineering tasks, and provides references to open‑source implementations for reproducibility.

Tags: Large Language Models, Mixture of Experts, Reinforcement Learning, Scaling Laws, AI Research, Model Distillation
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
