
DeepSeek LLM Series (V1‑V3) and R1: Architecture, Training Strategies, Evaluation, and Distillation

An in‑depth overview of the DeepSeek LLM series (V1‑V3) and the R1 models, covering their architectures, scaling‑law experiments, data pipelines, and training strategies (including MoE, MLA, FP8 mixed precision, multi‑step learning‑rate scheduling, and reinforcement learning), along with extensive evaluation results and knowledge‑distillation techniques.

DataFunTalk

This article provides a comprehensive review of the DeepSeek large language model (LLM) series, including DeepSeek‑67B (V1), DeepSeek‑V2, DeepSeek‑V3, and the DeepSeek‑R1 family.

Model Versions

V1 (67B) uses a dense LLaMA‑2‑style architecture pre‑trained on 2 trillion bilingual tokens, and outperforms LLaMA‑2 70B after SFT and DPO fine‑tuning.

V2 (236B) adopts a Mixture‑of‑Experts (MoE) design with 160 routing experts, introduces Multi‑Head Latent Attention (MLA) and DeepSeekMoE, and reduces training cost by 42.5% while expanding the context window to 128K tokens.

V3 (671B) expands MoE to 256 routing experts, adds auxiliary‑loss‑free load balancing, multi‑token prediction (MTP), and FP8 mixed‑precision training, achieving state‑of‑the‑art results on benchmarks comparable to GPT‑4o and Claude‑3.5‑Sonnet.

The R1 series focuses on reasoning capability: R1‑Zero is trained purely with reinforcement learning and rule‑based rewards, without any supervised data, while R1 adds a cold‑start SFT phase before RL and a second stage that aligns helpfulness and safety.

Key Architectural Innovations

Grouped‑Query Attention (GQA) replaces multi‑head attention to reduce KV‑cache memory.
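The memory saving is straightforward to quantify: the KV cache scales with the number of key/value heads, so sharing each KV head across a group of query heads shrinks the cache proportionally. A minimal sketch with illustrative shapes (the layer/head counts below are assumptions, not DeepSeek's published config):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # K and V each store layers * kv_heads * head_dim values per cached token,
    # hence the leading factor of 2; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative dense-model shapes (assumed for the sketch):
layers, heads, head_dim = 80, 64, 128
mha = kv_cache_bytes(layers, kv_heads=heads, head_dim=head_dim, seq_len=4096, batch=1)
gqa = kv_cache_bytes(layers, kv_heads=8, head_dim=head_dim, seq_len=4096, batch=1)
print(f"MHA cache: {mha / 2**30:.2f} GiB, GQA (8 KV heads): {gqa / 2**30:.2f} GiB")
# With these toy shapes: 10.00 GiB for MHA vs 1.25 GiB for 8 KV-head GQA
```

With 64 query heads grouped onto 8 KV heads, the cache shrinks 8×; the query-side compute is unchanged.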

MLA compresses attention keys and values into a low‑rank latent vector, cutting the KV cache by 93.3% and boosting maximum generation throughput by up to 5.76×.
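The core of MLA can be sketched in a few lines of numpy: only a small latent vector per token is cached, and per-head keys/values are reconstructed by up-projection when attention is computed. All sizes below are toy assumptions, the weights are random, and the real design additionally carries a decoupled RoPE key path that this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, d_latent = 1024, 8, 128, 64  # toy sizes (assumed)

W_dkv = rng.normal(size=(d_model, d_latent)) * 0.02           # down-projection
W_uk = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02  # up-projection to K
W_uv = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02  # up-projection to V

h = rng.normal(size=(16, d_model))   # hidden states of 16 already-seen tokens
c_kv = h @ W_dkv                     # the ONLY tensor that goes into the cache
k = (c_kv @ W_uk).reshape(16, n_heads, head_dim)  # reconstructed at attention time
v = (c_kv @ W_uv).reshape(16, n_heads, head_dim)

plain_cache = 2 * n_heads * head_dim  # per-token floats for full K and V
mla_cache = d_latent                  # per-token floats for the latent
print(f"per-token cache: {plain_cache} -> {mla_cache} floats "
      f"({1 - mla_cache / plain_cache:.1%} smaller)")
```

Even with these toy dimensions the per-token cache drops from 2048 to 64 floats; the reported 93.3% figure comes from DeepSeek-V2's actual dimensions.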

DeepSeekMoE separates shared and routing experts, improving parameter efficiency.
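The shared/routed split can be sketched as follows: shared experts process every token unconditionally, while a gate picks a sparse top-k subset of routed experts. Expert and gate shapes here are toy assumptions (single linear maps standing in for FFN experts), and the real models use many more experts:

```python
import numpy as np

def make_expert(d, rng):
    """A toy expert: one random linear map (stands in for an FFN)."""
    W = rng.normal(size=(d, d)) * 0.1
    return lambda x: x @ W

def moe_layer(x, shared, routed, gate_w, top_k=2):
    """DeepSeekMoE-style layer sketch: shared experts always fire;
    a softmax-gated top-k subset of routed experts is added on top."""
    scores = x @ gate_w                   # affinity with each routed expert
    top = np.argsort(scores)[-top_k:]     # indices of the top-k routed experts
    g = np.exp(scores[top])
    g /= g.sum()                          # renormalised gate weights
    out = sum(e(x) for e in shared)       # shared experts: ungated
    for w, i in zip(g, top):
        out = out + w * routed[i](x)      # sparse routed contribution
    return out

rng = np.random.default_rng(0)
d = 8
shared = [make_expert(d, rng) for _ in range(2)]  # 2 shared experts (toy)
routed = [make_expert(d, rng) for _ in range(8)]  # 8 routed here; V3 uses 256
gate_w = rng.normal(size=(d, 8))
y = moe_layer(rng.normal(size=d), shared, routed, gate_w)
```

The design intuition: shared experts absorb common knowledge so that routed experts can specialise, which improves parameter efficiency at a fixed activated-parameter budget.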

A multi‑step learning‑rate scheduler replaces cosine decay, so checkpoints from the first 80% of training can be reused across continual‑training and scaling experiments.
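A minimal sketch of such a schedule: linear warmup, then a constant peak rate that drops in two discrete steps late in training. The decay points (80%/90%) and factors (31.6%/10% of peak) follow the description in the DeepSeek LLM report; the warmup length and concrete numbers here are illustrative:

```python
def multi_step_lr(step, total_steps, max_lr, warmup=2000):
    """Multi-step schedule sketch: linear warmup, then piecewise-constant
    decay at 80% and 90% of training. Decay factors (0.316, 0.1) are taken
    from the DeepSeek LLM description; warmup length is an assumption."""
    if step < warmup:
        return max_lr * step / warmup      # linear warmup
    if step < 0.8 * total_steps:
        return max_lr                      # long constant peak phase
    if step < 0.9 * total_steps:
        return max_lr * 0.316              # first step-down
    return max_lr * 0.1                    # final step-down

total = 100_000
print(multi_step_lr(50_000, total, 3e-4))  # still in the peak phase
```

Because the learning rate is flat for the first 80% of steps, a checkpoint taken there is valid as a starting point for runs with different total budgets, which is what makes reuse across scaling experiments possible.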

FP8 mixed‑precision training and DualPipe pipeline reduce compute cost to 2.788 M H800 GPU‑hours.

Training and Evaluation

Pre‑training on up to 14.8 trillion high‑quality tokens, with multilingual data and extended context via YaRN (4K → 32K → 128K).

Supervised fine‑tuning (SFT) on 1.5 M multi‑domain examples, followed by reinforcement learning (GRPO) with rule‑based and model‑based reward models.
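GRPO's distinguishing feature is that it needs no learned value network: for each prompt it samples a group of completions and normalises each reward against the group's mean and standard deviation to get advantages. A minimal sketch of that advantage computation (the reward values are illustrative, and details such as the exact std estimator are assumptions):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalise each sampled
    completion's reward by the group mean and std (no value network).
    Population std is an assumption of this sketch."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# 4 completions sampled for one prompt; binary rule-based rewards
# (e.g. answer correct / incorrect), values illustrative:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

These per-completion advantages then weight a clipped policy-gradient objective, in the same way PPO uses critic-based advantages.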

Extensive benchmark suite covering knowledge (MMLU, GPQA), technical ability (MATH, LiveCodeBench), and open‑ended evaluation (AlpacaEval 2.0), where DeepSeek‑V3 matches or exceeds top open‑source and closed‑source models.

Distillation Experiments

Knowledge distillation from DeepSeek‑R1 to smaller open‑source models (Qwen, LLaMA) using ~800 K reasoning samples improves small‑model performance dramatically.
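The distillation recipe is plain SFT on teacher outputs: each R1-generated reasoning trace is packed into a chat-style training example for the student. A sketch of one such record (the field names and the `<think>` tag convention here are assumptions for illustration, not a documented format):

```python
import json

def make_sft_record(question, reasoning, answer):
    """Pack one teacher-generated reasoning trace into a chat-style SFT
    example (schema and <think> delimiter are assumptions of this sketch)."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant",
             "content": f"<think>\n{reasoning}\n</think>\n{answer}"},
        ]
    }

rec = make_sft_record("What is 12 * 7?", "12 * 7 = 84.", "84")
print(json.dumps(rec, indent=2))
```

Fine-tuning Qwen and LLaMA students on ~800 K such records transfers much of the teacher's reasoning behaviour without any RL on the student side.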

Open‑source project Open‑R1 demonstrates the pipeline for generating high‑quality reasoning data and applying it to various model sizes.

The article concludes with a discussion of remaining challenges in R1, such as general capability gaps, language mixing, prompt sensitivity, and software‑engineering tasks, and provides references to open‑source implementations for reproducibility.

Tags: Large Language Models, Mixture of Experts, Reinforcement Learning, Scaling Laws, AI Research, Model Distillation
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
