
DeepSeek‑R1: From Zero to Full‑Featured AI Model via Cold‑Start Data and Multi‑Stage Training

The article explains how DeepSeek‑R1 improves upon the Zero version by introducing expert‑crafted cold‑start data and a four‑phase multi‑stage training pipeline, resulting in markedly better reasoning, coding, and general knowledge performance across benchmark tests.


In the previous article we examined DeepSeek‑R1‑Zero, an AI reasoning model that learns through reinforcement learning (RL) without supervised data, but suffers from poor readability and language mixing.

To address these issues, the DeepSeek team released DeepSeek‑R1, an upgraded model that adds "cold‑start data" and a multi‑stage training process, preserving the strong reasoning of R1‑Zero while substantially improving overall performance and stability.

R1's "upgrade secret": Cold‑Start Data + Multi‑Stage Training

Cold‑start data acts like a seasoned mentor, providing carefully designed chain‑of‑thought (CoT) examples that teach the model sound reasoning steps, much like worked textbook problems. This helps DeepSeek‑R1 adopt good reasoning habits early, avoiding the blind random exploration of R1‑Zero.

1. "Cold‑Start Data": Expert‑crafted examples that give the model a head start

These examples are not chosen at random; they are human‑expert‑written samples that demonstrate the best problem‑solving strategies, such as drawing a diagram for a geometry problem or analyzing the question first for a word problem.
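A cold‑start record of this kind can be pictured as a prompt paired with an expert‑written chain of thought and a short summary. The field names and `<think>`/`<answer>` markers below are illustrative assumptions, not DeepSeek's actual schema:

```python
# Hypothetical cold-start SFT record: field names and markers are made up
# for illustration, not DeepSeek's published data format.
cold_start_example = {
    "prompt": "A triangle has sides 3, 4, and 5. What is its area?",
    "chain_of_thought": (
        "First check whether the triangle is right-angled: "
        "3^2 + 4^2 = 9 + 16 = 25 = 5^2, so it is. "
        "For a right triangle the two legs serve as base and height, "
        "so area = (3 * 4) / 2 = 6."
    ),
    "summary": "The triangle is right-angled, so its area is 6.",
}

def to_training_text(example: dict) -> str:
    """Flatten a record into one training string, with markers separating
    the reasoning process from the final summary."""
    return (
        f"Question: {example['prompt']}\n"
        f"<think>{example['chain_of_thought']}</think>\n"
        f"<answer>{example['summary']}</answer>"
    )

print(to_training_text(cold_start_example))
```

Fine‑tuning on strings shaped like this is what gives the model its "reasoning posture" before any RL begins.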

2. "Multi‑Stage Training": Gradual, staged development of a full‑stack AI

The training is divided into four phases, each strengthening a different capability:

Stage 1 – Foundation (Cold‑Start SFT): Supervised fine‑tuning on the cold‑start data builds basic reasoning skills, akin to elementary school learning.

Stage 2 – Hard Problems (Reasoning‑Oriented RL): Reinforcement learning refines the model on more complex tasks and introduces language‑consistency rewards to prevent language mixing.

Stage 3 – Knowledge Expansion (Rejection Sampling + SFT): The model learns to generate articles and answer diverse questions; rejection sampling selects the best outputs for efficient training.

Stage 4 – Full‑Scene RL: Diverse reward signals train the model for real‑world applicability, improving usefulness and safety.
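Several of the mechanisms above can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than DeepSeek's published implementation: a language‑consistency reward (Stage 2) scoring the fraction of chain‑of‑thought tokens in the target language, rejection sampling (Stage 3) keeping only the best‑scoring candidate, and a weighted blend of reward signals (Stage 4) with made‑up weights:

```python
import random

def language_consistency_reward(cot_tokens, is_target_language) -> float:
    """Stage 2 (illustrative): fraction of chain-of-thought tokens written in
    the target language; rewarding this discourages language mixing."""
    if not cot_tokens:
        return 0.0
    return sum(1 for t in cot_tokens if is_target_language(t)) / len(cot_tokens)

def rejection_sample(prompt, generate, score, n_samples=8, threshold=0.5):
    """Stage 3 (illustrative): draw several candidate answers and keep the
    top-scoring one only if it clears a quality threshold; otherwise None."""
    best = max((generate(prompt) for _ in range(n_samples)), key=score)
    return best if score(best) >= threshold else None

def combined_reward(accuracy, language_consistency, helpfulness, harmlessness,
                    weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Stage 4 (illustrative): weighted blend of diverse reward signals;
    the weights here are invented, not a published configuration."""
    signals = (accuracy, language_consistency, helpfulness, harmlessness)
    return sum(w * s for w, s in zip(weights, signals))

# Toy checks. Treat ASCII-only tokens as "English":
is_english = lambda tok: tok.isascii()
print(language_consistency_reward(
    ["First", "compute", "面积", "then", "sum"], is_english))  # 4/5 = 0.8

random.seed(0)
kept = rejection_sample("toy prompt", lambda _p: random.random(),
                        score=lambda x: x)
print(kept is not None)  # the best of 8 draws clears the 0.5 threshold

print(round(combined_reward(1.0, 0.8, 0.9, 1.0), 2))  # 0.94
```

In practice each of these signals would come from learned reward models or rule‑based checkers, but the shape of the computation is this simple.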

3. Performance Highlights

After this intensive training, DeepSeek‑R1 achieves impressive results:

Reasoning: Pass@1 of 79.8% on AIME 2024, slightly surpassing OpenAI‑o1‑1217; 97.3% on MATH‑500.

Coding: Outperforms 96.3% of human participants on Codeforces.

Knowledge: Strong scores on MMLU and GPQA Diamond.

Open‑ended generation: Significant gains on AlpacaEval 2.0 and Arena‑Hard, producing fluent, human‑like responses.
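Pass@1, the metric quoted above, is commonly estimated by sampling several answers per problem and averaging correctness. A minimal sketch of that estimator (the exact benchmark harness may differ):

```python
def pass_at_1(correct_flags) -> float:
    """Estimate Pass@1: for each problem, the fraction of its k sampled
    answers that are correct, averaged over all problems."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_problem) / len(per_problem)

# Toy example: two problems, four sampled answers each.
flags = [
    [True, True, False, True],    # 3/4 correct
    [True, False, False, False],  # 1/4 correct
]
print(pass_at_1(flags))  # (0.75 + 0.25) / 2 = 0.5
```

Averaging over k samples rather than judging a single greedy answer gives a lower‑variance estimate of single‑attempt accuracy.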

Conclusion: DeepSeek‑R1’s Transformative Leap

DeepSeek‑R1 inherits the powerful reasoning of its predecessor while overcoming its shortcomings through cold‑start data and multi‑stage training, evolving from a "specialist prodigy" into a "well‑rounded AI scholar" capable of reasoning, coding, writing, and broader real‑world assistance.

These innovations open new possibilities for the future development of reasoning models.

Tags: DeepSeek · model evaluation · AI inference · reinforcement learning · cold-start data · multi-stage training
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
