
DeepSeek R1: Concept Overview, Training Principles, and Practical Implementations

This article introduces the DeepSeek family of models, explains the concepts of online search and deep reasoning, details the two‑phase training pipeline with data augmentation and reinforcement learning, and showcases practical experiments and deployment examples for the R1 and distilled variants.


1. Concept Overview (Beginner Friendly)

The DeepSeek website offers two buttons beneath the chat box: one for online search and one for deep reasoning (R1). Online search addresses the timeliness limitation of large language models (LLMs), whose training data may be six months to a year old, while deep reasoning (R1) focuses on strong inference capabilities under limited compute.

Model Variants

DeepSeek refers to the whole series. DeepSeek V3 is the latest base dialogue model (671B parameters, requiring ~1,300 GB of GPU memory). DeepSeek R1 is a reasoning model praised for superior inference accuracy, though its reasoning traces are longer. R1-Zero is an experimental predecessor of R1. DeepSeek-R1-Distill-Qwen-xxB models are knowledge-distilled versions trained on the 800k intermediate samples generated during R1's training.
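The ~1,300 GB figure can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming FP16 weights (2 bytes per parameter) and counting model weights only (no KV cache, activations, or optimizer state):

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough GPU memory for model weights alone, in GiB.

    Assumes a uniform precision across all parameters; real deployments
    mix precisions and add KV-cache and activation overhead on top.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# 671B parameters at FP16 -> roughly 1,250 GiB of weights,
# consistent with the ~1,300 GB figure once runtime overhead is added.
print(round(weight_memory_gb(671)))
```

The same function also explains why 4-bit distilled variants fit on a single GPU: a 32B model at 0.5 bytes per parameter needs only ~15 GiB for weights.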

2. Training Principles

The training of DeepSeek R1 follows a two‑stage iterative optimization:

Phase 1 – CoT Data Quality Improvement

Starting from DeepSeek V3 Base, apply supervised fine-tuning (SFT) on initial chain-of-thought (CoT) data, then reinforcement learning (RL) to produce Model RL-1, which generates higher-quality CoT data for the next phase.

Phase 2 – Clean Base Re-training

Return to the original DeepSeek V3 Base, mix the newly generated high-quality CoT data with non-reasoning data from V3 to prevent forgetting, and perform additional SFT epochs followed by two RL stages: the first strengthens reasoning (using rule-based reward models), the second improves helpfulness and harmlessness.
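The rule-based reward models mentioned above score outputs with deterministic checks rather than a learned critic. A minimal sketch, where the specific answer convention (a LaTeX \boxed{} final answer), the <think> tag format, and the 0.5 weighting are illustrative assumptions, not the published reward specification:

```python
import re

def format_reward(output: str) -> float:
    # Reward outputs that wrap their reasoning in <think>...</think> tags.
    return 1.0 if re.search(r"<think>.+?</think>", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, gold: str) -> float:
    # Extract a final \boxed{...} answer and compare it to the reference.
    m = re.search(r"\\boxed\{([^}]*)\}", output)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def rule_based_reward(output: str, gold: str) -> float:
    # Combined signal: correctness dominates, format compliance adds a bonus.
    return accuracy_reward(output, gold) + 0.5 * format_reward(output)
```

Because both checks are cheap string operations, rewards like these scale to millions of RL rollouts without a separate reward-model forward pass, and they cannot be gamed the way a learned reward model can.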

Core Training Techniques

Iterative data augmentation: each stage generates better data for the next.

Base model reset: always start from a clean base to avoid error accumulation.

Forgetting mitigation: mix logical and non‑logical data to retain multi‑task abilities.
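The forgetting-mitigation step amounts to controlling the sampling ratio between reasoning and general-purpose data when building each SFT epoch. A minimal sketch, where the 75/25 split and dataset names are illustrative assumptions rather than the published recipe:

```python
import random

def mix_datasets(reasoning_data: list, general_data: list,
                 reasoning_frac: float = 0.75, n: int = 1000,
                 seed: int = 0) -> list:
    """Build a training mix: mostly new CoT data, plus general (non-reasoning)
    data from the base model's corpus to limit catastrophic forgetting."""
    rng = random.Random(seed)
    k = int(n * reasoning_frac)
    mix = rng.choices(reasoning_data, k=k) + rng.choices(general_data, k=n - k)
    rng.shuffle(mix)  # interleave so batches see both data types
    return mix
```

Keeping a fixed fraction of non-reasoning samples in every epoch is what preserves the base model's multi-task abilities while the CoT data sharpens its reasoning.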

3. Technical Value Reflections

R1-Zero demonstrates that strong reasoning can emerge from pure RL without explicit CoT data, showing self-evolution toward longer responses and self-reflection. Scaling RL steps yields longer, more reflective outputs. For smaller models, knowledge distillation from a strong teacher can match or exceed the gains of applying RL directly.

4. Practical Projects and Experiments

High-School Math Test: Evaluated several models (Claude 3.5, o1-preview, Qwen2.5-Math-72B, DeepSeek-R1, DeepSeek-R1-Distill-Qwen-32B) on 19 questions from the 2024 Gaokao mathematics exam.

Deepscaler: The UC Berkeley team fine-tuned a distilled 1.5B model (DeepScaleR-1.5B-Preview) with simple RL, achieving 43.1% Pass@1 on the AIME 2024 benchmark and surpassing OpenAI o1-preview despite the small parameter count.
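For readers unfamiliar with the metric: Pass@1 is the probability that a single sampled completion is correct, usually estimated from many samples per problem. A minimal sketch of the standard unbiased pass@k estimator (n samples drawn, c of them correct):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    randomly chosen samples (out of n drawn, c correct) is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the simple fraction c/n:
# e.g. 4 correct out of 10 samples -> pass@1 = 0.4
```

A benchmark score like 43.1% Pass@1 is then the mean of this estimate over all problems in the test set.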

Logic-RL: A Chinese university group reproduced R1-style results on a logic-puzzle dataset, showing that after three-stage rule-based RL the model learned hesitation, multi-path exploration, backtracking, staged summarization, and final answer verification.

Open R1: The Hugging Face team released a fully open-source reproduction of DeepSeek-R1, filling in previously undisclosed technical details.

5. Local Deployment and Usage

In the RAG-enhanced chatbot 5starAI, the DeepSeek-R1-Distill-Qwen-32B-4bit model was deployed via vLLM, generating ~50 tokens/s. Integration was done within the LlamaIndex framework; the original post's code snippets (published as images) illustrated the setup and sample outputs.
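Since the original code snippets were published as images, here is a minimal sketch of what a request to such a deployment looks like. vLLM serves models behind an OpenAI-compatible HTTP API; the model name, temperature, and token limit below are illustrative assumptions, not the 5starAI configuration:

```python
import json

def build_chat_request(question: str,
                       model: str = "DeepSeek-R1-Distill-Qwen-32B-4bit") -> str:
    """Build an OpenAI-compatible chat-completions payload, as accepted by
    a locally served vLLM endpoint (e.g. POST /v1/chat/completions)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.6,   # moderate sampling; exact value is an assumption
        "max_tokens": 4096,   # R1-style reasoning traces can be long
    }
    return json.dumps(payload)
```

In a LlamaIndex-based app the same endpoint is typically wrapped as an OpenAI-like LLM object, so the rest of the RAG pipeline stays unchanged.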

Planned RL training involves using DeepSeek-R1-Distill-Qwen-1.5B as a base for text-to-SQL tasks, leveraging prior datasets and experience from the Deepscaler project.

Tags: LLM · DeepSeek · model training · reinforcement learning · knowledge distillation · R1
Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
