DeepSeek | BestHub

Collection size

2 articles

Architect

Feb 25, 2025 · Artificial Intelligence

DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

DeepSeek R1 builds on the DeepSeek V3 base model with a multi‑stage training pipeline that combines GRPO‑based reinforcement learning, rule‑based reward modeling, supervised fine‑tuning, language‑consistency rewards, rejection sampling, and distillation. The result is a high‑performing, aligned LLM capable of accurate reasoning.

DeepSeek · LLM Training · Model Distillation

24 min read

DataFunTalk

Feb 16, 2025 · Artificial Intelligence

Understanding Reasoning LLMs: DeepSeek R1 Variants, Inference‑Time Scaling, and Training Strategies

This article explains what reasoning language models are and outlines their strengths and weaknesses. It details DeepSeek R1's three variants and their training pipelines (pure reinforcement learning, SFT + RL, and distillation), and discusses inference‑time scaling techniques and related research such as Sky‑T1 and TinyZero.

DeepSeek · Model Distillation · Supervised Fine-tuning

16 min read
