
DeepSeek-R1: Enhancing Reasoning Capabilities in LLMs via Reinforcement Learning

DeepSeek‑R1 demonstrates that large‑scale reinforcement learning, built on the novel Group Relative Policy Optimization (GRPO) algorithm and a rule‑based reward scheme, can markedly boost reasoning in LLMs without heavy supervised fine‑tuning. A brief cold‑start SFT phase, two‑stage RL alignment, and knowledge distillation further improve performance and efficiency, though challenges such as language mixing remain.

Tencent Technical Engineering

The article discusses the DeepSeek-R1 series of large language models, focusing on how reasoning capabilities can be enhanced through large-scale reinforcement learning (RL) without relying on extensive supervised fine‑tuning (SFT). It introduces DeepSeek‑R1‑Zero, a model trained purely via RL using Group Relative Policy Optimization (GRPO), which eliminates the need for a separate critic model and thereby reduces training cost.
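The core idea behind GRPO's critic-free design is that the advantage of each sampled completion is computed relative to its own group of samples for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative normalization (the epsilon term is an implementation detail added here for numerical safety, not taken from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: each sampled
    completion's reward is normalized against the mean and standard
    deviation of its own group, replacing a learned critic baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    eps = 1e-8  # guard against a zero-variance group
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, if four sampled answers to one prompt score `[1.0, 0.0, 1.0, 0.0]` under a rule-based reward, the two correct answers receive positive advantages and the two incorrect ones negative, with the group itself serving as the baseline.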

The work shows that even a small amount of SFT for cold‑start can further improve performance. DeepSeek‑R1 builds on this by adding a cold‑start phase with high‑quality chain‑of‑thought data before large‑scale RL, yielding better readability and reasoning performance.

Key technical components include GRPO for efficient RL, a rule‑based reward system combining accuracy and format rewards, language‑consistency rewards to mitigate multilingual mixing, rejection sampling for generating SFT data, and a two‑stage RL alignment process that optimizes helpfulness and harmlessness.
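To make the rule-based reward concrete, here is a toy sketch combining a format reward (the response must wrap its reasoning and final answer in tags) with an accuracy reward (exact match against a reference answer). The tag names follow the convention described for DeepSeek-R1-Zero, but the weights and matching rules are illustrative assumptions, not the exact production scheme:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy rule-based reward: format reward for a well-formed
    <think>...</think><answer>...</answer> structure, plus an
    accuracy reward for an exact-match final answer."""
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            response, re.DOTALL))
    fmt_reward = 0.5 if fmt_ok else 0.0  # weight chosen for illustration
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    acc_reward = 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0
    return fmt_reward + acc_reward
```

Because both checks are deterministic rules rather than a learned reward model, this style of reward is cheap to compute and resistant to reward hacking, which is one reason the article highlights it over process reward models.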

The article also covers model distillation, showing that transferring DeepSeek‑R1’s knowledge to smaller models (e.g., Qwen, Llama) via SFT yields significant gains, whereas pure RL on small models requires massive compute and still falls short of distilled performance.
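The distillation recipe described is hard-label distillation: the large teacher generates reasoning traces, and the smaller student is fine-tuned on them with ordinary SFT (next-token cross-entropy), with no RL applied to the student. A minimal sketch of the data-building step, where `teacher_generate` is a hypothetical stand-in for a call to the teacher model:

```python
def build_distillation_set(prompts, teacher_generate):
    """Collect (prompt, completion) pairs from a teacher model
    (e.g. DeepSeek-R1) for SFT-based distillation of a smaller
    student such as a Qwen or Llama checkpoint."""
    return [{"prompt": p, "completion": teacher_generate(p)}
            for p in prompts]

# The student is then trained with plain next-token cross-entropy
# on these pairs; no critic, reward model, or RL loop is needed.
```

This explains the compute asymmetry the article notes: distillation reuses the teacher's expensive RL training once, whereas running large-scale RL directly on each small model repeats that cost and still underperforms.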

Finally, it discusses limitations such as language mixing, unsuccessful attempts with process reward models and Monte Carlo tree search, and outlines future directions like exploring ensemble learning and further RL alignment.

Tags: DeepSeek-R1, reinforcement learning, Cold Start, GRPO, LLM reasoning, model distillation
Written by

Tencent Technical Engineering

Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
