How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

Knowledge Distillation · Model Compression · OPD

20 min read

Machine Learning Algorithms & Natural Language Processing

May 1, 2026 · Artificial Intelligence

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

The article analyzes DeepSeek V4’s post‑training pipeline, explains how multi‑expert on‑policy distillation (OPD) differs from traditional teacher‑forcing, compares reverse‑KL and forward‑KL objectives, and uses analogies to human learning to illustrate the benefits and limits of OPD.

DeepSeek V4 · LLM training · Multi-Expert Models

11 min read

Machine Learning Algorithms & Natural Language Processing

Apr 12, 2026 · Artificial Intelligence

Deep Dive into Forward vs Reverse KL Divergence: When to Use Which?

The article explains the definitions, properties, and asymmetric nature of KL divergence, compares Forward KL (mean‑seeking) and Reverse KL (mode‑seeking) through bimodal examples, and provides practical guidelines for choosing between them based on sampling and probability‑evaluation capabilities in machine‑learning tasks.

Forward KL · KL divergence · Machine Learning

10 min read
