Tagged articles
3 articles
Baobao Algorithm Notes
May 26, 2026 · Artificial Intelligence

How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

Knowledge Distillation · Model Compression · OPD
0 likes · 20 min read
Machine Learning Algorithms & Natural Language Processing
May 1, 2026 · Artificial Intelligence

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

The article analyzes DeepSeek V4’s post‑training pipeline, explains how multi‑expert on‑policy distillation (OPD) differs from traditional teacher‑forcing, compares reverse‑KL and forward‑KL objectives, and uses analogies to human learning to illustrate the benefits and limits of OPD.

DeepSeek V4 · LLM training · Multi-Expert Models
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
Apr 12, 2026 · Artificial Intelligence

Deep Dive into Forward vs Reverse KL Divergence: When to Use Which?

The article explains the definitions, properties, and asymmetric nature of KL divergence, compares Forward KL (mean‑seeking) and Reverse KL (mode‑seeking) through bimodal examples, and provides practical guidelines for choosing between them based on sampling and probability‑evaluation capabilities in machine‑learning tasks.

Forward KL · KL divergence · Machine Learning
0 likes · 10 min read