How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

Knowledge Distillation · Model Compression · OPD

20 min read

Machine Learning Algorithms & Natural Language Processing

Apr 14, 2026 · Artificial Intelligence

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On‑Policy Distillation (OPD) is widely used for post‑training large language models, but the sampled‑token variant often becomes unstable because of token‑level reward imbalance, teacher‑student signal mismatch on student‑generated prefixes, and tokenizer mismatches. This article analyses the bias‑variance trade‑off, identifies three root failure modes, and proposes a teacher‑top‑K local‑support‑set objective with top‑p rollout and special‑token masking that yields more stable training and better performance on both math and agentic benchmarks.

OPD · large language models · on-policy distillation

32 min read

Machine Heart

Apr 14, 2026 · Artificial Intelligence

Why Binary Success Rate Is Obsolete: Introducing PRM-as-a-Judge for Dense Evaluation of Embodied Tasks

The article critiques binary success rate for long‑horizon robotic tasks, proposes the PRM-as-a-Judge framework with a potential‑based progress signal and the three‑layer OPD metric suite, validates it on the RoboPulse benchmark, and shows how it yields fine‑grained, diagnostic insights into policy performance.

Embodied AI · OPD · RoboPulse

20 min read
