How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

Baobao Algorithm Notes

May 26, 2026

In practice, most of the effort in developing large language models goes into data, especially in post-training, where merging capabilities from different contributors is crucial. The simplest approaches have each contributor supply SFT data or average adapter weights, and they have been the default for the past two years.

Recent technical reports from Qwen3, GLM-5, MiMo-V2 and DeepSeek-V4 all adopt a newer technique, On-Policy Distillation (OPD), which has become a common solution even though the four teams apply it in different scenarios.

OPD targets three long-standing problems in post-training: low signal density, distribution mismatch, and capability interference. It does so by integrating knowledge in logit space rather than parameter space, which makes abilities easier to merge, transfer, and retain.

To understand OPD, the article first reviews the three stages of LLM training: pre‑training (massive token consumption), mid‑training (domain‑specific data), and post‑training (behavioral fine‑tuning). Post‑training is likened to an internship where models learn to follow instructions, solve math problems, and chat.

Two traditional post‑training routes are described:

On-policy training: the model generates its own trajectories and receives a single sparse reward per episode.

Off-policy training: the model imitates pre-collected answers (SFT), which can cause exposure bias and style mismatch.

Both have advantages and hard limits: on‑policy provides self‑generated data but suffers from sparse feedback; off‑policy offers dense feedback but can misalign with the model’s own distribution.

OPD merges the strengths of both by having a teacher model score every token of the student's sampled trajectory, providing dense per-step feedback while staying on the student's own distribution. The student minimizes the reverse KL divergence from its own next-token distribution to the teacher's, evaluated on the tokens it sampled.
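Written out, the objective described above might look as follows. This is a sketch under assumed notation that is not taken from any of the cited reports: $\pi_\theta$ is the student, $\pi_T$ the teacher, $x$ a prompt from a dataset $\mathcal{D}$, $y$ a response sampled from the student, and $\mathcal{V}$ the vocabulary. The second expression is the per-position KL, and its last line is the single-sample estimate at the sampled token $y_t$.

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \right)
    \right]

\mathrm{KL}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \right)
  = \sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t})
    \left[ \log \pi_\theta(v \mid x, y_{<t}) - \log \pi_T(v \mid x, y_{<t}) \right]
  \approx \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_T(y_t \mid x, y_{<t})
```

The expectation is taken over the student's own samples, which is what makes the method on-policy; swapping the two arguments of the KL would give the forward KL used in ordinary distillation.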

Reverse KL is preferred over forward KL because it is mode-seeking: it concentrates the student on the teacher's high-probability outputs rather than spreading mass over everything the teacher might say. This property is especially suitable for tasks with a single correct solution, such as math reasoning. The use of reverse KL for LLM distillation was first systematized in MiniLLM (ICLR 2024), which showed larger gains for smaller students.
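The mode-seeking behaviour can be seen in a toy setting. The sketch below is an illustration only, not anything from the cited reports: the "teacher" is an assumed two-mode distribution on a 1-D grid, the "student" is restricted to a single unit-variance Gaussian, and a grid search finds the student mean that minimizes each direction of the KL.

```python
import numpy as np

def normalized_gaussian(grid, mean, std=1.0):
    """Discretize a Gaussian on `grid` and normalize it to sum to 1."""
    density = np.exp(-0.5 * ((grid - mean) / std) ** 2)
    return density / density.sum()

def kl(a, b):
    """KL(a || b) for two strictly positive discrete distributions."""
    return float(np.sum(a * (np.log(a) - np.log(b))))

grid = np.linspace(-10.0, 10.0, 2001)

# Toy "teacher": two equally likely modes at -3 and +3 (assumed for illustration).
teacher = 0.5 * normalized_gaussian(grid, -3.0) + 0.5 * normalized_gaussian(grid, 3.0)

# Toy "student" family: a single unit-variance Gaussian with a free mean.
candidate_means = np.linspace(-5.0, 5.0, 101)
students = [normalized_gaussian(grid, m) for m in candidate_means]

forward = [kl(teacher, q) for q in students]  # KL(teacher || student)
reverse = [kl(q, teacher) for q in students]  # KL(student || teacher)

best_forward_mean = float(candidate_means[int(np.argmin(forward))])
best_reverse_mean = float(candidate_means[int(np.argmin(reverse))])

print(f"forward KL picks mean {best_forward_mean:+.1f}")
print(f"reverse KL picks mean {best_reverse_mean:+.1f}")
```

In this setup the forward KL settles between the two modes, where the teacher puts almost no mass, while the reverse KL commits to one of the modes. Which of the two symmetric modes is reported depends on floating-point ties.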

OPD's loss is simply the per-token reverse KL between the student's and the teacher's next-token distributions. If an existing RL framework is available, integrating OPD requires only a one-line change: replace the group-normalized advantage with the negative reverse KL, that is, the teacher's log-probability of each sampled token minus the student's.

# Initialize teacher client (main):
teacher_client = service_client.create_sampling_client(
    base_model=teacher_config.base_model,
    model_path=teacher_config.load_checkpoint_path,
)

# Sample trajectories (main):
trajectories = do_group_rollout(student_client, env_group_builder)
sampled_logprobs = trajectories.loss_fn_inputs["logprobs"]

# Compute reward (compute_teacher_reverse_kl):
teacher_logprobs = teacher_client.compute_logprobs(trajectories)
reverse_kl = sampled_logprobs - teacher_logprobs
trajectories["advantages"] = -reverse_kl

# Train with RL (train_step):
training_client.forward_backward(trajectories, loss_fn="importance_sampling")
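To make the "one-line change" concrete without depending on any particular training framework, here is a minimal NumPy sketch of the two advantage computations. The function names and the toy numbers are hypothetical and chosen only for illustration; the group-normalized form is the common mean/std normalization over rollouts of one prompt.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-6):
    """One scalar advantage per trajectory: (reward - group mean) / group std."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def opd_advantages(student_logprobs, teacher_logprobs):
    """One advantage per sampled token: the negative per-token reverse-KL estimate."""
    student_logprobs = np.asarray(student_logprobs, dtype=np.float64)
    teacher_logprobs = np.asarray(teacher_logprobs, dtype=np.float64)
    return teacher_logprobs - student_logprobs

# Four rollouts for one prompt, each with a single outcome reward.
rl_adv = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# One rollout of five tokens: log-probs of the sampled tokens under each model.
student_lp = np.log([0.60, 0.50, 0.90, 0.30, 0.70])
teacher_lp = np.log([0.55, 0.70, 0.90, 0.02, 0.80])
token_adv = opd_advantages(student_lp, teacher_lp)

print("RL advantages (one per trajectory):", np.round(rl_adv, 3))
print("OPD advantages (one per token):   ", np.round(token_adv, 3))
print("most penalized token index:", int(np.argmin(token_adv)))
```

The RL variant yields one number per trajectory, so every token in a rollout shares the same credit. The OPD variant yields one number per token, so the token the teacher considers unlikely is singled out directly.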

Benchmark results on the AIME’24 math reasoning suite (starting from the same off‑policy distillation checkpoint) show:

Off-policy distillation: 60% score, 1× compute.

Pure RL: 67.6% score, 10× compute.

OPD: 74.4% score, 1× compute.

Qwen3 reports that OPD reduces the total GPU time for training a suite of lightweight models to only one‑tenth of the full four‑stage RL pipeline, and the signal density of OPD is estimated to be 50–100× higher than pure RL.

The article then details four design divergences among the four teams:

KL granularity: most teams use token-level KL, a Monte Carlo estimate computed only at the sampled token, which keeps memory cost low. DeepSeek-V4 uses full-vocabulary KL, which requires custom kernels, teacher weight scheduling, and hidden-state caching to handle the much larger memory demand.

Additional reward signals: GLM-5 and DeepSeek-V4 rely solely on KL; MiMo-V2 adds an Outcome Reward Model (ORM) on top of KL, showing that KL accelerates convergence while the ORM aligns final answers.

Teacher selection: GLM-5 reuses a checkpoint from its own earlier RL stage; MiMo-V2 combines multiple expert SFT models, domain-specific teachers, and even the student itself; DeepSeek-V4 employs more than ten trillion-parameter experts, each in three inference-strength variants; Qwen3 uses its flagship model as the teacher for students ranging from 0.6B to 30B parameters.

Pipeline placement: Qwen3 inserts OPD in a lightweight sub-pipeline to replace the costly fourth RL stage; GLM-5 places OPD as the final stage to recover forgotten reasoning ability; MiMo-V2 puts OPD in the third stage to integrate capabilities after domain-specific RL; DeepSeek-V4 makes OPD a unified stage, abandoning mixed RL entirely.
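The difference between the two KL granularities in the first point can be checked numerically. The following sketch uses random toy logits over an assumed 50-token vocabulary at a single position; it is not any team's implementation. It compares the exact full-vocabulary reverse KL with the token-level estimate that only needs the teacher's log-probability of each sampled token.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
vocab_size = 50  # toy vocabulary; real vocabularies are far larger

# Log-probabilities of student and teacher at ONE position (hypothetical logits).
student_logp = log_softmax(rng.normal(size=vocab_size))
teacher_logp = log_softmax(rng.normal(size=vocab_size))
student_p = np.exp(student_logp)

# Full-vocabulary reverse KL: needs the teacher's whole distribution.
full_kl = float(np.sum(student_p * (student_logp - teacher_logp)))

# Token-level estimate: needs only the teacher log-prob of each sampled token.
samples = rng.choice(vocab_size, size=200_000, p=student_p)
per_sample = student_logp[samples] - teacher_logp[samples]
sampled_kl = float(per_sample.mean())

print(f"full-vocabulary KL : {full_kl:.4f}")
print(f"sampled estimate   : {sampled_kl:.4f}")
print(f"single-sample std  : {per_sample.std():.4f}")
```

Averaged over many samples the token-level estimate agrees with the full-vocabulary value, but each individual sample is noisy. That is the trade being made: token-level KL is cheap and unbiased, while full-vocabulary KL removes the sampling noise at the cost of materializing a full distribution per position.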

Specific outcomes:

Qwen3's two-stage approach (off-policy distillation followed by OPD) enables a 30B student to inherit the reasoning power of a 32B model with only 1/10 of the GPU time.

GLM-5's four-stage RL pipeline suffers from catastrophic forgetting in its final General RL stage; adding OPD as a final "recovery" stage restores the lost reasoning ability. Because OPD needs no group baseline, this stage can run with group size = 1 and batch size = 1024 for higher throughput.

MiMo‑V2’s three‑stage pipeline resolves the “see‑saw” effect (trade‑off between math and code abilities) by keeping parameter space separate for each domain and merging in logit space via OPD; the student even surpasses individual teachers on AIME 2025 (+0.2) and HMMT Feb 2025 (+1.8).

DeepSeek‑V4’s full‑vocabulary KL and engineering stack enable knowledge compression from dozens of expert models, achieving superior performance over GPT‑5.2 and Gemini‑3.0‑Pro on knowledge‑intensive tasks.
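Merging several teachers in logit space, as in the MiMo-V2 and DeepSeek-V4 outcomes above, can be pictured as routing each sampled trajectory to a teacher and using that teacher's log-probabilities for the per-token advantage. The sketch below is one plausible arrangement, assuming routing by a domain tag; the function name, the data layout, and the constant-probability toy teachers are all hypothetical and not taken from either report.

```python
import numpy as np

def multi_teacher_advantages(trajectories, teachers):
    """Per-token advantages where each trajectory is scored by its domain's teacher."""
    advantages = []
    for traj in trajectories:
        score_fn = teachers[traj["domain"]]    # route by domain tag (assumed scheme)
        teacher_lp = score_fn(traj["tokens"])  # teacher log-probs of the sampled tokens
        advantages.append(teacher_lp - traj["student_logprobs"])
    return advantages

# Toy stand-ins for domain teachers: each returns a fixed log-prob per token.
teachers = {
    "math": lambda tokens: np.full(len(tokens), np.log(0.8)),
    "code": lambda tokens: np.full(len(tokens), np.log(0.6)),
}

# Student rollouts, each tagged with the domain of its prompt.
trajectories = [
    {"domain": "math", "tokens": [11, 42, 7], "student_logprobs": np.log([0.5, 0.9, 0.8])},
    {"domain": "code", "tokens": [3, 19], "student_logprobs": np.log([0.6, 0.2])},
]

advantages = multi_teacher_advantages(trajectories, teachers)
for traj, adv in zip(trajectories, advantages):
    print(traj["domain"], np.round(adv, 3))
```

Only the student's parameters are updated, so the domain teachers never have to be merged in parameter space; their knowledge reaches the student solely through the log-probabilities they assign to its samples.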

Despite its strengths, OPD has limitations:

It relies entirely on teacher quality; defects in the teacher are propagated.

It lacks exploration capability, so it is often paired with RL for novelty discovery.

Long‑sequence training incurs linear memory growth because the teacher must process every token of the student’s trajectory.

Student initialization must be strong enough to generate useful trajectories; otherwise the KL signal degrades to noise correction.
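The linear growth in the third point, and the gap between token-level and full-vocabulary KL noted earlier, can be estimated with simple arithmetic. The sketch below counts only the teacher outputs kept for one trajectory, assuming 4-byte values and an illustrative vocabulary of 150,000 entries; both figures are assumptions, and activations, KV cache, and weights are deliberately ignored.

```python
def teacher_logit_bytes(seq_len, vocab_size, bytes_per_value=4):
    """Bytes needed to hold one full teacher distribution per token of one trajectory."""
    return seq_len * vocab_size * bytes_per_value

def sampled_logprob_bytes(seq_len, bytes_per_value=4):
    """Bytes needed when only the sampled token's teacher log-prob is kept."""
    return seq_len * bytes_per_value

GIB = 1024 ** 3
vocab_size = 150_000  # assumed vocabulary size, for illustration only

for seq_len in (4_096, 32_768, 131_072):
    full = teacher_logit_bytes(seq_len, vocab_size) / GIB
    sampled = sampled_logprob_bytes(seq_len) / 1024
    print(f"{seq_len:>7} tokens: full-vocab {full:7.2f} GiB, sampled-token {sampled:7.1f} KiB")
```

Both quantities grow linearly with sequence length, but the full-vocabulary variant carries an extra factor equal to the vocabulary size, which is why it calls for the dedicated engineering described for DeepSeek-V4.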

Overall, OPD provides a cheap yet powerful approach to post‑training by combining on‑policy sampling with dense reward signals, offering a scalable method to compress and integrate capabilities in the logit space.

References

Qwen3 Technical Report (arXiv 2505.09388)

GLM‑5: from Vibe Coding to Agentic Engineering (arXiv 2602.15763)

MiMo‑V2‑Flash Technical Report (arXiv 2601.02780)

DeepSeek‑V4 Technical Report

On-Policy Distillation, Thinking Machines blog (thinkingmachines.ai/blog/on-policy-distillation/)

MiniLLM: Knowledge Distillation of Large Language Models (ICLR 2024)

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (DAgger; AISTATS 2011)

Process Reward Modeling: Learning to Verify without Multi‑Agent Oracles

