DiffusionOPD: A New On‑Policy Distillation Paradigm for Multi‑Task Diffusion Models
DiffusionOPD is a unified on‑policy distillation framework for diffusion models that decouples single‑task on‑policy exploration from multi‑task capability integration. It trains an expert teacher for each task and distills their skills into a single student model, which converges faster and reaches higher performance than multi‑task RL baselines on composition, OCR, and aesthetic tasks.
Recent advances in diffusion models have improved individual capabilities such as text rendering quality, composition accuracy, and aesthetic appeal, but integrating these abilities into one model remains difficult because tasks interfere with each other and training objectives become unstable.
Researchers from Fudan University and Alibaba Tongyi Wanxiang argue that multi‑task reinforcement learning should be split into two independent processes: (1) single‑task on‑policy exploration and (2) multi‑task capability integration.
Based on this view, they propose DiffusionOPD, a unified perspective on On‑Policy Distillation (OPD) for diffusion models, supported by both theoretical analysis and experiments.
The core idea is to first train separate expert teacher models for each task (e.g., GenEval with DiffusionNFT, OCR and Aesthetic with GRPO‑Guard). Because each teacher focuses on a single task, cross‑task interference is avoided.
Then, a student model initialized from a pretrained diffusion model undergoes on‑policy distillation: the student generates denoising trajectories for each task, and the corresponding teacher provides supervision at every denoising step. This allows the student to inherit all teachers’ policies without re‑exploring each task from scratch.
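The distillation loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: states are scalars, each teacher is a fixed linear transition mean, the student holds one `(a, b)` pair per task in place of a prompt‑conditioned network, and the names `TEACHERS`, `SIGMA`, `NUM_STEPS`, and `train_student` are hypothetical. The point it shows is that states are sampled from the student's own transitions while the matching teacher is queried at those same states.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.5      # shared transition noise scale (assumed equal for student and teachers)
NUM_STEPS = 5    # length of each toy "denoising" trajectory

# Hypothetical frozen teachers: one linear transition mean a * x + b per task.
TEACHERS = {"task_a": (0.5, 1.0), "task_b": (-0.3, -0.5)}


def train_student(iterations=2000, batch=32, lr=0.02):
    # Toy student: one (a, b) pair per task, standing in for a conditioned network.
    student = {task: np.zeros(2) for task in TEACHERS}
    for _ in range(iterations):
        for task, (a_t, b_t) in TEACHERS.items():
            w = student[task]
            x = rng.standard_normal(batch)            # start each trajectory from noise
            grad = np.zeros(2)
            for _ in range(NUM_STEPS):
                mu_s = w[0] * x + w[1]                # student mean at its own state
                mu_t = a_t * x + b_t                  # teacher mean at the same state
                # Derivative of (mu_s - mu_t)^2 / (2 * SIGMA^2) with respect to mu_s;
                # the visited states x are treated as constants.
                diff = (mu_s - mu_t) / SIGMA**2
                grad += np.array([np.mean(diff * x), np.mean(diff)])
                # On-policy rollout: the next state comes from the student, not the teacher.
                x = mu_s + SIGMA * rng.standard_normal(batch)
            student[task] = w - lr * grad / NUM_STEPS
    return student


student = train_student()
for task, params in student.items():
    print(task, "student:", np.round(params, 3), "teacher:", TEACHERS[task])
```

In this toy setting the student recovers each teacher's parameters because the teacher's mean is exactly representable by the student; a real student network would only approximate its teachers.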
To formulate the OPD objective for diffusion, the authors treat the denoising process as a continuous‑state Markov chain, where each transition is a Gaussian kernel. Both student and teacher define their own transition distributions. Because the transition covariances are identical, the reverse‑KL objective reduces to an analytically tractable mean‑matching loss with zero Monte‑Carlo variance.
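The reduction to mean matching follows from the standard KL divergence between two Gaussians with equal covariance. The derivation below is a sketch in assumed notation, since the summary does not give the paper's symbols: $\mu_\theta$ and $\mu_T$ denote the student and teacher transition means at the same state $x_t$, and $\sigma_t^2 I$ is the shared covariance.

```latex
\begin{aligned}
p_\theta(x_{t-1}\mid x_t) &= \mathcal{N}\!\left(\mu_\theta(x_t,t),\ \sigma_t^2 I\right),
\qquad
q_T(x_{t-1}\mid x_t) = \mathcal{N}\!\left(\mu_T(x_t,t),\ \sigma_t^2 I\right), \\[4pt]
\mathrm{KL}\!\left(p_\theta \,\|\, q_T\right)
&= \mathbb{E}_{x\sim p_\theta}\!\left[\frac{\lVert x-\mu_T\rVert^2-\lVert x-\mu_\theta\rVert^2}{2\sigma_t^2}\right] \\
&= \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\!\left[\frac{\lVert \mu_\theta-\mu_T\rVert^2 + 2\sigma_t\,\epsilon^{\top}(\mu_\theta-\mu_T)}{2\sigma_t^2}\right]
\qquad (x=\mu_\theta+\sigma_t\epsilon) \\
&= \frac{\lVert \mu_\theta(x_t,t)-\mu_T(x_t,t)\rVert^2}{2\sigma_t^2}.
\end{aligned}
```

The normalizing constants cancel because the covariances match, and the cross term vanishes because $\mathbb{E}[\epsilon]=0$. No sample of $x_{t-1}$ is needed to evaluate the final expression, which is why the per‑step loss carries no Monte‑Carlo variance.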
The framework also unifies stochastic SDE samplers and deterministic ODE samplers; under ODE the loss becomes an L2 match between means.
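A minimal sketch of a per‑step loss covering both sampler types, assuming isotropic noise with scale `sigma` and treating `sigma == 0` as the deterministic ODE case. The function names, the example vectors, and the choice to leave the ODE loss unscaled are assumptions for illustration; the summary does not state the paper's exact weighting. A Monte‑Carlo estimate of the same reverse KL is included only to check the closed form.

```python
import numpy as np


def transition_loss(mu_student, mu_teacher, sigma):
    """Per-step distillation loss between student and teacher transition means."""
    sq_dist = float(np.sum((mu_student - mu_teacher) ** 2))
    if sigma == 0.0:
        # Deterministic (ODE-style) step: plain L2 match between the means.
        return sq_dist
    # Stochastic (SDE-style) step: closed-form reverse KL for equal covariance sigma^2 * I.
    return sq_dist / (2.0 * sigma**2)


def monte_carlo_kl(mu_student, mu_teacher, sigma, n=200_000, seed=0):
    """Sampled estimate of KL(student || teacher), used here only as a sanity check."""
    rng = np.random.default_rng(seed)
    x = mu_student + sigma * rng.standard_normal((n, mu_student.size))
    log_ratio = (
        np.sum((x - mu_teacher) ** 2, axis=1) - np.sum((x - mu_student) ** 2, axis=1)
    ) / (2.0 * sigma**2)
    return float(log_ratio.mean())


mu_s = np.array([0.2, -0.1, 0.4])
mu_t = np.array([0.0, 0.3, 0.1])

closed_form = transition_loss(mu_s, mu_t, sigma=0.5)
sampled = monte_carlo_kl(mu_s, mu_t, sigma=0.5)
ode_loss = transition_loss(mu_s, mu_t, sigma=0.0)
print(closed_form, sampled, ode_loss)
```

The sampled estimate fluctuates around the closed‑form value from seed to seed, while the closed form is exact.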
The paper also compares the closed‑form objective with a PPO‑style policy‑gradient approach. It proves that the closed‑form KL gradient and the PPO gradient are equal in expectation, but the PPO gradient carries an extra score‑function term whose expectation is zero and whose variance is not, which makes PPO gradients noisier. Moreover, PPO requires log‑probabilities and probability ratios, which are undefined for deterministic ODE samplers, so PPO is limited to SDE samplers.
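The variance argument can be illustrated in one dimension. The sketch below is not the paper's PPO estimator (it has no clipping and no probability ratio); it assumes a plain score‑function (REINFORCE‑style) estimator of the same reverse‑KL gradient, with a hypothetical student mean `theta`, teacher mean `m`, and shared noise scale `sigma`.

```python
import numpy as np


def gradient_estimates(theta, m, sigma, n=200_000, seed=0):
    """Compare the exact reverse-KL gradient with per-sample score-function estimates."""
    rng = np.random.default_rng(seed)
    x = theta + sigma * rng.standard_normal(n)        # samples from the student N(theta, sigma^2)

    # Exact gradient of (theta - m)^2 / (2 * sigma^2): no sampling involved.
    closed_form = (theta - m) / sigma**2

    # Score-function estimator: (log p_theta(x) - log q(x)) * d/dtheta log p_theta(x).
    log_ratio = ((x - m) ** 2 - (x - theta) ** 2) / (2.0 * sigma**2)
    score = (x - theta) / sigma**2
    return closed_form, log_ratio * score, score


closed_form, sf_samples, score = gradient_estimates(theta=1.0, m=0.0, sigma=1.0)
print("closed-form gradient:       ", closed_form)
print("score-function mean:        ", sf_samples.mean())
print("score-function variance:    ", sf_samples.var())
print("mean of the bare score term:", score.mean())
```

Both estimators agree on average, but each score‑function sample is noisy, whereas the closed‑form gradient is the same number on every evaluation.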
Extensive experiments compare DiffusionOPD with prior multi‑task RL baselines (Joint Multi‑Task Optimization and Cascade RL) and with single‑task teacher models. Quantitative results show that DiffusionOPD converges faster and reaches higher performance ceilings on GenEval, OCR, and aesthetic tasks. Qualitative visual comparisons illustrate superior generation quality across all tasks.
Additional ablation studies fix the same set of expert teachers and distill them using DiffusionOPD, DMD, TDM, and SFT. The results confirm that DiffusionOPD consistently achieves faster convergence and higher final performance than the other distillation methods.
Finally, loss‑function and sampler‑type ablations show that the closed‑form KL objective outperforms PPO‑style gradients, and that lowering the sampler noise level (which in the limit turns an SDE sampler into an ODE sampler) leads to faster convergence and higher performance ceilings.
