VeRL-Omni: Universal RL Post‑Training for Diffusion and Multimodal Models
VeRL-Omni is an open‑source RL post‑training framework built on verl and vLLM‑Omni. It provides high‑throughput rollout and flexible reward computation for diffusion, AR‑DiT, and unified multimodal generation models, with modular trainers and support for multiple hardware backends. In the reported benchmarks, overlapping reward inference with policy training cuts per‑step wall‑clock time by about 14%.
VeRL‑Omni Overview
VeRL‑Omni is a universal reinforcement‑learning (RL) post‑training framework for multimodal generation models, built on the verl library and vLLM‑Omni. It supports diffusion transformers such as Qwen‑Image, hybrid AR‑DiT architectures like Qwen‑Omni, and unified understanding‑plus‑generation models (e.g., BAGEL, HunyuanImage‑3.0).
Motivation
Diffusion & multimodal extension: bring the flexibility and performance of verl to non‑autoregressive RL for diffusion and omni‑modal models.
Heterogeneous rollout pipelines: a rollout traverses a latent denoising trajectory and may invoke several components in sequence (text encoder → DiT → VAE).
Complex workload scheduling: reward functions are often multimodal models themselves (VLM judges, OCR scorers), and multimodal rollouts consume far more peak memory than text‑only generation, which makes orchestration non‑trivial.
Key Technical Features
Efficient multimodal rollout: integrates vLLM‑Omni asynchronous high‑throughput serving. Accuracy matches diffusers, while rollout efficiency is improved through step‑wise continuous batching and embedding caching.
Flexible reward engine: supports rule‑based and model‑based rewards (e.g., VLM‑as‑judge for OCR). Reward inference runs on vLLM and overlaps with rollout and training to reduce end‑to‑end latency.
Modular training backend: provides multiple trainers (DiffusersFSDP, Megatron, VeOmni) with built‑in optimizations for diffusion and multimodal models, compatible with parallel strategies such as FSDP, USP, and TP.
Broad hardware compatibility: runs on NVIDIA GPUs and Ascend NPUs, allowing flexible backend switching.
End‑to‑end training recipes and benchmarks: includes reference performance results that demonstrate high training throughput.
Algorithm Support
The framework includes the FlowGRPO algorithm, an online RL method for flow‑matching diffusion models.
Getting Started
Installation instructions: https://verl-omni.readthedocs.io/en/latest/start/install.html
Example scripts for image, audio, and video RL are located in the examples directory of the repository: https://github.com/verl-project/verl-omni/tree/main/examples
Demo: Qwen‑Image FlowGRPO training uses an OCR reward model (Qwen3‑VL‑8B‑Instruct) that reads the rendered text in generated images and scores it against the ground truth.
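To make the idea of scoring recognized text against ground truth concrete, here is a minimal sketch of one possible scoring rule. The source does not state which metric the demo uses. The function names (`levenshtein`, `ocr_reward`) and the choice of normalized edit distance are assumptions for illustration, and the recognized string is assumed to come from the reward model's reading of the image.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ocr_reward(recognized: str, ground_truth: str) -> float:
    """Hypothetical reward in [0, 1]: one minus the normalized edit distance.

    Both strings are lowercased and whitespace-normalized first, so that
    casing and spacing differences are not penalized (an assumption).
    """
    rec = " ".join(recognized.lower().split())
    ref = " ".join(ground_truth.lower().split())
    if not ref:
        return 1.0 if not rec else 0.0
    return max(0.0, 1.0 - levenshtein(rec, ref) / len(ref))
```

A perfect reading scores 1.0, and the score falls toward 0.0 as more characters are wrong, missing, or extra. A graded score like this gives the policy a denser signal than an exact‑match check would.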
FlowGRPO Algorithm Details
Rollout generation: the diffusion policy generates samples, recording log probabilities and image trajectories.
Reward scoring: a reward model assigns a score to each sample, from which a per‑trajectory advantage is computed.
Policy optimization: a PPO‑style clipped loss updates the policy using the computed advantage.
Weight synchronization: trainer weights are periodically synchronized to the rollout workers so that new samples reflect the latest policy.
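The reward‑scoring and policy‑optimization steps above can be sketched in a few lines. This is a simplified illustration, not the framework's implementation: it assumes GRPO‑style group normalization (rewards for samples from the same prompt are standardized against each other) and a standard clipped surrogate applied to one log probability. The names `group_advantages` and `clipped_loss` and the default `clip_range=0.2` are hypothetical, and any KL regularization term is omitted.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of samples.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_loss(new_logp, old_logp, advantage, clip_range=0.2):
    """Clipped surrogate loss for a single recorded log probability.

    The ratio compares the current policy with the policy that produced
    the rollout. Clipping removes the incentive to move the ratio
    outside [1 - clip_range, 1 + clip_range].
    """
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Negated because optimizers minimize, while the surrogate is maximized.
    return -min(ratio * advantage, clipped * advantage)


# Example: four samples for one prompt, two of which rendered the text correctly.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
losses = [clipped_loss(-1.0, -1.1, a) for a in advantages]
```

Because the advantage is relative to the group, samples that score above their siblings are reinforced and the rest are pushed down, with no separate value network required.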
Experimental Results
LoRA Fine‑tuning
On an NVIDIA H800 GPU, training throughput reaches the reported level. Placing the reward model on a separate GPU and overlapping its inference with policy training reduces per‑step wall‑clock time by approximately 14%.
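The overlap described above is a form of pipelining: while the trainer consumes one chunk of scored samples, the reward model is already scoring the next chunk. The sketch below shows that pattern with a background thread. It is a toy model under stated assumptions, not VeRL‑Omni's scheduler: `score_rewards` and `train_step` are hypothetical stand‑ins that sleep to simulate work, and how the framework actually partitions samples into chunks is not described in the source.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def score_rewards(chunk):
    """Hypothetical stand-in for reward-model inference on a separate device."""
    time.sleep(0.01)
    return [float(len(sample)) for sample in chunk]


def train_step(chunk, rewards):
    """Hypothetical stand-in for a policy update; returns the mean reward."""
    time.sleep(0.01)
    return sum(rewards) / len(rewards)


def run_overlapped(chunks):
    """Train on chunk i while chunk i+1 is being scored in the background."""
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(score_rewards, chunks[0])
        for i, chunk in enumerate(chunks):
            rewards = pending.result()  # wait for this chunk's scores
            if i + 1 < len(chunks):
                # Start scoring the next chunk before training on this one.
                pending = pool.submit(score_rewards, chunks[i + 1])
            results.append(train_step(chunk, rewards))
    return results
```

With the two stages on separate devices, the cheaper stage is hidden behind the more expensive one, so the saving depends on how much of each step reward inference originally accounted for.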
Full‑model Fine‑tuning
Non‑CFG full‑model OCR training on four NVIDIA H200 GPUs achieves 0.510 images/GPU/s, with each training step taking about 250 s. After only 120 steps, generated images show a noticeable improvement in text rendering quality. Training curves indicate that both the critic reward and the validation reward converge stably.
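As a rough sanity check on these figures, the reported throughput and step time can be combined into a per‑step image count and a total wall‑clock estimate. The calculation assumes that the throughput figure is averaged over the whole training step and that the step time stays constant at 250 s. Neither assumption is confirmed by the source, and the variable names are illustrative.

```python
# Figures reported in the text above.
images_per_gpu_per_s = 0.510
num_gpus = 4
step_seconds = 250
num_steps = 120

# Implied images processed per step across all GPUs (assumption: the
# throughput is measured end to end over the full step).
images_per_step = images_per_gpu_per_s * num_gpus * step_seconds

# Total wall-clock time for the 120 steps mentioned, in hours.
total_hours = num_steps * step_seconds / 3600
```

Under these assumptions each step covers roughly 510 images, and 120 steps correspond to 30,000 s, or a little over eight hours of training.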
Roadmap
Expand model support to emerging diffusion and multimodal models for image, video, and audio generation, as well as unified tasks.
Integrate additional advanced RL algorithms such as DiffusionNFT.
Develop fully asynchronous RL pipelines that further increase rollout throughput and hardware utilization.
Deepen co‑optimization with vLLM‑Omni (parallelism, quantization, batching, scheduling).
Release more highly optimized trainer engines beyond DiffusersFSDPTrainer.
Broaden hardware support, continuing work on Ascend NPU paths and inviting community‑built hardware plugins.
