vLLM Introduces Native RL API for Seamless Weight Synchronization

vLLM's new native RL API introduces a four-stage weight-transfer protocol, pluggable transfer backends, and a keep-mode pause/resume mechanism, together with a two-phase pause protocol that removes a deadlock in DPEP deployments. Large-scale runs with SkyRL and Prime-RL show the API working reliably in practice.

Old Zhang's AI Learning

Four‑Stage Weight Transfer Protocol

The protocol covers the full lifecycle of weight synchronization between trainer and inference workers.

Stage 1 – init_weight_transfer_engine: Called once before training starts to create a communication channel. With the NCCL backend, trainer rank 0 joins a shared NCCL process group with all inference workers.

Stage 2 – start_weight_update: Invoked after each training step (or batch) to prepare the inference worker for receiving new weights.

Stage 3 – update_weights: Core transfer step. Supports full-model or chunked transmission. Two backends are available:

NCCL: Uses NCCL broadcast for scenarios where training and inference run on different GPUs.

IPC: Uses CUDA IPC shared memory for co-located training and inference on the same GPU.

Both backends support packed tensor transmission to reduce serialization overhead.

Stage 4 – finish_weight_update: Performs post-processing such as FP8 quantization and converts the checkpoint format to the kernel format required by the inference engine.
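
The packed tensor transmission mentioned under Stage 3 can be illustrated with a small sketch. This is not vLLM's implementation: the helpers pack_tensors and unpack_tensors, the float32-only layout, and the manifest format are assumptions made purely for illustration. The idea is that many small tensors are flattened into one contiguous buffer plus a short manifest, so the transport handles one payload instead of one message per tensor.

```python
from array import array


def pack_tensors(named_tensors):
    """Flatten {name: list of floats} into one float32 payload plus a manifest."""
    buffer = array("f")
    manifest = []  # one (name, offset, length) entry per tensor, in elements
    for name, values in named_tensors.items():
        manifest.append((name, len(buffer), len(values)))
        buffer.extend(values)
    return buffer.tobytes(), manifest


def unpack_tensors(payload, manifest):
    """Rebuild the {name: list of floats} mapping on the receiving side."""
    buffer = array("f")
    buffer.frombytes(payload)
    return {name: list(buffer[off:off + n]) for name, off, n in manifest}


# Hypothetical parameter names; the values are exactly representable in float32.
weights = {"layer0.weight": [0.5, -1.0, 2.0], "layer0.bias": [0.25]}
payload, manifest = pack_tensors(weights)
restored = unpack_tensors(payload, manifest)
```

A real backend would pack device tensors of mixed dtypes and send the buffer over NCCL or CUDA IPC; the sketch only shows the bookkeeping that makes a single transfer possible.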

Configuration example (Python):

from vllm import LLM
from vllm.config import WeightTransferConfig

llm = LLM(
    model="my-model",
    weight_transfer_config=WeightTransferConfig(backend="nccl"),
)

Command‑line example:

vllm serve my-model \
    --weight-transfer-config '{"backend": "nccl"}'
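
The ordering of the four stages can be made concrete with a toy state tracker. The stage names come from the protocol described above, but the class ToyWeightSync, its fields, and its error handling are hypothetical and say nothing about how vLLM enforces the order internally.

```python
class ToyWeightSync:
    """Tracks which stage of the four-stage protocol may be called next."""

    def __init__(self):
        self.initialized = False
        self.updating = False
        self.received = []
        self.version = 0

    def init_weight_transfer_engine(self):
        # Stage 1: runs once, before any update.
        if self.initialized:
            raise RuntimeError("transfer engine already initialized")
        self.initialized = True

    def start_weight_update(self):
        # Stage 2: opens one update round.
        if not self.initialized or self.updating:
            raise RuntimeError("cannot start an update in this state")
        self.updating = True
        self.received = []

    def update_weights(self, chunk):
        # Stage 3: may be called several times for chunked transmission.
        if not self.updating:
            raise RuntimeError("update_weights called outside an update round")
        self.received.append(chunk)

    def finish_weight_update(self):
        # Stage 4: closes the round; post-processing would happen here.
        if not self.updating:
            raise RuntimeError("no update round to finish")
        self.updating = False
        self.version += 1
        return self.version


sync = ToyWeightSync()
sync.init_weight_transfer_engine()
for _ in range(2):  # two training steps, two chunks each
    sync.start_weight_update()
    sync.update_weights("chunk-a")
    sync.update_weights("chunk-b")
    sync.finish_weight_update()
```

Stage 1 runs once, while Stages 2 to 4 repeat for every training step, which is why the loop wraps only the last three calls.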

Pluggable Backend Design

The API abstracts the transfer logic into WeightTransferEngine. Custom backends can be created by subclassing this engine and implementing init_transfer_engine and receive_weights, then registering the engine with WeightTransferEngineFactory.register_engine.

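from vllm.distributed.weight_transfer import WeightTransferEngineFactory

# MyWeightTransferEngine is a user-defined WeightTransferEngine subclass
# (not shown here) that implements init_transfer_engine and receive_weights.
WeightTransferEngineFactory.register_engine("my_backend", MyWeightTransferEngine)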
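
The subclass-and-register pattern can be sketched without importing vLLM. Everything below is a stand-in: ToyWeightTransferEngine, ToyEngineFactory, InMemoryEngine, and the argument names init_info and update_info are assumptions, and the real WeightTransferEngine interface may take different parameters. Only the method names init_transfer_engine and receive_weights and the register_engine call are taken from the text.

```python
from abc import ABC, abstractmethod


class ToyWeightTransferEngine(ABC):
    """Stand-in base class with the two hooks a backend has to implement."""

    @abstractmethod
    def init_transfer_engine(self, init_info):
        ...

    @abstractmethod
    def receive_weights(self, update_info):
        ...


class ToyEngineFactory:
    """Stand-in registry that maps a backend name to an engine class."""

    _registry = {}

    @classmethod
    def register_engine(cls, name, engine_cls):
        if name in cls._registry:
            raise ValueError(f"backend {name!r} is already registered")
        if not issubclass(engine_cls, ToyWeightTransferEngine):
            raise TypeError("engine must subclass ToyWeightTransferEngine")
        cls._registry[name] = engine_cls

    @classmethod
    def create_engine(cls, name):
        return cls._registry[name]()


class InMemoryEngine(ToyWeightTransferEngine):
    """Trivial backend that stores whatever it receives in a dict."""

    def __init__(self):
        self.ready = False
        self.weights = {}

    def init_transfer_engine(self, init_info):
        self.ready = True

    def receive_weights(self, update_info):
        if not self.ready:
            raise RuntimeError("engine not initialized")
        self.weights.update(update_info)


ToyEngineFactory.register_engine("in_memory", InMemoryEngine)
engine = ToyEngineFactory.create_engine("in_memory")
engine.init_transfer_engine({})
engine.receive_weights({"w": [1.0, 2.0]})
```

Selecting a backend by name is what lets a configuration string such as "nccl" or "my_backend" pick the engine without any change to the calling code.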

vLLM provides a prototype based on the Etha project that demonstrates M‑to‑N sharded weight transfer, avoiding the bandwidth waste of broadcasting the full model to every inference worker.
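
To see why M-to-N sharded transfer saves bandwidth, consider a deliberately simplified 1-D model: a flat parameter vector split evenly across M trainer ranks and, independently, across N inference workers. The functions shard_ranges and transfer_plan are hypothetical helpers, not part of the prototype; real sharding is per tensor and along specific dimensions.

```python
def shard_ranges(total, parts):
    """Split [0, total) into `parts` contiguous ranges, as evenly as possible."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def transfer_plan(total, m_trainers, n_workers):
    """Return (src_rank, dst_rank, start, end) for every overlapping slice."""
    plan = []
    for src, (s0, s1) in enumerate(shard_ranges(total, m_trainers)):
        for dst, (d0, d1) in enumerate(shard_ranges(total, n_workers)):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # the trainer shard holds part of what this worker needs
                plan.append((src, dst, lo, hi))
    return plan


plan = transfer_plan(total=12, m_trainers=2, n_workers=3)
sharded_elements = sum(end - start for _, _, start, end in plan)
broadcast_elements = 12 * 3  # every worker receives the full vector
```

In this toy case the sharded plan moves 12 elements in total, while broadcasting the full vector to three workers moves 36, and the gap widens as the number of inference workers grows.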

Asynchronous RL: keep mode and DPEP deadlock fix

vLLM adds pause_generation and resume_generation endpoints (HTTP POST /pause and /resume). Three modes are supported:

abort: Terminates all in-flight requests; clients must retry.

wait: Waits for all requests to finish before updating weights; clients need not retry but cannot perform asynchronous RL.

keep: Freezes ongoing requests without discarding them; clients need not retry and asynchronous RL is possible. The clear_cache flag controls whether the KV cache is cleared during pause.

await engine.pause_generation(mode="keep")
# update weights
await engine.resume_generation()
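
The effect of keep mode can be imitated with a toy asyncio engine. ToyEngine, its weights_version field, and the token loop are hypothetical; they only mirror the pause_generation and resume_generation names from the snippet above and do not reflect vLLM's scheduler. The sketch shows a request that is frozen during the pause, survives the weight update, and continues with the new weights.

```python
import asyncio


class ToyEngine:
    """Toy engine whose requests wait on a gate between tokens."""

    def __init__(self):
        self._running = asyncio.Event()
        self._running.set()
        self.weights_version = 0

    async def pause_generation(self, mode="keep"):
        if mode != "keep":
            raise NotImplementedError("only keep mode is sketched here")
        self._running.clear()  # in-flight requests stop at their next token

    async def resume_generation(self):
        self._running.set()

    async def generate(self, n_tokens, out):
        for _ in range(n_tokens):
            await self._running.wait()        # blocks while paused
            out.append(self.weights_version)  # "token" tagged with weight version
            await asyncio.sleep(0)            # yield to other tasks
        return out


async def demo():
    engine = ToyEngine()
    out = []
    task = asyncio.create_task(engine.generate(6, out))
    for _ in range(3):
        await asyncio.sleep(0)                # let the request emit a few tokens
    await engine.pause_generation(mode="keep")
    tokens_at_pause = len(out)
    for _ in range(3):
        await asyncio.sleep(0)                # the request stays frozen here
    tokens_while_paused = len(out)
    engine.weights_version = 1                # stand-in for the weight update
    await engine.resume_generation()
    await task                                # the same request finishes
    return out, tokens_at_pause, tokens_while_paused
```

Running asyncio.run(demo()) returns a single token list that begins under version 0 and ends under version 1, which is the property that makes asynchronous RL possible without client retries.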

In Data-Parallel + Expert-Parallel (DPEP) deployments, a deadlock occurred because pause logic was handled in the AsyncLLM layer while DP coordination messages (START_DP_WAVE) were exchanged in EngineCore/DPCoordinator. The fix moves pause handling into EngineCore and introduces a two-phase pause/resume protocol:

Phase 1 (local pause): Each engine pauses scheduling but continues to respond to START_DP_WAVE requests, avoiding blockage in collective operations.

Phase 2 (global pause): Every 32 steps, all ranks perform an all-reduce to check whether every engine has entered local pause; if so, they transition to a global pause.

This guarantees that no rank is stuck waiting and that START_DP_WAVE remains respected after a pause request.
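
The two phases can be simulated in a few lines. This is a sketch under stated assumptions: simulate_two_phase_pause is a hypothetical function, each rank is modeled only by the step at which it enters local pause, and the all-reduce is replaced by a plain all() over the ranks. The 32-step check interval is the only number taken from the text.

```python
def simulate_two_phase_pause(local_pause_steps, check_interval=32, max_steps=10_000):
    """Return the step at which every rank agrees on a global pause.

    local_pause_steps[i] is the step at which rank i enters local pause.
    """
    for step in range(1, max_steps + 1):
        # Phase 1: a rank counts as locally paused once its pause step is reached.
        # It keeps stepping, so it can still take part in collectives.
        locally_paused = [step >= s for s in local_pause_steps]
        # Phase 2: only on check steps do the ranks compare their states.
        if step % check_interval == 0 and all(locally_paused):
            return step
    raise RuntimeError("global pause not reached within max_steps")


# Three ranks receive the pause request at different steps.
global_step = simulate_two_phase_pause([5, 40, 33])
```

In the example the check at step 32 fails because two ranks are still running, so the global pause happens at step 64. No rank stops stepping before the others have caught up, which is what keeps the collective operations from blocking.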

Practical validation

SkyRL integration

SkyRL (Berkeley Sky Computing Lab) integrates the native RL API with a split data plane (/v1/completions via VLLMRouter) and control plane (/pause, /resume, weight-sync endpoints) that fans out to all replicas.

BroadcastTransferStrategy (non-co-located): Trainer rank 0 broadcasts tensors via NCCL and sends metadata via HTTP /update_weights, combined with /pause?mode=keep and /resume.

CudaIpcTransferStrategy (co-located): Trainer and inference share a GPU, exchanging weights via CUDA IPC handles, with /sleep and /wake_up for memory management.

SkyRL successfully ran Qwen3‑1.7B with DAPO asynchronous training on 4 training GPUs + 4 inference engines using the native API.
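
The control-plane fan-out for the non-co-located strategy can be sketched as a function that lists the HTTP calls for one weight sync. The function weight_sync_requests and the replica URLs are hypothetical, using POST for /update_weights is an assumption (the text only states POST for /pause and /resume), and the NCCL broadcast that runs alongside /update_weights is not shown.

```python
from urllib.parse import urlencode


def weight_sync_requests(replica_urls):
    """List the (method, url) control-plane calls for one weight sync.

    Each phase is sent to every replica before the next phase starts.
    """
    phases = [
        ("/pause", {"mode": "keep"}),  # freeze in-flight requests
        ("/update_weights", None),      # metadata for the incoming tensors
        ("/resume", None),              # continue generation with new weights
    ]
    requests = []
    for path, params in phases:
        for base in replica_urls:
            url = base.rstrip("/") + path
            if params:
                url += "?" + urlencode(params)
            requests.append(("POST", url))
    return requests


# Hypothetical replica addresses.
calls = weight_sync_requests(["http://replica-0:8000", "http://replica-1:8000/"])
```

Grouping the calls by phase rather than by replica reflects the fan-out described above: all replicas are paused before any of them receives new weights, and none resumes until the update has been sent everywhere.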

Prime‑RL large‑scale validation

Prime-RL evaluated the API over 100+ training steps on a 16-node, 8×H200 cluster with the following setup:

Inference side: zai-org/GLM-5.1-FP8 in a prefill/decode (P/D) disaggregated deployment, 2 replicas each with 4 P + 4 D, DPEP32, 1 TB of CPU KV-cache offloading, and vllm-router for cache-aware sticky routing.

Training side: zai-org/GLM-5.1 (BF16) on another 16-node cluster using the IcePop algorithm.

The run showed stable weight updates, upward-trending RL curves, and a stable KL mismatch, indicating that the four-stage protocol and the two-phase pause/resume work reliably at scale.

Roadmap highlights (Q2 2026)

K8s-Native Weight Transfer (WPI): PR #40828 integrates a Weight Propagation Interface that transfers trainer weights zero-copy, directly into inference GPU memory. Tests on Qwen2-7B (≈14.19 GB) achieved ~20.43 GB/s of bandwidth, for a total transfer time of ~694 ms.

RDMA-Aware Sharded Transfer: RFC #40822 proposes recording the full conversion graph (quantization, fusion, padding) so the trainer can write weights directly into the inference memory layout via RDMA WRITE. A proof of concept reduced the transfer time for an 800B-parameter model from 40-50 s to 1 s.

Other explored ideas include sparse weight updates, NCCL context offload/resume, and training‑inference log‑prob/logit consistency fixes.
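
The transfer figures quoted in the roadmap can be cross-checked with simple arithmetic: transfer time is size divided by bandwidth. The helper transfer_time_ms is introduced here only for the calculation, and all inputs are the numbers stated above.

```python
def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb at a sustained bandwidth, in milliseconds."""
    return size_gb / bandwidth_gb_per_s * 1000.0


# WPI test on Qwen2-7B: 14.19 GB at 20.43 GB/s.
wpi_ms = transfer_time_ms(14.19, 20.43)

# RDMA proof of concept: 40-50 s reduced to 1 s.
speedup_low, speedup_high = 40 / 1, 50 / 1
```

The first result comes out just under 695 ms, in line with the quoted ~694 ms total, and the proof-of-concept numbers correspond to a 40x to 50x reduction in transfer time.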

Current limitations

NCCL backend uses broadcast semantics, which can be sub‑optimal for bandwidth.

HTTP endpoints require dev mode (VLLM_SERVER_DEV_MODE=1) to be enabled.

Documentation for checkpoint‑to‑kernel conversion is incomplete.

RDMA‑based direct transfer remains experimental.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
