How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)
The article explains how DFlash’s block‑diffusion draft model and KV Injection boost speculative decoding speed by 5‑8× without sacrificing output quality, and how DDTree further raises the gain to over 8×, backed by benchmark results and integration guides for major inference frameworks.
Background: Speculative Decoding
Large language models generate text one token at a time, and this sequential dependency is the primary latency bottleneck regardless of GPU power. Speculative decoding mitigates it by letting a smaller draft model quickly guess a sequence of tokens, which the large target model then verifies in a single forward pass. Correct guesses are accepted in bulk and speed up inference, while the first incorrect guess is replaced by the target model's own token.
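The draft-then-verify loop can be illustrated with a minimal sketch for greedy decoding. Everything here is hypothetical and for illustration only: a "model" is just a Python function from a token prefix to the next token, and the names `speculative_step`, `generate`, and `plain_greedy` are not taken from DFlash or any framework. A real system verifies all guesses in one batched forward pass of the target, whereas this toy calls the target once per position.

```python
from typing import Callable, List

# ASSUMPTION: a "model" is any function mapping a token prefix to its
# greedy next token. Real models return logits; this is a toy stand-in.
NextToken = Callable[[List[int]], int]

def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1) Draft: the cheap model guesses k tokens.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(prefix + guesses))

    # 2) Verify: compare each guess with the target's own choice.
    committed: List[int] = []
    for guess in guesses:
        expected = target(prefix + committed)
        if guess != expected:
            committed.append(expected)  # replace the first wrong guess, stop
            return committed
        committed.append(guess)
    # All guesses accepted: the target still contributes one extra token.
    committed.append(target(prefix + committed))
    return committed

def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]

def plain_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out

# Toy models: the target counts modulo 10; the draft errs after token 3.
toy_target: NextToken = lambda p: (p[-1] + 1) % 10
toy_draft: NextToken = lambda p: 0 if p[-1] == 3 else (p[-1] + 1) % 10
```

Because every committed token is either a guess the target agrees with or the target's own correction, the output is identical to plain greedy decoding of the target. Only the number of target rounds changes.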
DFlash – Replacing Autoregressive Drafts with Block Diffusion
DFlash (Block Diffusion for Flash Speculative Decoding) from Z Lab introduces a lightweight block diffusion model that generates an entire token block (block size = 16) in one forward pass, eliminating the “slow guessing” problem of traditional draft models.
The key technique is KV Injection: hidden features from multiple layers of the target model are fused into the draft model’s KV cache, enabling high‑quality predictions from the draft.
Benchmark results (T = 0.0) show speedups of:
HumanEval: 6.09× (Qwen3‑30B‑MoE)
MATH‑500: 6.17× (Qwen3‑8B)
GSM8K: 5.20× (Qwen3‑8B)
AIME24: 5.91× (Qwen3‑8B)
MBPP: 4.75× (Qwen3‑8B)
Compared with the popular EAGLE‑3 approach (≈2‑3×), DFlash is about 2.5× faster, reaching 5‑6× acceleration even in sampling mode (Temperature = 1) where many methods degrade.
DDTree – Extending DFlash with a Draft Tree
DDTree (Diffusion Draft Tree), built on DFlash by Liran Ringel, constructs a probability‑tree of multiple promising draft paths using a best‑first heap algorithm, then validates the entire tree in a single forward pass of the target model.
Four‑step DDTree workflow:
Block diffusion generates probability distributions for L positions.
Best‑first heap builds an optimal draft tree under a node budget B.
Tree attention compiles the tree into the target model’s input.
Verification traverses the tree: matching nodes continue, mismatches trigger a bonus token for the next round.
The method has a mathematical guarantee that the constructed tree maximizes the expected accepted length under the draft model’s distribution.
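Steps 2 and 3 of the workflow can be sketched as follows. This is a simplified illustration rather than DDTree's actual code. It assumes the block positions are scored independently, so a path's probability is the product of its per-position probabilities, and it uses the hypothetical names `build_draft_tree` and `tree_attention_mask`. Under that assumption, keeping the B most probable paths maximizes the sum of path probabilities, which is the expected accepted length if the target behaved like the draft.

```python
import heapq
from typing import List, Tuple

Path = Tuple[int, ...]

def build_draft_tree(probs: List[List[float]], budget: int) -> List[Tuple[float, Path]]:
    """Return up to `budget` (path_prob, token_path) nodes, most probable first.

    probs[d][t] is the draft probability of token t at block position d.
    ASSUMPTION: positions are independent, so path_prob is a product.
    """
    # Rank the tokens at every block position once, most probable first.
    ranked = [sorted(range(len(p)), key=lambda t: -p[t]) for p in probs]
    # Heap entry: (-path_prob, ranks, parent_prob); ranks[d] indexes ranked[d].
    heap = [(-probs[0][ranked[0][0]], (0,), 1.0)]
    tree: List[Tuple[float, Path]] = []
    while heap and len(tree) < budget:
        neg_p, ranks, parent_p = heapq.heappop(heap)
        tree.append((-neg_p, tuple(ranked[d][r] for d, r in enumerate(ranks))))
        d, r = len(ranks) - 1, ranks[-1]
        # Candidate 1: the next-best sibling under the same parent.
        if r + 1 < len(ranked[d]):
            sibling = ranked[d][r + 1]
            heapq.heappush(
                heap, (-parent_p * probs[d][sibling], ranks[:-1] + (r + 1,), parent_p))
        # Candidate 2: the best child, extending the path by one position.
        if d + 1 < len(probs):
            child = ranked[d + 1][0]
            heapq.heappush(heap, (neg_p * probs[d + 1][child], ranks + (0,), -neg_p))
    return tree

def tree_attention_mask(paths: List[Path]) -> List[List[bool]]:
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[pi[:len(pj)] == pj for pj in paths] for pi in paths]

# Usage: two block positions, vocabulary sizes 3 and 2, node budget 4.
example_probs = [[0.2, 0.7, 0.1], [0.4, 0.6]]
example_tree = build_draft_tree(example_probs, budget=4)
example_paths = [path for _, path in example_tree]
example_mask = tree_attention_mask(example_paths)
```

A sibling or child is never more probable than the node that pushed it onto the heap, so nodes are popped in non-increasing probability order and every selected node's parent is selected before it. The mask lets the target score all nodes in one pass while each node attends only to its own path.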
On HumanEval (T = 0.0), DDTree lifts DFlash’s 6.09× speedup to 8.22×, a gain of 2.13 points in the speedup factor (roughly 1.35× on top of DFlash alone), while remaining completely lossless: the output distribution matches that of unaccelerated decoding.
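The verification walk from step 4 can be sketched for the greedy case (T = 0). The function name `verify_tree` and the `target_next` callback are hypothetical. In a real system the target's choice at every tree node comes from the single tree-attention forward pass, not from separate calls.

```python
from typing import Callable, List, Set, Tuple

Path = Tuple[int, ...]

def verify_tree(tree: Set[Path], target_next: Callable[[Path], int]) -> List[int]:
    """Walk the draft tree along the target's greedy choices.

    tree holds every drafted token path; target_next(path) is the token the
    target would emit after the committed prefix followed by `path`.
    ASSUMPTION: greedy decoding; sampling needs a probabilistic accept rule.
    """
    path: Path = ()
    while True:
        token = target_next(path)
        if path + (token,) not in tree:
            # No drafted child matches: emit the target's token as the
            # bonus token and start the next round from there.
            return list(path) + [token]
        path += (token,)  # a drafted node matches, keep descending

# Usage: the target wants 1, 0, 2; the tree covers the first two tokens.
demo_tree: Set[Path] = {(1,), (1, 1), (1, 0), (0,)}
demo_target = lambda path: [1, 0, 2][len(path)]
accepted = verify_tree(demo_tree, demo_target)
```

Every emitted token is the target's own greedy choice for its prefix, which is why the greedy output cannot differ from unaccelerated decoding. The tree only determines how many of those tokens are confirmed per round.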
Supported Models and Integration
DFlash draft models are available for several mainstream LLMs, including Kimi‑K2.5, Qwen3.5‑4B/9B/27B, Qwen3.5‑35B‑A3B, Qwen3‑Coder‑30B‑A3B, and LLaMA‑3.1‑8B‑Instruct. Drafts for larger models such as Qwen3.5‑122B, 397B, and GLM‑5.1 are in progress.
Integration commands:
SGLang:
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --tp-size 1 --attention-backend trtllm_mha
vLLM:
vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}'
MLX (Apple Silicon):
pip install -e ".[mlx]"
DDTree benchmark can be run with:
git clone https://github.com/liranringel/ddtree
cd ddtree
pip install -r requirements.txt
bash run_benchmark.sh
python3 plot_results.py
Conclusion
The DFlash + DDTree combination represents the next stage of speculative decoding. It delivers over 8× lossless acceleration for large‑model inference and is already usable with SGLang, vLLM, and MLX on Apple Silicon, effectively offering a “free lunch” for deployment teams.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
