How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)

The article explains how DFlash’s block‑diffusion draft model and KV Injection boost speculative decoding speed by 5‑8× without sacrificing output quality, and how DDTree further raises the gain to over 8×, backed by benchmark results and integration guides for major inference frameworks.

Old Zhang's AI Learning

Background: Speculative Decoding

Large language models generate text one token at a time, and this sequential loop, rather than raw GPU compute, is usually the main latency bottleneck. Speculative decoding mitigates it by letting a smaller draft model quickly guess a sequence of tokens, which the large target model then verifies in a single forward pass. Every correct guess is a token the target did not have to generate in its own step; at the first incorrect guess, the target's own token replaces it and the remaining guesses are discarded.
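The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DFlash's implementation: the two "models" are hypothetical callables that return a greedy next token, and the names `speculative_step` and `generate` are invented for this example. A real engine scores all drafted positions in one batched forward pass; the loop below only mimics the outcome of that pass.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-ins: each "model" maps a token sequence to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: Sequence[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1) The draft model guesses k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(list(prefix) + draft))

    # 2) The target checks the guesses. A real engine does this in one
    #    batched forward pass; this loop only reproduces the result.
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3) Every guess matched, so the target's next token comes as a bonus.
    accepted.append(target_next(list(prefix) + accepted))
    return accepted


def generate(prefix: Sequence[int], draft_next: NextToken,
             target_next: NextToken, k: int, n_tokens: int) -> List[int]:
    """Repeat speculative rounds until n_tokens new tokens exist."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n_tokens]
```

Because every committed token is either confirmed or produced by the target, the output is identical to plain greedy decoding with the target alone; the draft only changes how many target passes are needed.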

DFlash – Replacing Autoregressive Drafts with Block Diffusion

DFlash (Block Diffusion for Flash Speculative Decoding) from Z Lab introduces a lightweight block diffusion model that generates an entire token block (block size = 16) in one forward pass, eliminating the “slow guessing” problem of traditional draft models.
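A back-of-the-envelope cost model shows why drafting a whole block in one pass matters. The sketch below is an assumption-laden simplification: it measures cost in units of one target forward pass, ignores any extra cost of verifying several tokens at once, and uses made-up numbers (a draft pass costing 5% of a target pass, 6 tokens committed per round). The function name `estimated_speedup` is hypothetical.

```python
def estimated_speedup(tokens_per_round: float, draft_passes: int,
                      draft_cost: float) -> float:
    """Tokens committed per round divided by the round's cost.

    Cost is in units of one target forward pass: one verification pass
    plus `draft_passes` draft passes that each cost `draft_cost`.
    """
    return tokens_per_round / (1.0 + draft_passes * draft_cost)


# Hypothetical numbers, for illustration only.
autoregressive = estimated_speedup(tokens_per_round=6, draft_passes=16, draft_cost=0.05)
block_draft = estimated_speedup(tokens_per_round=6, draft_passes=1, draft_cost=0.05)
print(f"autoregressive draft: {autoregressive:.2f}x, one-pass block draft: {block_draft:.2f}x")
```

With the same number of accepted tokens, the autoregressive draft pays for 16 sequential draft passes per round while the block draft pays for one, so the drafting overhead stops growing with block size.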

The key technique is KV Injection: hidden features from multiple layers of the target model are fused into the draft model’s KV cache, enabling high‑quality predictions from the draft.
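The article does not spell out how the fusion is done, so the following is only one plausible reading of the data flow, written in plain Python for clarity. It assumes that the per-token hidden vectors from several target layers are concatenated and passed through two linear projections to produce key and value entries, which are then appended to the draft model's cache. The concatenation, the projection matrices `w_k` and `w_v`, and the function names are all assumptions for illustration, not DFlash's actual design.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def inject_kv(layer_feats: List[List[Vector]], w_k: Matrix, w_v: Matrix,
              cache_k: List[Vector], cache_v: List[Vector]) -> None:
    """Append one key/value pair per token to the draft model's cache.

    layer_feats[l][t] is the target's hidden vector at tapped layer l, token t.
    ASSUMPTION: "fusing" means concatenating across layers, then projecting.
    """
    n_tokens = len(layer_feats[0])
    for t in range(n_tokens):
        fused = [x for layer in layer_feats for x in layer[t]]
        cache_k.append(matvec(w_k, fused))
        cache_v.append(matvec(w_v, fused))


# Toy usage: 2 tapped layers, 2 tokens, hidden size 2, projected to size 2.
feats = [[[1.0, 0.0], [0.0, 1.0]],
         [[2.0, 0.0], [0.0, 2.0]]]
w_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
w_v = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
keys: List[Vector] = []
values: List[Vector] = []
inject_kv(feats, w_k, w_v, keys, values)
```

Whatever the exact fusion, the point is that the draft's attention reads entries derived from the target's own internal features, so the draft conditions on what the target has already computed rather than only on token IDs.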

Benchmark results at temperature T = 0.0 (greedy decoding) show speedups of:

HumanEval: 6.09× (Qwen3‑30B‑MoE)

MATH‑500: 6.17× (Qwen3‑8B)

GSM8K: 5.20× (Qwen3‑8B)

AIME24: 5.91× (Qwen3‑8B)

MBPP: 4.75× (Qwen3‑8B)

Compared with the popular EAGLE‑3 approach (≈2‑3×), DFlash is about 2.5× faster. It also holds 5‑6× acceleration in sampling mode (temperature = 1), where many methods degrade.

DDTree – Extending DFlash with a Draft Tree

DDTree (Diffusion Draft Tree), built on DFlash by Liran Ringel, constructs a probability‑tree of multiple promising draft paths using a best‑first heap algorithm, then validates the entire tree in a single forward pass of the target model.

Four‑step DDTree workflow:

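1. Block diffusion generates probability distributions for L positions.

2. A best‑first heap builds an optimal draft tree under a node budget B.

3. Tree attention packs the tree into a single target‑model input, with a mask that lets each node attend only to its own ancestors.

4. Verification walks the tree from the root: while the target's token matches a child node, the walk continues; at the first mismatch, the target's own token is kept as a bonus token and the next round starts from there.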

The method has a mathematical guarantee that the constructed tree maximizes the expected accepted length under the draft model’s distribution.
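The tree-building step can be sketched with Python's `heapq`. This is a minimal illustration, not the DDTree code: it assumes each of the L positions comes with its own independent list of `(token, probability)` pairs sorted by descending probability, so a path's probability is the product of its per-position probabilities. The function name `build_draft_tree` and the data layout are assumptions made for this example.

```python
import heapq
from typing import List, Tuple

Dist = List[Tuple[str, float]]  # (token, prob), sorted by prob descending


def build_draft_tree(dists: List[Dist], budget: int) -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to `budget` tree nodes as (token_path, path_probability)."""
    tree: List[Tuple[Tuple[str, ...], float]] = []
    if not dists or not dists[0] or budget <= 0:
        return tree
    # Heap entry: (-path_prob, ranks, parent_prob); ranks[d] indexes dists[d].
    heap = [(-dists[0][0][1], (0,), 1.0)]
    while heap and len(tree) < budget:
        neg_p, ranks, parent_p = heapq.heappop(heap)
        p = -neg_p
        depth = len(ranks) - 1
        tree.append((tuple(dists[d][r][0] for d, r in enumerate(ranks)), p))
        # Next-best sibling: same parent, next token at this position.
        nxt = ranks[-1] + 1
        if nxt < len(dists[depth]):
            heapq.heappush(heap, (-parent_p * dists[depth][nxt][1],
                                  ranks[:-1] + (nxt,), parent_p))
        # Best child: extend this path with the top token of the next position.
        if depth + 1 < len(dists) and dists[depth + 1]:
            heapq.heappush(heap, (-p * dists[depth + 1][0][1], ranks + (0,), p))
    return tree


# Toy usage with two positions and a budget of three nodes.
toy = [[("a", 0.6), ("b", 0.4)], [("c", 0.7), ("d", 0.3)]]
nodes = build_draft_tree(toy, budget=3)
expected_accepted = sum(prob for _, prob in nodes)
```

Under the independence assumption, a node is accepted only if its whole path matches, so the expected number of accepted draft tokens is the sum of the node probabilities. A child is never more probable than its parent, so the B most probable nodes always form a valid tree, and popping nodes in best-first order selects exactly that set.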

On HumanEval (T = 0.0), DDTree lifts DFlash’s 6.09× speedup to 8.22×. That is 2.13 higher in absolute terms, or roughly a 1.35× improvement over DFlash alone (8.22 / 6.09). The method remains completely lossless: the output distribution matches that of unaccelerated decoding.
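The lossless property follows from how the tree is verified. The sketch below illustrates the greedy (T = 0.0) case only, and makes several assumptions: tree nodes are stored as token paths, `target_next` is a hypothetical callable standing in for the target's greedy prediction, and the function names are invented for this example. A real engine obtains the target's prediction at every node from one forward pass, using a mask like the one built by `ancestor_mask`; here the target is simply called once per step.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[int, ...]


def ancestor_mask(nodes: Sequence[Path]) -> List[List[bool]]:
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[qi[:len(kj)] == kj for kj in nodes] for qi in nodes]


def verify_tree(prefix: Sequence[int], nodes: Sequence[Path],
                target_next: Callable[[Sequence[int]], int]) -> List[int]:
    """Walk the draft tree and return the tokens committed this round."""
    node_set = set(nodes)
    path: Path = ()
    while True:
        tok = target_next(list(prefix) + list(path))
        candidate = path + (tok,)
        if candidate not in node_set:
            # No drafted node matches: keep the target's token as the bonus token.
            return list(path) + [tok]
        path = candidate


# Toy usage: a three-node tree and a target that emits 1, then 3, then 9.
toy_nodes = [(1,), (1, 3), (2,)]
committed = verify_tree([], toy_nodes, lambda seq: {0: 1, 1: 3, 2: 9}[len(seq)])
mask = ancestor_mask(toy_nodes)
```

Every committed token is the target's own prediction given the tokens before it, so the result is the same sequence the target would have produced alone. A larger tree raises the chance that the walk goes deeper before it falls off the tree.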

Supported Models and Integration

DFlash draft models are available for several mainstream LLMs, including Kimi‑K2.5, Qwen3.5‑4B/9B/27B, Qwen3.5‑35B‑A3B, Qwen3‑Coder‑30B‑A3B, and LLaMA‑3.1‑8B‑Instruct. Drafts for larger models such as Qwen3.5‑122B, 397B, and GLM‑5.1 are in progress.

Integration commands:

SGLang:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-35B-A3B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
    --tp-size 1 --attention-backend trtllm_mha

vLLM:

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}'

MLX on Apple Silicon (editable install with the mlx extra, run from the root of the source checkout):

pip install -e ".[mlx]"

The DDTree benchmark can be run with:

git clone https://github.com/liranringel/ddtree
cd ddtree
pip install -r requirements.txt
bash run_benchmark.sh
python3 plot_results.py

Conclusion

The DFlash + DDTree combination represents the next stage of speculative decoding, delivering over 8× lossless acceleration for large‑model inference. It is already usable with SGLang, vLLM, and MLX on Apple Silicon, which makes it close to a “free lunch” for deployment teams.

Figure: DFlash + DDTree acceleration pipeline
Figure: DDTree method pipeline
