How FlashAR Achieves 22.9× Speedup with Only 0.05% of Training Data
FlashAR transforms pretrained autoregressive image models into highly parallel generators, delivering up to 22.9× end-to-end speedup while using just 0.05% of the original training data and preserving generation quality, thanks to intermediate branching, a learnable fusion gate, and a two-stage adaptation process.
Recent advances in next‑token prediction have extended large‑language‑model techniques to images, leading to autoregressive (AR) image generators such as PixelCNN, iGPT, Parti, Emu3.5, LlamaGen, Lumina‑mGPT, and GLM‑Image whose quality rivals diffusion models.
AR models suffer from a fundamental latency bottleneck: standard raster‑scan decoding proceeds left‑to‑right, top‑to‑bottom, emitting one token per step. With a 16× down‑sampling tokenizer, a 512×512 image becomes a 32×32 grid of 1,024 tokens, so generating it requires 1,024 sequential forward passes and takes more than two minutes on a single GPU. Latency grows linearly with the number of tokens, and therefore with the pixel count of the image.
Existing acceleration attempts fall into three categories. Redesigning the generation paradigm (e.g., VAR’s next‑scale prediction, NAR’s neighbor prediction, PAR’s grouped decoding) cuts the number of steps but demands training from scratch, which is costly. Discrete‑diffusion adaptation (e.g., DiDA used by Emu3.5) changes the original prediction target, creating a train‑inference mismatch that degrades quality (the authors' reproduction shows a noticeable GenEval drop). Speculative decoding adds a plug‑in without extra training, yet its speed gains are limited by the acceptance rate of the draft model.
The open question is whether a pretrained AR model can be turned into a highly parallel generator without retraining from scratch or altering its original objective.
Researchers from Zhejiang University and the University of Adelaide propose FlashAR, a lightweight post‑training acceleration framework. Using only 0.05% of the original training data (≈80 k images), FlashAR converts a pretrained AR model into a parallel generator (Emu3.5‑34B → Emu3.5‑34B‑Flash) and achieves a maximum of 22.9× end‑to‑end speedup.
The core insight is that images have an inherent 2‑D structure. By adding a vertical‑direction prediction head, horizontal and vertical heads can operate in parallel, reducing decoding steps from H×W to H+W‑1, where H and W are the height and width of the token grid. For a 512×512 image with a 16× down‑sampling factor (H = W = 32), the steps drop from 1,024 to 63.
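The step count can be checked with a short sketch. It assumes an anti‑diagonal (wavefront) schedule in which every position with the same row + col index is decoded in one step. That schedule is consistent with the H+W‑1 figure, but the exact decoding order used by FlashAR is not spelled out here, so the function names and the schedule itself are illustrative.

```python
def grid_size(image_side: int, downsample: int) -> int:
    """Token-grid side length for a square image and a tokenizer stride."""
    return image_side // downsample


def wavefront_schedule(height: int, width: int) -> list[list[tuple[int, int]]]:
    """Group grid positions (row, col) by anti-diagonal index row + col.

    Assumption (not confirmed by the source): all positions on one
    anti-diagonal are decoded in the same step, because the left neighbour
    (row, col - 1) and the upper neighbour (row - 1, col) of each position
    both lie on the previous anti-diagonal.
    """
    steps: list[list[tuple[int, int]]] = [[] for _ in range(height + width - 1)]
    for row in range(height):
        for col in range(width):
            steps[row + col].append((row, col))
    return steps


side = grid_size(512, 16)                 # 32 tokens per side
schedule = wavefront_schedule(side, side)

raster_steps = side * side                # one token per step
parallel_steps = len(schedule)            # one anti-diagonal per step
widest_step = max(len(step) for step in schedule)

print(raster_steps, parallel_steps, widest_step)  # 1024 63 32
```

Under this assumed schedule, the longest step emits 32 tokens at once, so the amount of parallel work per step varies along the diagonal sweep even though the number of steps is fixed at H+W‑1.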
FlashAR consists of three key components:
Intermediate Branching: a vertical head is branched from an intermediate layer rather than the final layer, because intermediate features retain richer spatial information. Linear‑probing experiments confirm that final‑layer features are less suitable for vertical prediction.
Learnable Fusion Gate: a lightweight MLP fuses horizontal and vertical predictions at each spatial location, avoiding the blur caused by simple averaging.
Two‑Stage Adaptation: stage 1 freezes the backbone and trains only the vertical head; stage 2 jointly fine‑tunes both backbone and vertical head, improving stability and data efficiency.
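To make the fusion‑gate idea concrete, here is a minimal pure‑Python sketch of gated fusion at one spatial location. Everything specific in it is an assumption for illustration: the gate input (the two logit vectors concatenated), the two‑layer shape, and the hand‑picked weights, which stand in for learned parameters. It is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_mlp(features, w1, b1, w2, b2):
    """Tiny two-layer MLP (ReLU hidden layer) returning a scalar gate in (0, 1)."""
    hidden = [
        max(0.0, sum(w * f for w, f in zip(row, features)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


def fuse(h_logits, v_logits, gate):
    """Convex combination: gate weights the horizontal head, 1 - gate the vertical."""
    return [gate * h + (1.0 - gate) * v for h, v in zip(h_logits, v_logits)]


# Two heads that disagree: horizontal prefers token 0, vertical prefers token 2.
h_logits = [4.0, 0.0, 0.0]
v_logits = [0.0, 0.0, 4.0]

# Hypothetical weights, chosen by hand so the gate leans towards the horizontal head.
w1 = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
b1 = [0.0, 0.0]
w2 = [1.0, -0.5]
b2 = 0.0

gate = gate_mlp(h_logits + v_logits, w1, b1, w2, b2)
averaged_probs = softmax(fuse(h_logits, v_logits, 0.5))   # simple averaging
gated_probs = softmax(fuse(h_logits, v_logits, gate))     # gated fusion

print(round(max(averaged_probs), 3), round(max(gated_probs), 3))
```

In this toy case, averaging splits the probability mass almost evenly between the two conflicting tokens, while the gated combination commits to one of them. That is one plausible reading of the "blur" that simple averaging causes; in a trained model the gate would decide per location which head to trust.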
During inference FlashAR deploys a hardware‑aware pipeline: FlexAttention dynamically compiles a sparse 2‑D neighbor‑attention mask and batches KV‑cache updates, turning theoretical parallelism into real‑world acceleration.
Experimental results on Emu3.5‑Image‑34B show that with 0.05% of the data, the generation time for a 512×512 image drops from 130.10 s to 5.68 s (22.9× speedup). The GenEval score declines by only 0.19 points (80.48 → 80.29), and the color (+1.59) and position (+7.00) sub‑scores even improve, whereas BlockDiffusion under the same setting falls to 73.83.
On ImageNet‑256 conditional generation, FlashAR outperforms BlockDiffusion across the B/L/XL/XXL model scales. Additional observations:
FlashAR‑L achieves an IS of 289.0, surpassing the from‑scratch NAR‑L (263.9).
FlashAR‑B reaches 447.2 img/s throughput, exceeding NAR‑B (419.7 img/s).
Training requires only 25 epochs, one‑third of BlockDiffusion’s data budget.
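The headline Emu3.5 numbers can be cross‑checked with a few lines of arithmetic. The per‑step figures below are a back‑of‑envelope estimate that assumes the 5.68 s run uses the 63‑step schedule and the 130.10 s baseline uses 1,024 raster steps; the article does not break the timing down this way.

```python
baseline_seconds = 130.10     # raster-scan decoding, 512x512 image
flash_seconds = 5.68          # FlashAR decoding, same image size
geneval_before, geneval_after = 80.48, 80.29

speedup = baseline_seconds / flash_seconds
geneval_drop = geneval_before - geneval_after

# Assumed step counts (1,024 raster steps versus 63 parallel steps).
raster_steps, parallel_steps = 1024, 63
step_reduction = raster_steps / parallel_steps
baseline_per_step = baseline_seconds / raster_steps
flash_per_step = flash_seconds / parallel_steps

print(f"end-to-end speedup: {speedup:.1f}x")          # 22.9x
print(f"GenEval drop:       {geneval_drop:.2f}")      # 0.19
print(f"step reduction:     {step_reduction:.2f}x")   # 16.25x
print(f"seconds per step:   {baseline_per_step:.3f} vs {flash_per_step:.3f}")
```

Under these assumptions, the reduction in step count accounts for roughly 16× of the 22.9× speedup, so the remainder would have to come from a lower cost per step, for example from the inference pipeline.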
The authors summarize FlashAR’s advantages: (1) no training from scratch, since existing pretrained AR models are reused; (2) extreme data efficiency, with only 0.05% of the original data; (3) quality preservation, with generation metrics on par or improved; (4) strong framework generality, validated on LlamaGen (120 M‑1.4 B) and Emu3.5 (34 B); (5) substantial real‑world speedup, up to 22.9×.
FlashAR demonstrates that carefully designed post‑training adaptation can convert autoregressive models into highly parallel generators while keeping the original training objective essentially unchanged.
Paper title: FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
Paper URL: https://arxiv.org/abs/2605.09430
Code repository: https://github.com/lxazjk/Emu3.5-FlashAR
