AnyFlow: Generate High‑Quality Video in 4 Steps with Unlimited Sampling Improvement
AnyFlow is a flow‑map distillation framework that lets video diffusion models produce high‑quality results in just four steps and keep improving as more sampling steps are added. It supports both causal and bidirectional architectures at up to 14 B parameters, and the distilled models can still be fine‑tuned on downstream data.
1. Background: Fast and Scalable Video Generation
Video diffusion models can generate high‑quality clips but typically require many sampling steps, which makes inference expensive. Existing few‑step methods use consistency distillation to reach four‑step generation, but they are optimized for a fixed step count: adding steps does not guarantee better quality and can even degrade it. As a result, users cannot easily switch between quick previews and high‑quality final outputs with the same model.
AnyFlow starts from one question: can a single model generate good results in four steps and keep improving when run for 16, 32, or more steps?
2. Method: Core Idea, Forward Training, and Backward Trajectory Decomposition
Core Idea: From Endpoint Mapping to Arbitrary‑Time Transitions
Traditional consistency distillation learns a direct mapping from an intermediate latent z_t to the final latent z_0. This works for few‑step generation but, when applied to a flow‑matching pretrained model, it alters the original sampling trajectory and weakens multi‑step scalability, as shown by the performance drop of rCM and Self‑Forcing when more steps are used.
AnyFlow replaces endpoint‑only learning with Flow Map Distillation: the model learns to map between any two time points, i.e., from z_t to z_r. It can therefore take large jumps in few‑step mode and make fine‑grained refinements when more steps are allocated, so training optimizes the entire sampling trajectory rather than a single step count.
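To make the difference between an endpoint mapping and a two‑time flow map concrete, here is a minimal Python sketch. It is not AnyFlow's implementation: the names `sample`, `toy_flow_map`, and `endpoint_map`, the uniform time grid, and the one‑dimensional toy ODE with a single target value are all assumptions chosen so that the flow map has a closed form.

```python
from typing import Callable

# A flow map takes the current state, its time t, and a target time r,
# and returns the state at time r: (z_t, t, r) -> z_r.
FlowMap = Callable[[float, float, float], float]


def toy_flow_map(z_t: float, t: float, r: float, target: float = 2.0) -> float:
    """Closed-form flow map of the toy ODE dz/dt = (z - target) / t.

    Trajectories are straight lines from the noise sample at t=1
    to `target` at t=0 (hypothetical stand-in for a learned network).
    """
    return target + (r / t) * (z_t - target)


def endpoint_map(z_t: float, t: float) -> float:
    """Consistency-style mapping: the special case r = 0 of the flow map."""
    return toy_flow_map(z_t, t, 0.0)


def sample(flow_map: FlowMap, z_start: float, num_steps: int) -> float:
    """Any-step sampling from t=1 (noise) to t=0 (data) on a uniform grid."""
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)  # one function evaluation per transition
    return z


# The same function serves a quick preview and a longer run.
preview = sample(toy_flow_map, z_start=-0.7, num_steps=4)
refined = sample(toy_flow_map, z_start=-0.7, num_steps=32)
```

Because the toy flow map is exact, both calls land on the same target. With a learned model the two results would differ, and the claim in the text is that the longer run should be the better one.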
Forward Training: Providing Any‑Step Initialization
AnyFlow first performs forward flow‑map training, converting a pretrained video diffusion model into a flow‑map model that learns transitions between arbitrary time pairs. This supplies a stable initialization for any‑step sampling.
However, the paper notes that forward training alone cannot fully close the train‑test gap. During inference the model rolls out its own generated states, while forward training only learns local mappings on the teacher’s trajectory, leading to discretization error in few‑step sampling and exposure bias in causal generation.
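The discretization error mentioned above can be illustrated with a generic numerical example. The sketch below is only an analogy under stated assumptions: it integrates the toy ODE dz/dt = -z with fixed‑step Euler and compares the result with the closed‑form solution. The ODE, the function name `euler_solve`, and the step counts are illustrative and unrelated to AnyFlow's actual models.

```python
import math


def euler_solve(z0: float, num_steps: int) -> float:
    """Integrate dz/dt = -z from t=0 to t=1 with fixed-step Euler."""
    z = z0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        z = z + dt * (-z)  # local linear step using the current state only
    return z


exact = math.exp(-1.0)  # closed-form solution at t=1 for z0 = 1

# Absolute error of the Euler result for coarse and fine grids.
errors = {n: abs(euler_solve(1.0, n) - exact) for n in (4, 16, 32)}

for n, err in errors.items():
    print(f"{n:>2} steps: error = {err:.4f}")
```

The error shrinks as the grid gets finer. A model that only learned accurate local updates would show the same pattern, which is why the few‑step regime needs a separate correction.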
Therefore, AnyFlow adds On‑Policy Distillation (OPD) to correct the model on its own rollout trajectory.
Backward Trajectory Decomposition: Flow‑Map Backward Simulation
During OPD the model must generate its own sampling states, but full rollout is computationally expensive. AnyFlow leverages the compositional property of flow maps to decompose a long Euler trajectory into shortcut transitions, e.g., z_T → z_t → z_r → z_0. This yields two benefits: (1) test‑time inference can reuse the original Euler trajectory without extra consistency sampling, and (2) the decomposition adapts to different step sizes, reducing the cost of multi‑step rollout.
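The compositional property can be checked on a toy problem. The following sketch assumes a one‑dimensional ODE dz/dt = z whose flow map is known in closed form. The names `velocity`, `flow_map`, `euler_rollout`, and `shortcut_rollout`, as well as the intermediate times 0.6 and 0.25, are hypothetical and chosen for illustration only.

```python
import math


def velocity(z: float, t: float) -> float:
    """Toy velocity field (an assumption for illustration): dz/dt = z."""
    return z


def flow_map(z_t: float, t: float, r: float) -> float:
    """Closed-form transition of the toy ODE from time t to time r."""
    return z_t * math.exp(r - t)


def euler_rollout(z_start: float, times: list) -> float:
    """Long trajectory: one small Euler step per adjacent pair of times."""
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = z + (r - t) * velocity(z, t)
    return z


def shortcut_rollout(z_start: float, times: list) -> float:
    """Same trajectory covered with a few flow-map jumps."""
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)
    return z


z_T = 1.5
fine_grid = [1.0 - i / 100 for i in range(101)]  # 100 Euler steps from t=1 to t=0
coarse_grid = [1.0, 0.6, 0.25, 0.0]              # z_T -> z_t -> z_r -> z_0

z0_euler = euler_rollout(z_T, fine_grid)
z0_shortcut = shortcut_rollout(z_T, coarse_grid)
z0_direct = flow_map(z_T, 1.0, 0.0)              # a single jump
```

In this toy setting, three flow‑map jumps agree with the single direct jump up to floating‑point error and stay close to the 100‑step Euler result. That is the kind of saving the decomposition is meant to provide when intermediate rollout states are needed.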
3. Experiments: Scaling from 1.3 B to 14 B Parameters
The authors evaluate AnyFlow on both bidirectional and causal video diffusion backbones at sizes from 1.3 B to 14 B parameters, showing that the approach scales to large models.
Causal Generation: AnyFlow‑FAR‑Wan2.1‑14B
Combined with the FAR backbone, AnyFlow‑FAR generates high‑quality text‑to‑video (T2V) results with only 4 NFEs (function evaluations), and quality continues to rise as more sampling steps are added. Visual comparisons show better motion stability, subject clarity, and detail consistency than baseline few‑step methods, especially in challenging scenes such as vehicle motion and running.
For image‑to‑video (I2V), AnyFlow‑FAR‑Wan2.1‑14B achieves a VBench‑I2V score of 87.87 with 4 NFEs, comparable to the 14B baseline that uses 50 × 2 NFEs. This indicates strong first‑frame consistency and overall video quality even under extreme step reduction.
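The compute gap behind these numbers is simple arithmetic, sketched below. Reading "50 × 2" as 50 sampling steps with two network evaluations per step (for example, a conditional and an unconditional pass under classifier‑free guidance) is an assumption, not something the text states. The helper `total_nfe` is likewise hypothetical.

```python
def total_nfe(sampling_steps: int, evals_per_step: int = 1) -> int:
    """Total number of network evaluations for one generated video."""
    return sampling_steps * evals_per_step


baseline_nfe = total_nfe(50, 2)  # the "50 x 2" baseline figure -> 100
anyflow_nfe = total_nfe(4)       # the 4-NFE setting -> 4

reduction = baseline_nfe / anyflow_nfe
print(f"{baseline_nfe} vs {anyflow_nfe} NFEs: {reduction:.0f}x fewer evaluations")
# prints "100 vs 4 NFEs: 25x fewer evaluations"
```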
Bidirectional Generation: AnyFlow‑Wan2.1‑T2V‑14B
Applied to the bidirectional Wan2.1‑T2V backbone, AnyFlow‑Wan2.1‑T2V‑14B maintains high visual quality and natural motion under few‑step sampling and outperforms the rCM baseline in visual detail stability.
4. Fine‑Tuning on Downstream Data
Because the flow map retains multi‑granularity flow fields, the distilled AnyFlow model can be further fine‑tuned on domain‑specific video datasets while preserving its few‑step capability. This is valuable for vertical applications such as robotics, autonomous driving, or game scenes where identity, trajectory, or style consistency must be maintained.
Summary
AnyFlow proposes a new distillation framework for “any‑step” video generation. By learning full‑trajectory flow maps and correcting rollout errors through on‑policy flow‑map distillation, it achieves fast four‑step generation that continues to improve with more steps, works for both causal and bidirectional diffusion backbones, and scales up to 14 B parameters.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
