FlowDCN: Efficient Arbitrary-Resolution Image Generation via Groupwise Multi‑Scale Deformable Convolution
FlowDCN introduces Groupwise‑MSDCN, a sparse deformable convolution that replaces attention, enabling efficient arbitrary‑resolution image generation with linear complexity, fewer parameters and FLOPs, and achieving state‑of‑the‑art FID scores on ImageNet while requiring far fewer training steps.
1. Background
In e‑commerce creative image generation, the ability of text‑to‑image models to produce outputs at arbitrary resolutions is crucial for downstream tasks such as background synthesis, controllable generation, and model scaling. Existing diffusion models based on UNet or Transformer are computationally heavy, converge slowly, and struggle with arbitrary‑size inference, often yielding global semantic inconsistencies. A lightweight, efficient, and flexible backbone is therefore an important research direction.
This work, in collaboration with Prof. Li‑min Wang’s group at Nanjing University, introduces a sparse‑computing deformable convolution variant called Groupwise‑Multi‑Scale Deformable Convolution (Groupwise‑MSDCN). Compared with quadratic‑complexity attention, the deformable convolution has linear complexity, higher efficiency, and stronger dynamic modeling capability. By stacking Groupwise‑MSDCN blocks, we build FlowDCN, a model capable of generating images at any resolution with fewer parameters and FLOPs than mainstream Transformer‑based generators.
2. Core Concepts
2.1 Characteristics of DCN (Deformable Convolution Network)
Sparse and efficient: Unlike attention whose cost grows quadratically with token count, DCN’s sparse sampling yields lower latency on high‑resolution inputs.
Native arbitrary‑resolution inference: DCN can adapt to varying input sizes without extra training.
Strong spatial understanding: Dynamic weights enable adaptive sampling, offering superior spatial representation.
2.2 DCN Architecture Evolution
DCNv2 uses only a deformable field for adaptive sampling. DCNv3 adds dynamic aggregation weights, and DCNv4 further optimizes the back‑propagation operators, accelerating training.
Both DCNv3 and DCNv4 predict a Deformable Field and a Dynamic Field via linear layers (weights W, bias b). The Deformable Field is added to each feature position and the base convolution offsets to determine the sampling locations; the sampled values are then aggregated with the Dynamic Field's weights, yielding a sparser alternative to attention.
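The sampling-and-aggregation step above can be sketched for a single output point. This is a minimal illustrative version, not the paper's CUDA operator: it bilinearly samples one feature map at deformed locations (reference point + base kernel offset + predicted offset) and aggregates with dynamic weights.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a (H, W) feature map at continuous coords, zero-padded."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    out = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            if 0 <= yi < H and 0 <= xi < W:
                # weight each corner by its linear proximity to (y, x)
                out += (1 - abs(y - yi)) * (1 - abs(x - xi)) * feat[yi, xi]
    return out

def dcn_point(feat, p, base_offsets, deform_field, dyn_weights):
    """One DCN output: dynamically weighted sum of K deformed samples around p."""
    out = 0.0
    for k in range(len(base_offsets)):
        y = p[0] + base_offsets[k][0] + deform_field[k][0]
        x = p[1] + base_offsets[k][1] + deform_field[k][1]
        out += dyn_weights[k] * bilinear_sample(feat, y, x)
    return out
```

With zero predicted offsets and uniform dynamic weights, this degenerates to an ordinary average-pooling convolution, which makes the role of each predicted field easy to see.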
2.3 Linear‑based Flow Matching
Linear‑based flow matching (rectified flow) linearly mixes Gaussian noise ε with a clean sample x_0 to obtain the noisy input x_t = (1 − t)·x_0 + t·ε. The network predicts the velocity field (for the linear path, the target velocity is ε − x_0), and sampling can be performed with Euler or Heun solvers. This work uses the linear flow variant for fair comparison.
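The linear interpolation, its velocity target, and Euler sampling can be written out in a few lines. This is a generic rectified-flow sketch (the network is stubbed by a velocity function), not FlowDCN's training code:

```python
import numpy as np

def interpolate(x0, noise, t):
    """Linear (rectified-flow) path: x_t = (1 - t) * x0 + t * noise."""
    return (1 - t) * x0 + t * noise

def velocity_target(x0, noise):
    """Ground-truth velocity along the linear path: d x_t / d t = noise - x0."""
    return noise - x0

def euler_sample(x1, velocity_fn, steps=10):
    """Integrate from t=1 (pure noise) back to t=0 (data) with Euler steps."""
    x, dt = x1.copy(), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x
```

Because the linear path has a constant velocity, an oracle velocity function recovers the clean sample exactly even with very few Euler steps; in practice the learned velocity only approximates this, which is why few-step quality is a meaningful benchmark.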
3. Methodology
3.1 Groupwise Multi‑Scale Deformable Convolution (Groupwise‑MSDCN)
The Deformable Field is decoupled into scale and direction components, each predicted by separate linear heads. This enables each group to have its own scale prior, achieving inter‑group multi‑scale behavior.
Multi‑scale aggregation, a common technique in CV, significantly boosts performance. Unlike DCNv3/4, which share a single dilation across groups, Groupwise‑MSDCN provides distinct scales per group.
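The scale–direction decoupling can be sketched as follows. Names and shapes here are illustrative assumptions, not the paper's exact parameterization: each group carries a fixed multi-scale prior (analogous to a per-group dilation), and the predicted per-group scale modulates predicted direction vectors to form the final offsets.

```python
import numpy as np

def groupwise_offsets(directions, scales, group_priors):
    """Compose sampling offsets per group as (prior * predicted_scale) * direction.

    directions:   (G, K, 2) predicted direction vectors for K sampling points
    scales:       (G, 1, 1) predicted per-group scale factors
    group_priors: (G, 1, 1) fixed multi-scale priors, one scale per group
    """
    return group_priors * scales * directions

# Four groups with doubling scale priors: each group samples at a
# different effective receptive-field size from the same directions.
G, K = 4, 9
rng = np.random.default_rng(0)
directions = rng.standard_normal((G, K, 2))
scales = np.ones((G, 1, 1))
priors = np.array([1.0, 2.0, 4.0, 8.0]).reshape(G, 1, 1)
offsets = groupwise_offsets(directions, scales, priors)
```

The contrast with DCNv3/v4 is the `priors` tensor: there, a single dilation is shared across all groups, whereas here each group gets its own scale, giving multi-scale aggregation within one layer.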
3.2 Overall FlowDCN Architecture
Inspired by DiT, we replace the attention block with MSDCN and adopt RMSNorm and SwiGLU from Llama. Ablation shows that even without RMSNorm and SwiGLU, FlowDCN still outperforms SiT/DiT.
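The two Llama components adopted here are standard; a minimal numpy sketch (shapes and weight names are ours, not FlowDCN's) shows what each computes:

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only (no mean-centering,
    unlike LayerNorm), then apply a learned per-channel gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: a SiLU-gated linear unit in place of the
    usual two-layer MLP with a single activation."""
    silu = lambda z: z / (1 + np.exp(-z))
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```

The ablation cited above suggests these are quality refinements rather than the source of FlowDCN's gains, since the model beats SiT/DiT even without them.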
3.3 Arbitrary‑Resolution Inference
Directly running inference at resolutions far from those seen in training can cause global semantic inconsistency, because the receptive field covers a shrinking fraction of the image. We propose a maximum‑scale adjustment algorithm that enlarges the receptive field of selected blocks (via the parameter S_max) according to the target resolution, improving semantic coherence.
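The exact adjustment rule is not reproduced here; a plausible sketch, under the assumption that S_max grows linearly with the ratio of target to training side length so the receptive field covers a constant fraction of the image:

```python
def adjust_max_scale(s_max_train, train_res, target_res):
    """Hypothetical maximum-scale adjustment for arbitrary-resolution inference.

    Scales S_max of the selected blocks by the ratio of target to training
    side length, and never shrinks it below the training value.
    """
    ratio = max(target_res) / max(train_res)
    return s_max_train * max(1.0, ratio)
```

For example, doubling the side length from 256 to 512 would double S_max, while downscaled inputs keep the training-time value.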
4. Experimental Results
All training settings follow the open‑source SiT configuration, without multi‑scale training or log‑norm tricks; inference uses a simple CFG.
FlowDCN achieves linear time and memory complexity, delivering superior visual quality even with very few sampling steps. On ImageNet‑256, after only 1.5 M training steps, FlowDCN‑XL/2 (Euler solver, classifier‑free guidance) reaches FID 2.13 and sFID 4.30, matching state‑of‑the‑art results. On ImageNet‑512, after 100 K fine‑tuning steps, it attains FID 2.44 and sFID 4.53, surpassing existing diffusion models.
Compared with SiT, FlowDCN requires only 20 % of the training iterations to achieve comparable performance. Extending training to 400 K steps further improves FID to 2.00 on the 256 benchmark.
Arbitrary‑resolution experiments show that FlowDCN without dedicated multi‑scale training already rivals specialized FiT models, and with FiT‑style multi‑scale training it significantly exceeds them.
5. Conclusion and Future Work
We introduced a novel approach that decouples scale and direction prediction in deformable convolutions, enabling groupwise multi‑scale blocks and an efficient arbitrary‑resolution generator. FlowDCN reduces parameters by 8 % and FLOPs by 20 % relative to DiT/SiT while delivering comparable or better image quality. Future directions include exploring hybrid CNN‑Transformer architectures for text‑to‑image foundation models.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.