
Alibaba Cloud's Wanxiang 2.1: Open‑Source Dual‑Version Visual Generation Model with Full‑Scale Capabilities

Wanxiang 2.1, an open‑source visual generation model released by Alibaba Cloud, comes in a 14‑billion‑parameter professional version and a 1.3‑billion‑parameter consumer‑grade version. It delivers SOTA performance across multiple benchmarks, supports diverse video generation tasks, and builds on an advanced DiT‑based architecture, a 3D VAE, and efficient distributed training strategies.

Alibaba Cloud announced the open‑source release of Wanxiang 2.1, a visual generation foundation model that comes in two parameter scales: a 14‑billion‑parameter professional version for high‑quality output and a 1.3‑billion‑parameter version that runs on consumer‑grade GPUs with fast inference.

The models achieve state‑of‑the‑art results, surpassing existing open‑source and commercial solutions on benchmarks such as VBench, where the 14B model scores 86.22% and outperforms Sora, Luma, and Pika. The 1.3B model generates 480p video using only 8.2 GB of VRAM, completing a 5‑second clip on an RTX 4090 in about four minutes without quantization.

Wanxiang 2.1 supports multiple generation tasks, including text‑to‑video, image‑to‑video, video editing, text‑to‑image, and audio synthesis, and is the first open‑source model capable of rendering both Chinese and English text directly within videos.
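
To make the consumer‑grade claim concrete, here is a minimal text‑to‑video sketch based on the Hugging Face diffusers integration of the 1.3B model. The pipeline classes and the model ID come from that integration rather than from this article, so treat them as assumptions and verify against the model card.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Assumed model ID and pipeline classes from the diffusers Wan 2.1 integration.
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # helps fit the pipeline on a single consumer GPU

# Generate a short 480p clip (81 frames is roughly 5 seconds at 16 fps).
video = pipe(
    prompt="A corgi running on a beach at sunset, cinematic lighting",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "wan_t2v_sample.mp4", fps=16)
```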

The model incorporates a powerful video VAE (Wan‑VAE) that can encode and decode arbitrarily long 1080p videos while preserving temporal information. In the accompanying evaluation, which covers 14 major dimensions and 26 sub‑dimensions, Wanxiang 2.1 achieves top scores in five categories, including complex motion, physical realism, and visual quality.
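
The article does not detail Wan‑VAE's internals, but its streaming behavior rests on temporal causality: each latent frame depends only on current and past frames, so long clips can be encoded chunk by chunk. A minimal PyTorch sketch of a causal 3D convolution, padded only on the past side of the time axis, illustrates the idea (layer sizes are illustrative, not Wan's).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along time: output frame t sees only
    frames <= t, the property a causal video VAE needs to stream long clips."""
    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        # Pad width and height symmetrically, but time only on the "past" side.
        self.pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        return self.conv(F.pad(x, self.pad))

x = torch.randn(1, 3, 17, 64, 64)          # 17 RGB frames at 64x64
print(CausalConv3d(3, 16)(x).shape)        # torch.Size([1, 16, 17, 64, 64])
```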

Architecturally, Wanxiang 2.1 builds on the mainstream video DiT framework with full attention, adopts a Flow Matching training paradigm with a linear noise trajectory, and integrates a self‑developed causal 3D VAE, scalable pre‑training strategies, and large‑scale data pipelines. Training employs a combination of DP, FSDP, RingAttention, and Ulysses parallelism, with Context Parallelism (CP) handling long sequences, achieving near‑linear scaling on multi‑GPU setups.
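
The article names the training objective but not its exact form; below is a minimal sketch of Flow Matching with a linear (rectified‑flow‑style) noise trajectory in plain PyTorch. The velocity‑prediction parameterization and the model(xt, t, cond) signature are illustrative assumptions, not Wan's published training code.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """One training step of linear-trajectory flow matching.
    x1: clean video latents (B, ...); cond: text conditioning.
    The model is assumed to predict the velocity field v(x_t, t, cond)."""
    x0 = torch.randn_like(x1)                       # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform timesteps in [0, 1]
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over latent dims
    xt = (1.0 - t_b) * x0 + t_b * x1                # point on the straight line from noise to data
    v_target = x1 - x0                              # constant velocity along that line
    v_pred = model(xt, t, cond)                     # DiT forward pass (assumed signature)
    return torch.mean((v_pred - v_target) ** 2)     # regress predicted onto target velocity
```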

Data preparation involved a four‑step cleaning pipeline to construct a high‑quality, diverse image‑video dataset. Memory optimization combines layer‑wise offloading, fine‑grained gradient checkpointing, and PyTorch’s memory manager to reduce fragmentation.
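
The memory techniques are only named here, but the fine‑grained gradient (activation) checkpointing part can be sketched with PyTorch's built‑in utility; the small MLP blocks below are hypothetical stand‑ins for DiT layers, and layer‑wise offloading plus allocator tuning (e.g. setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) would be layered on top.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Runs each block under activation checkpointing: intermediate
    activations are discarded in the forward pass and recomputed during
    backward, trading extra compute for a much smaller memory footprint."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = checkpoint(block, x, use_reentrant=False)
        return x

# Hypothetical stand-ins for transformer blocks.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    for _ in range(4)
)
y = CheckpointedStack(blocks)(torch.randn(2, 256, 512))
y.sum().backward()  # gradients flow; activations were recomputed block by block
```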

During inference, the model uses CP‑based distributed acceleration and model‑splitting techniques to lower latency on multi‑GPU systems. Training stability is enhanced by Alibaba Cloud’s intelligent scheduling, slow‑node detection, and auto‑recovery, yielding a 98.23% successful restart rate with an average restart time of 39 seconds.
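
As an illustration of the model‑splitting idea (not Wan's actual implementation), the sketch below places the text encoder, diffusion transformer, and VAE decoder on separate GPUs and moves activations between them; all module names and signatures are hypothetical, and CP would additionally shard the long token sequence inside the transformer.

```python
import torch
from torch import nn

class SplitInferencePipeline(nn.Module):
    """Naive model splitting for inference: each stage lives on its own GPU,
    and intermediate tensors are moved between devices."""
    def __init__(self, text_encoder: nn.Module, dit: nn.Module, vae_decoder: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder.to("cuda:0")
        self.dit = dit.to("cuda:1")
        self.vae_decoder = vae_decoder.to("cuda:2")

    @torch.no_grad()
    def forward(self, token_ids: torch.Tensor, latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        cond = self.text_encoder(token_ids.to("cuda:0"))            # stage 1: prompt encoding
        latents = self.dit(latents.to("cuda:1"), t.to("cuda:1"),    # stage 2: denoising step(s)
                           cond.to("cuda:1"))
        return self.vae_decoder(latents.to("cuda:2"))               # stage 3: decode to frames
```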

Overall, the open‑source release of Wanxiang 2.1 marks a milestone for Alibaba Cloud, completing a full‑modal, full‑scale open‑source ecosystem that spans from large language models to multimodal visual generators.

Tags: deep learning, video generation, open-source, AI model, multimodal, visual generation
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
