AniSora: An Integrated System for Anime Video Generation with Data Flywheel, Controllable Diffusion Models, and Evaluation Benchmark
AniSora combines a 10-million-pair anime text-video dataset, a controllable diffusion transformer whose temporal-mask conditioning enables text-to-video generation, frame interpolation, and region-guided animation, and a 948-video evaluation benchmark. It delivers industry-leading character and motion consistency and already powers low-cost dynamic-comic production for multiple IPs.
Paper arXiv link: https://arxiv.org/abs/2412.10255. Project homepage: https://github.com/bilibili/Index-anisora.
Article Overview
Animation is a crucial domain in the film industry. While recent video generation models such as Sora, Kling (可灵), and Zhipu's Qingying (智谱清影) have succeeded on real-world video, they fall short on animated content, whose distinctive artistic styles, exaggerated motions, and non-physical rules also make evaluation especially challenging.
We propose AniSora, a comprehensive system designed for anime video generation. AniSora consists of a data flywheel (>10 M high‑quality text‑video pairs), a controllable diffusion‑transformer model equipped with a temporal‑mask module that enables key animation functions (text‑to‑video, video interpolation, and region‑guided animation), and a dedicated evaluation benchmark built from 948 diverse anime videos. Automatic VBench metrics and double‑blind human studies demonstrate industry‑leading consistency in character and motion.
Business Impact
The model powers dynamic-comic (动态漫) production and is already used in more than ten proprietary IP works, reducing cost, lowering entry barriers, and achieving millions of views across platforms.
Dataset Construction
We built the anime dataset by processing 1 M raw anime clips. Scene detection splits the videos into segments, which then pass through four filtering rules (text coverage, optical-flow score, aesthetic score, and duration). Approximately 10 % of the resulting segments (over 10 M) survive all filters. Separately, a benchmark of 948 videos (857 2D, 91 3D) was created, with manually curated action labels and prompt texts generated by Qwen-VL2 and manually corrected.
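As a rough illustration of the filtering stage, the sketch below applies the four rules to scene-detected segments. It is a minimal reconstruction, not the production pipeline: the thresholds are assumptions, and in practice the scores would come from the cited tools (CRAFT for text detection, RAFT for optical flow, a pretrained aesthetic predictor).

```python
# Minimal sketch of the four filtering rules applied after scene detection.
# Thresholds are illustrative assumptions, not the values used by AniSora.

from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    duration_s: float     # segment length in seconds
    text_coverage: float  # fraction of frame area covered by detected text
    flow_score: float     # mean optical-flow magnitude (motion proxy)
    aesthetic: float      # aesthetic-predictor score

def passes_filters(seg: Segment) -> bool:
    return (
        seg.text_coverage <= 0.05          # rule 1: not dominated by subtitles/text
        and seg.flow_score >= 0.5          # rule 2: enough motion to learn from
        and seg.aesthetic >= 4.5           # rule 3: sufficient visual quality
        and 2.0 <= seg.duration_s <= 20.0  # rule 4: usable duration range
    )

segments = [
    Segment("ep01_000.mp4", 6.3, 0.01, 1.8, 5.2),
    Segment("ep01_001.mp4", 1.2, 0.00, 0.9, 5.0),  # too short
    Segment("ep01_002.mp4", 8.0, 0.20, 1.1, 4.9),  # subtitle-heavy
]
kept = [s.path for s in segments if passes_filters(s)]
print(kept)  # ['ep01_000.mp4']
```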
Method
1. Diffusion-Transformer (DiT) Backbone – We adopt a DiT-based text-to-video diffusion model. A 3D Causal VAE compresses the video spatio-temporally, a patchify step splits the latents into spatio-temporal patches, and 3D full attention models long-range dependencies. Training uses the v-prediction loss for stability (the standard target is recalled after this list).
2. Temporal-Mask Conditional Model – A temporal mask marks which frames are guided; the model receives the noise, the encoded mask, T5 text features, and guided-frame embeddings, enabling tasks such as key-frame interpolation and region-controlled animation (see the minimal conditioning sketch after this list).
3. Motion‑Region Control – Foreground masks are generated per frame using a foreground detector, merged across frames, and used to condition the diffusion model via LoRA (0.27 B parameters), allowing precise control of moving regions.
4. Supervised Fine-Tuning – Starting from weights pretrained on 35 M video clips, we fine-tune on the anime dataset with a weak-to-strong schedule (8 fps → 16 fps, 480p → 720p) and an additional 1 M ultra-clean, high-quality clips. Auxiliary measures such as subtitle removal, multi-resolution temporal training, and multi-task learning (image + video) improve quality and style generalization.
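For reference, the v-prediction loss mentioned in step 1 is the standard objective: with forward process \(x_t = \alpha_t x_0 + \sigma_t \epsilon\), the network regresses

\[ v_t = \alpha_t\,\epsilon - \sigma_t\,x_0, \qquad \mathcal{L} = \lVert v_\theta(x_t, t, c) - v_t \rVert_2^2, \]

which stays well-scaled across the whole noise schedule.

To make the temporal-mask conditioning of step 2 concrete, here is a minimal PyTorch-style sketch of how the latent inputs could be assembled for one denoising step. The tensor shapes, channel layout, and concatenation order are illustrative assumptions; the paper specifies only that noise, the encoded mask, T5 text features, and guided-frame embeddings are fed to the DiT (the T5 features would enter separately, e.g. via cross-attention).

```python
# Minimal sketch of temporal-mask conditioning for a DiT video model.
# Shapes and the channel-concat layout are assumptions for illustration.

import torch

def build_dit_input(noisy_latents, guided_latents, frame_mask):
    """
    noisy_latents:  (B, T, C, H, W) noised video latents from the 3D causal VAE
    guided_latents: (B, T, C, H, W) VAE latents of the guidance frames
                    (zeros at frames that are not guided)
    frame_mask:     (B, T) with 1 where a frame is guided
    """
    B, T, C, H, W = noisy_latents.shape
    # Broadcast the per-frame mask to a spatial mask channel.
    mask_channel = frame_mask.view(B, T, 1, 1, 1).expand(B, T, 1, H, W)
    # Guided frames keep their clean latents; unguided frames carry only noise.
    guided = guided_latents * mask_channel
    # Channel-wise concat: [noised latents | guided latents | mask].
    return torch.cat([noisy_latents, guided, mask_channel], dim=2)

# Example: key-frame interpolation guides the first and last frames.
B, T, C, H, W = 1, 16, 8, 30, 45
mask = torch.zeros(B, T)
mask[:, 0] = mask[:, -1] = 1.0
x = build_dit_input(torch.randn(B, T, C, H, W),
                    torch.randn(B, T, C, H, W), mask)
print(x.shape)  # torch.Size([1, 16, 17, 30, 45])
```

Setting the mask on the first frame only recovers image-to-video; first plus last frames give key-frame interpolation.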
Experimental Results
Automatic evaluation uses eight VBench dimensions (motion smoothness, aesthetic quality, imaging quality, subject consistency, text-video consistency, etc.) plus a motion-amplitude model. AniSora outperforms six recent methods on subject consistency and motion smoothness and is competitive on the remaining metrics. A double-blind human evaluation across six criteria (visual smoothness, action, attractiveness, text-video alignment, image-text alignment, protagonist consistency) shows AniSora surpassing competitors in most aspects.
Motion-region control is quantitatively validated on 200 samples, achieving high mask precision (the percentage of generated motion pixels that fall inside the target mask).
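As a sketch of that metric, mask precision can be computed as the share of pixels with noticeable frame-to-frame change that fall inside the target region. Simple frame differencing and the motion threshold below are illustrative assumptions; the paper does not specify how motion pixels are extracted.

```python
# Minimal sketch of the mask-precision metric: the share of "moving" pixels
# that fall inside the user-specified motion region. Frame differencing and
# the motion threshold are illustrative assumptions.

import numpy as np

def mask_precision(frames: np.ndarray, region_mask: np.ndarray,
                   motion_thresh: float = 10.0) -> float:
    """
    frames:      (T, H, W) grayscale frames of the generated clip
    region_mask: (H, W) boolean mask of the allowed motion region
    """
    frames = frames.astype(np.float32)
    # Pixels whose intensity changes noticeably between consecutive frames.
    motion = np.abs(np.diff(frames, axis=0)) > motion_thresh  # (T-1, H, W)
    moving = motion.any(axis=0)
    if not moving.any():
        return 1.0  # no motion at all: trivially inside the mask
    return float((moving & region_mask).sum() / moving.sum())
```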
Analysis reveals that 3D anime samples score higher than 2D ones, likely because their physics-based rendering is closer to the real-world footage seen during pretraining. Multi-task training with a few style-specific images also boosts stability and visual quality for unique artistic styles.
Business Scenario
Dynamic-comic production benefits from low-cost, fast AI generation (3–5 minutes of generation time per minute of content) compared with traditional pipelines (¥1,000–10,000 per minute and weeks of work). Over ten IP titles have already leveraged this capability, reaching tens of millions of views and attracting roughly 6 K users at industry events.
Future Work
Plans include an automatic scoring system tailored to animation perception, multimodal control (camera motion, trajectories, audio), and reinforcement-learning-based fine-tuning to compensate for the scarcity of high-quality anime data.
References (selected) – [1] PySceneDetect, [2] CRAFT text detection, [3] RAFT optical flow, [4] Aesthetic predictor, [5] Qwen‑VL2, [6] DiT diffusion models, [7] CogVideoX, [8] Salient object detection, [9] VBench benchmark.