
12 Kuaishou Breakthrough Papers at CVPR 2025: Video Generation, Diffusion & Multimodal AI

CVPR 2025 in Nashville will feature 12 Kuaishou papers spanning large‑scale video datasets, quality assessment, 3D/4D reconstruction, controllable generation, diffusion scaling laws, multimodal simulation, and novel benchmarks, highlighting the company's cutting‑edge contributions to video AI research.


CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition) is one of the top international conferences in computer vision. CVPR 2025 will be held from June 11 to June 15 in Nashville, Tennessee, USA. The conference received 13,008 valid submissions and accepted 2,878 papers, for an overall acceptance rate of about 22.1%.

Kuaishou has 12 papers accepted at CVPR 2025, covering video quality assessment, multimodal dataset construction and benchmarking, dynamic 3D avatar reconstruction, 4D dynamic scene simulation, video generation and enhancement, and controllable video generation and editing, among other topics.

Paper 01: Koala-36M: A Large‑scale Video Dataset Improving Consistency between Fine‑grained Conditions and Video Content

Project address: https://koala36m.github.io/

Paper address: https://arxiv.org/pdf/2410.08260

Abstract: As visual generation advances, video datasets are growing exponentially, and dataset quality is crucial for model performance. Koala‑36M is a large‑scale, high‑quality video dataset with accurate temporal splitting, detailed captions, and superior video quality. To improve consistency between fine‑grained conditions and video content, it uses linear classifiers for transition detection, structured captions averaging about 200 words, and a Video Training Suitability Score (VTSS) to filter high‑quality videos, yielding better training data and stronger trained models.

Koala‑36M illustration
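To make the score-and-filter idea behind VTSS concrete, here is a minimal sketch. The real VTSS is a learned model; the sub-scores, weights, and threshold below are all invented for illustration.

```python
# Hypothetical sketch of score-based dataset filtering in the spirit of
# Koala-36M's Video Training Suitability Score (VTSS). The real VTSS is
# learned; these sub-scores and weights are placeholders.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    aesthetic: float       # 0..1, e.g. from an aesthetic predictor
    motion: float          # 0..1, temporal-dynamics score
    text_alignment: float  # 0..1, caption-to-video similarity

def vtss(clip: Clip, weights=(0.4, 0.3, 0.3)) -> float:
    """Toy suitability score: weighted mean of the sub-scores."""
    w_a, w_m, w_t = weights
    return w_a * clip.aesthetic + w_m * clip.motion + w_t * clip.text_alignment

def filter_clips(clips, threshold=0.6):
    """Keep only clips whose score clears the threshold."""
    return [c for c in clips if vtss(c) >= threshold]
```

The point of the design is that filtering happens once, offline, so any mix of quality signals can be folded into a single scalar before training.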

Paper 02: KVQ: Boosting Video Quality Assessment via Saliency‑guided Local Perception

Paper address: https://arxiv.org/abs/2503.10259

Abstract: Video quality assessment (VQA) aims to predict perceived video quality, but motion blur and other localized distortions cause quality to vary across regions. KVQ introduces a saliency‑guided local perception framework that extracts visual saliency via window attention and adds a local perception constraint to reduce reliance on adjacent texture information. A new region‑level annotated dataset (LPVQ) is also provided. KVQ outperforms state‑of‑the‑art methods on five major VQA benchmarks.

KVQ illustration
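The core intuition, that salient regions should dominate the final score, can be sketched as a saliency-weighted pooling step. This is an illustration of the idea only, not KVQ's actual architecture; both the per-region scores and the saliency map here are placeholders.

```python
# Illustrative sketch (not KVQ's actual model): pool per-region quality
# scores with a saliency map so that salient regions dominate the result.
import numpy as np

def saliency_weighted_quality(patch_scores: np.ndarray,
                              saliency: np.ndarray) -> float:
    """Weighted average of local quality scores by normalized saliency."""
    assert patch_scores.shape == saliency.shape
    weights = saliency / saliency.sum()
    return float((patch_scores * weights).sum())
```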

Paper 03: StyleMaster: Stylize Your Video with Artistic Generation and Translation

Paper address: https://arxiv.org/pdf/2412.07744

Abstract: Existing video style control methods often produce videos far from the target style and suffer from content leakage. StyleMaster emphasizes fine‑grained texture extraction while preventing content leakage by filtering image patches based on prompt‑image similarity. It enhances global style extraction via contrastive learning on synthetic paired style data and introduces a lightweight motion adapter for static videos, enabling seamless application of image‑trained models to video generation. Experiments show significant improvements in style similarity, temporal consistency, and overall quality.

StyleMaster illustration
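The patch-filtering step can be sketched as follows. This is a hypothetical illustration of similarity-based filtering, not StyleMaster's exact module: patches whose embeddings align strongly with the prompt are treated as content and dropped, so the surviving patches carry mostly style and texture.

```python
# Hypothetical sketch of prompt-similarity patch filtering for style
# extraction (illustrative only): drop the patches most aligned with the
# text prompt (likely content), keep the rest as style carriers.
import numpy as np

def style_patches(patch_emb: np.ndarray, text_emb: np.ndarray,
                  drop_ratio: float = 0.5) -> np.ndarray:
    """patch_emb: (N, D); text_emb: (D,). Returns indices of style patches."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = normalize(patch_emb) @ normalize(text_emb)  # (N,) cosine similarity
    n_keep = max(1, int(len(sim) * (1 - drop_ratio)))
    return np.argsort(sim)[:n_keep]                   # least prompt-aligned
```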

Paper 04: Towards Precise Scaling Laws for Video Diffusion Transformers

Paper address: https://arxiv.org/pdf/2411.17470

Abstract: Training video diffusion transformers is costly; determining optimal model size and hyper‑parameters under limited budget is critical. This work systematically analyzes scaling laws for video diffusion models, revealing sensitivity to learning rate and batch size. A new scaling law predicts optimal hyper‑parameters for any model size and compute budget, achieving a 40.1% reduction in inference cost while maintaining performance, and provides a generalized relationship among validation loss, model size, and compute.

Scaling laws illustration
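As a rough illustration of the fitting mechanics behind any such law, the sketch below fits a toy power law L(C) = a·C^(−b) to (compute, validation loss) pairs and extrapolates. The paper's actual law also covers model size and optimal hyper-parameters; none of that is reproduced here.

```python
# Toy power-law fit in the spirit of scaling-law analysis: fit
# L(C) = a * C^(-b) in log-log space and extrapolate to new budgets.
import numpy as np

def fit_power_law(compute, loss):
    """Fit L = a * C^(-b) by linear regression on (log C, log L)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, compute):
    """Evaluate the fitted law at new compute budgets."""
    return a * np.asarray(compute, dtype=float) ** (-b)
```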

Paper 05: Unleashing the Potential of Multi‑modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Paper address: https://arxiv.org/pdf/2411.14423

Abstract: Realistic 4D dynamic scene simulation requires accurate material properties and physical interaction modeling. PhysFlow combines multimodal foundation models with video diffusion, using multimodal cues to initialize material parameters and a differentiable MPM solver with flow‑guided diffusion to refine them, achieving high‑fidelity simulation without relying on rendering or SDS losses.

PhysFlow illustration

Paper 06: CoMM: A Coherent Interleaved Image‑Text Dataset for Multimodal Understanding and Generation

Paper address: https://arxiv.org/abs/2406.10462

Abstract: Interleaved image‑text generation is a key multimodal task, but existing datasets lack narrative coherence and style consistency. CoMM is a high‑quality interleaved dataset built from diverse sources, filtered through multi‑view strategies to ensure textual progression, image consistency, and semantic alignment. Experiments show CoMM significantly improves multimodal large language models' few‑shot performance and supports four new evaluation tasks.

CoMM illustration

Paper 07: Libra‑Merging: Importance‑redundancy and Pruning‑merging Trade‑off for Acceleration Plug‑in in Large Vision‑Language Models

Paper address: https://cvpr.thecvf.com/virtual/2025/poster/34817

Abstract: Large vision‑language models (LVLMs) face high inference cost. Libra‑Merging introduces a position‑driven token identification mechanism that balances importance and redundancy, and an importance‑guided partition‑and‑merge strategy that avoids token distortion. On the LLaVA series, it reduces FLOPs to 37% of the original with negligible performance loss and cuts GPU training time by 57% on video understanding tasks.

Libra‑Merging illustration
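The pruning-merging trade-off can be sketched as follows. The mechanics here are illustrative, not Libra-Merging's exact method: keep the top-k most important tokens, then fold each pruned token into its most similar kept token (a running mean) rather than discarding its information outright.

```python
# Hypothetical sketch of a pruning-merging trade-off for visual tokens
# (not Libra-Merging's exact algorithm): prune to the top-k important
# tokens, then merge each dropped token into its nearest kept token.
import numpy as np

def prune_and_merge(tokens: np.ndarray, importance: np.ndarray, k: int):
    """tokens: (N, D) features; importance: (N,) scores; returns (k, D)."""
    keep = np.argsort(importance)[-k:]                 # top-k by importance
    drop = np.setdiff1d(np.arange(len(tokens)), keep)
    kept = tokens[keep].copy()
    counts = np.ones(k)

    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # cosine similarity of each dropped token to the original kept tokens
    sim = normalize(tokens[drop]) @ normalize(kept).T  # (len(drop), k)
    for row, d in enumerate(drop):
        j = int(sim[row].argmax())                     # nearest kept token
        # running-mean merge so no dropped token's content is lost entirely
        kept[j] = (kept[j] * counts[j] + tokens[d]) / (counts[j] + 1)
        counts[j] += 1
    return kept
```

The design point is that merging preserves information that pure pruning would throw away, at the cost of slightly blurring the surviving tokens.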

Paper 08: GPAvatar: High‑fidelity Head Avatars by Learning Efficient Gaussian Projections

Paper address: https://openaccess.thecvf.com//content/CVPR2025/papers/Feng_GPAvatar_High-fidelity_Head_Avatars_by_Learning_Efficient_Gaussian_Projections_CVPR_2025_paper.pdf

Abstract: Existing radiance‑field avatar methods rely on explicit priors or neural implicit representations, limiting fidelity, efficiency, and memory usage. GPAvatar learns a linear projection from high‑dimensional Gaussians to 3D space, enabling efficient point‑based rendering with adaptive densification in highly expressive regions. It outperforms state‑of‑the‑art methods in rendering quality, speed, and memory usage on three datasets.

GPAvatar illustration

Paper 09: PatchVSR: Breaking Video Diffusion Resolution Limits with Patch‑wise Video Super‑Resolution

Paper address: https://openaccess.thecvf.com//content/CVPR2025/papers/Du_PatchVSR_Breaking_Video_Diffusion_Resolution_Limits_with_Patch-wise_Video_Super-Resolution_CVPR_2025_paper.pdf

Abstract: Video diffusion models excel at VSR but face high computation cost and fixed output resolution. PatchVSR introduces a dual‑stream adapter with patch‑wise processing and a patch position encoding mechanism, enabling 4K super‑resolution from a 512×512 base model while maintaining efficiency and visual consistency across patches.

PatchVSR illustration
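The split/upscale/stitch pipeline can be sketched in a few lines. The "upscaler" below is plain nearest-neighbour repetition standing in for a diffusion-based SR model, and the toy patch size replaces the 512×512 base resolution; only the pipeline shape is illustrative of the paper.

```python
# Illustrative patch-wise SR pipeline: split a frame into fixed-size
# patches, "super-resolve" each with its normalized position, and stitch
# the results back at the target resolution. The upscaler is a
# nearest-neighbour placeholder, not a diffusion model.
import numpy as np

PATCH = 4   # toy patch size (PatchVSR works at the base model's 512x512)
SCALE = 2   # toy upscaling factor

def upscale_patch(patch, pos):
    """Placeholder SR: nearest-neighbour; a real model would also use pos."""
    return np.kron(patch, np.ones((SCALE, SCALE)))

def patchwise_sr(frame):
    h, w = frame.shape
    out = np.zeros((h * SCALE, w * SCALE))
    for y in range(0, h, PATCH):
        for x in range(0, w, PATCH):
            pos = (y / h, x / w)  # normalized patch position in the frame
            sr = upscale_patch(frame[y:y + PATCH, x:x + PATCH], pos)
            out[y * SCALE:(y + PATCH) * SCALE,
                x * SCALE:(x + PATCH) * SCALE] = sr
    return out
```

Because each patch is processed independently, peak memory is bounded by the patch size rather than the output resolution, which is the efficiency argument behind patch-wise VSR.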

Paper 10: SeriesBench: A Benchmark for Narrative‑Driven Drama Series Understanding

Paper address: https://stan-lei.github.io/KwaiMM-Dialogue/paper2-seriesbench.html

Abstract: Existing VideoQA focuses on isolated clips and visual elements. SeriesBench introduces 105 narrative‑driven series videos covering 28 professional tasks that require deep story understanding. A novel narrative reasoning framework (PC‑DCoT) achieves significant performance gains on this benchmark.

SeriesBench illustration

Paper 11: SketchVideo: Sketch‑based Video Generation and Editing

Project address: http://geometrylearning.com/SketchVideo/

Paper address: https://arxiv.org/pdf/2503.23284

Abstract: Controlling global layout and geometry in text‑to‑video generation remains challenging. SketchVideo enables spatial and motion control via user‑drawn sketches on one or two keyframes, propagating conditions across frames with cross‑frame attention. A video insertion module ensures seamless editing while preserving unedited regions, achieving superior controllable video synthesis.

SketchVideo illustration

Paper 12: STDD: Spatio‑Temporal Dual Diffusion for Video Generation

Paper address: https://cvpr.thecvf.com/virtual/2025/poster/35022

Abstract: Existing video diffusion methods focus on spatial diffusion only. STDD extends diffusion to a spatio‑temporal joint process, deriving analytically tractable forward and reverse processes that propagate information across frames, improving temporal consistency. Accelerated sampling reduces inference cost, and experiments show superior performance on video generation, prediction, and text‑to‑video tasks.

STDD illustration
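One simple way to picture joint spatio-temporal noising, purely as an illustration of the idea and not STDD's actual derivation, is to replace i.i.d. per-frame noise with temporally correlated noise, so that adjacent frames share structure from the start of the diffusion process.

```python
# Minimal sketch of temporally correlated diffusion noise (an AR(1)
# illustration of joint spatio-temporal noising, not STDD's process):
# adjacent frames share noise structure controlled by rho, while each
# frame keeps unit marginal variance.
import numpy as np

def spatiotemporal_noise(T, shape, rho=0.9, rng=None):
    """Returns noise of shape (T, *shape) with unit marginal variance."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal((T, *shape))
    out = np.empty_like(eps)
    out[0] = eps[0]
    for t in range(1, T):
        # mix the previous frame's noise with fresh noise; the sqrt term
        # keeps the per-frame variance at 1 for any rho in [0, 1)
        out[t] = rho * out[t - 1] + np.sqrt(1 - rho**2) * eps[t]
    return out
```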

CVPR 2025 Tutorial: From Video Generation to World Models

On June 11, Kuaishou and NTU MMLab will host a CVPR tutorial titled "From Video Generation to World Models". The talk, delivered by Wan Pengfei (Head of Visual Generation & Interaction at Kuaishou), will discuss how video generation can serve as a foundation for next‑generation intelligent systems that understand, interact with, and model the real world. More information is available at https://world-model-tutorial.github.io/.

Tags: multimodal AI, computer vision, video generation, diffusion models, large-scale datasets
Written by Kuaishou Large Model (Official Kuaishou Account)