Top 12 Cutting-Edge Video Generation Papers from Kuaishou at CVPR 2025
This article highlights CVPR 2025's acceptance statistics and showcases twelve cutting-edge video-generation papers from Kuaishou. The papers span datasets, quality assessment, style control, scaling laws, 4D simulation, interleaved image-text data, vision-language acceleration, high-fidelity avatars, patch-wise super-resolution, narrative-driven benchmarks, sketch-based editing, and spatio-temporal diffusion; each entry includes links and an abstract.
CVPR (IEEE Conference on Computer Vision and Pattern Recognition) is a top academic conference in computer vision. CVPR 2025 will be held June 11-15 in Nashville, Tennessee, USA; the conference received 13,008 valid submissions and accepted 2,878 papers (a 22.1% acceptance rate).
Kuaishou has twelve papers accepted at CVPR 2025, covering video quality assessment, multimodal dataset construction, dynamic 3D avatar reconstruction, dynamic 4D scene simulation, video generation and enhancement, controllable video generation and editing, and more.
Paper 01: Koala-36M: A Large‑scale Video Dataset Improving Consistency between Fine‑grained Conditions and Video Content
Project URL: https://koala36m.github.io/
Paper URL: https://arxiv.org/pdf/2410.08260
Abstract: As video generation technology advances, the scale of video datasets grows exponentially, and dataset quality is crucial for model performance. Koala-36M is a large-scale, high-quality video dataset with accurate temporal segmentation, detailed captions (average 200 characters), and superior video quality. It improves fine-grained condition-video consistency using a linear classifier for probability distribution analysis, structured captions, and a Video Training Suitability Score (VTSS) to filter high-quality videos. Experiments show the processing pipeline significantly enhances dataset quality.
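Score-based filtering of this kind can be sketched in a few lines. This is an illustrative toy, not the paper's pipeline: the sub-metric names, weights, and threshold below are assumptions, standing in for whatever signals VTSS actually combines.

```python
# Hypothetical sketch of suitability-score filtering for a video dataset.
# Sub-metrics ("aesthetic", "motion", "clarity"), weights, and threshold
# are illustrative placeholders, not Koala-36M's actual VTSS definition.

def suitability_score(clip, weights=None):
    """Combine per-clip quality sub-scores into one scalar score."""
    weights = weights or {"aesthetic": 0.4, "motion": 0.3, "clarity": 0.3}
    return sum(weights[k] * clip[k] for k in weights)

def filter_clips(clips, threshold=0.6):
    """Keep only clips whose combined score clears the threshold."""
    return [c for c in clips if suitability_score(c) >= threshold]

clips = [
    {"id": "a", "aesthetic": 0.9, "motion": 0.8, "clarity": 0.7},
    {"id": "b", "aesthetic": 0.2, "motion": 0.5, "clarity": 0.4},
]
kept = filter_clips(clips)  # only clip "a" clears 0.6
```

The point is the shape of the pipeline: score every clip once, then apply a single threshold, so the dataset can be re-filtered cheaply as the scoring model improves.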
Paper 02: KVQ: Boosting Video Quality Assessment via Saliency‑guided Local Perception
Paper URL: https://arxiv.org/abs/2503.10259
Abstract: Video quality assessment (VQA) predicts perceived quality, which is increasingly important for streaming platforms. KVQ introduces a saliency‑guided local perception framework that extracts visual saliency with window attention and applies local perception constraints to reduce reliance on neighboring information. A new region‑level annotated dataset (LPVQ) is constructed, and KVQ outperforms state‑of‑the‑art methods on five VQA benchmarks.
Paper 03: StyleMaster: Stylize Your Video with Artistic Generation and Translation
Paper URL: https://arxiv.org/pdf/2412.07744
Abstract: Existing video style‑transfer methods often produce results far from the target style and suffer from content leakage. StyleMaster emphasizes local texture extraction while preserving content by filtering image patches based on prompt‑image similarity. Global style is enhanced via contrastive learning on a generated paired‑style dataset. A lightweight motion adapter trained on static videos bridges the gap between image and video, enabling high‑fidelity, temporally consistent stylized videos and supporting ControlNet‑based style transfer.
Paper 04: Towards Precise Scaling Laws for Video Diffusion Transformers
Paper URL: https://arxiv.org/pdf/2411.17470
Abstract: Training video diffusion transformers is costly; precise scaling laws are needed to choose model size and hyper‑parameters under limited compute. This work demonstrates the existence of scaling laws for video diffusion models and reveals higher sensitivity to learning rate and batch size compared to language models. A new scaling law predicts optimal hyper‑parameters, reducing inference FLOPs by 40.1% under a 1e10 TFlops budget while maintaining performance.
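A scaling law of this sort is typically a power law fit in log-log space. The snippet below shows the generic fitting procedure on synthetic data; the functional form `L(C) = a * C**b` and the constants are assumptions for illustration, not the paper's actual law or coefficients.

```python
import numpy as np

# Hedged sketch: recovering power-law scaling L(C) = a * C**b from
# (compute, loss) pairs via linear regression in log-log space.
# The data here is synthetic; the true paper fit will differ.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 5.0 * compute ** -0.05  # synthetic power law, a=5.0, b=-0.05

# log L = log a + b * log C, so a degree-1 polyfit recovers (b, log a).
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
```

Once fitted, such a curve lets one extrapolate the loss (or, with a second fit, optimal hyper-parameters) to compute budgets that were never trained directly.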
Paper 05: Unleashing the Potential of Multi‑modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation
Paper URL: https://arxiv.org/pdf/2411.14423
Abstract: Accurate 4D dynamic scene simulation requires diverse material properties and precise physical interaction modeling. PhysFlow combines multimodal foundation models with video diffusion to identify material types, initialize parameters via image queries, and generate fine‑grained scene representations using 3D Gaussian splats. Differentiable MPM and flow‑guided diffusion optimize material parameters without relying on rendering or SDS losses, achieving realistic dynamic interactions.
Paper 06: CoMM: A Coherent Interleaved Image‑Text Dataset for Multimodal Understanding and Generation
Paper URL: https://arxiv.org/abs/2406.10462
Abstract: Interleaved image‑text generation demands coherent, consistent, and well‑aligned sequences. CoMM provides a high‑quality interleaved dataset filtered through multi‑view strategies using pretrained models to ensure textual development, image consistency, and semantic alignment. Extensive evaluations show CoMM significantly improves multimodal large language models’ contextual learning and supports four new tasks for comprehensive assessment.
Paper 07: Libra‑Merging: Importance‑redundancy and Pruning‑merging Trade‑off for Acceleration Plug‑in in Large Vision‑Language Model
Paper URL: https://cvpr.thecvf.com/virtual/2025/poster/34817
Abstract: Large vision-language models (LVLMs) face high inference costs. Libra-Merging introduces a position-driven token identification mechanism to balance importance and redundancy, and an importance-guided grouping-merging strategy that preserves key information while avoiding distortion. On LLaVA models, Libra-Merging reduces FLOPs to 37% of the original with negligible performance loss and cuts GPU training time by 57%.
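The pruning-vs-merging trade-off can be illustrated with a toy token reducer: keep the top-k most important tokens, and instead of discarding the rest, fold each one into its most similar kept token. The importance scores and dot-product similarity below are stand-ins, not Libra-Merging's position-driven criterion.

```python
import numpy as np

# Illustrative token reduction: prune to top-k by importance, then merge
# each dropped token into its most similar kept token (running average).
# Scores and similarity are toy stand-ins for the paper's actual mechanism.

def reduce_tokens(tokens, importance, keep):
    order = np.argsort(importance)[::-1]          # most important first
    kept_idx, drop_idx = order[:keep], order[keep:]
    merged = tokens[kept_idx].copy()
    counts = np.ones(keep)
    for i in drop_idx:
        sims = tokens[kept_idx] @ tokens[i]       # similarity to kept tokens
        j = int(np.argmax(sims))
        # fold the dropped token into kept token j as a running mean
        merged[j] = (merged[j] * counts[j] + tokens[i]) / (counts[j] + 1)
        counts[j] += 1
    return merged

tokens = np.eye(4)                                # 4 toy tokens of dim 4
importance = np.array([0.9, 0.1, 0.8, 0.2])
out = reduce_tokens(tokens, importance, keep=2)   # 4 tokens -> 2
```

Pure pruning would throw the low-importance tokens away; merging keeps a diluted trace of them, which is the distortion-vs-information trade-off the paper's strategy is designed to balance.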
Paper 08: GPAvatar: High‑fidelity Head Avatars by Learning Efficient Gaussian Projections
Paper URL: https://openaccess.thecvf.com//content/CVPR2025/papers/Feng_GPAvatar_High-fidelity_Head_Avatars_by_Learning_Efficient_Gaussian_Projections_CVPR_2025_paper.pdf
Abstract: Existing radiance-field avatar methods rely on explicit priors or neural implicit representations, limiting fidelity and efficiency. GPAvatar proposes a Gaussian point-rendering approach that embeds high-dimensional Gaussians (position + expression) and learns a linear projection back to 3D space, capturing arbitrary poses and expressions. An adaptive densification strategy allocates Gaussians to regions with large expression changes, improving facial detail while reducing memory and computation.
Paper 09: PatchVSR: Breaking Video Diffusion Resolution Limits with Patch‑wise Video Super‑Resolution
Paper URL: https://openaccess.thecvf.com//content/CVPR2025/papers/Du_PatchVSR_Breaking_Video_Diffusion_Resolution_Limits_with_Patch-wise_Video_Super-Resolution_CVPR_2025_paper.pdf
Abstract: Video diffusion models excel at generation but are inefficient for full-resolution video super-resolution (VSR). PatchVSR introduces a dual-stream adapter that processes local patches for detail preservation and a global branch for contextual semantics. Patch position encoding and multi-patch joint modulation enable 4K SR from a 512×512 base model with high efficiency.
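The core idea of patch-wise SR, splitting a frame, upscaling each patch independently, and stitching the results, can be sketched as below. Nearest-neighbour upsampling stands in for the diffusion SR model, and the patch size and scale factor are arbitrary choices, not PatchVSR's settings.

```python
import numpy as np

# Sketch of patch-wise super-resolution: tile the frame, upscale each tile,
# and place the results back on the output grid. The nearest-neighbour
# "upscale" is a placeholder for a diffusion-based SR model.

def upscale_patch(patch, scale):
    return patch.repeat(scale, axis=0).repeat(scale, axis=1)

def patchwise_sr(frame, patch=2, scale=2):
    h, w = frame.shape
    out = np.zeros((h * scale, w * scale), dtype=frame.dtype)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            sr = upscale_patch(frame[y:y + patch, x:x + patch], scale)
            out[y * scale:(y + patch) * scale,
                x * scale:(x + patch) * scale] = sr
    return out

frame = np.arange(16).reshape(4, 4)
sr = patchwise_sr(frame)  # 4x4 input -> 8x8 output
```

With a purely local upscaler the stitched result matches full-frame upscaling exactly; with a generative model it would not, which is why PatchVSR needs the global semantic branch and joint modulation to keep patches mutually consistent.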
Paper 10: SeriesBench: A Benchmark for Narrative‑Driven Drama Series Understanding
Paper URL: https://stan-lei.github.io/KwaiMM-Dialogue/paper2-seriesbench.html
Abstract: Current VideoQA benchmarks focus on isolated clips. SeriesBench provides 105 narrative‑driven series videos and 28 tasks requiring deep narrative reasoning. The proposed PC‑DCoT framework achieves significant gains on this multi‑video QA benchmark.
Paper 11: SketchVideo: Sketch‑based Video Generation and Editing
Project URL: http://geometrylearning.com/SketchVideo/
Paper URL: https://arxiv.org/pdf/2503.23284
Abstract: SketchVideo enables spatial and motion control of video generation and fine‑grained editing via sketch inputs. A memory‑efficient control structure predicts residual features for DiT blocks, and cross‑frame attention propagates sparse sketch conditions across frames. A video insertion module ensures seamless integration of edited content.
Paper 12: STDD: Spatio‑Temporal Dual Diffusion for Video Generation
Paper URL: https://cvpr.thecvf.com/virtual/2025/poster/35022
Abstract: STDD extends diffusion models to a spatio‑temporal joint diffusion process, analytically deriving forward and reverse processes and proposing accelerated sampling. By allowing information flow from previous frames, STDD improves temporal consistency and outperforms existing methods on video generation, prediction, and text‑to‑video tasks.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.