
CineMaster: A 3D‑Aware and Controllable Framework for Cinematic Text‑to‑Video Generation

Researchers introduce CineMaster, a SIGGRAPH‑2025 paper presenting a 3D‑aware, controllable text‑to‑video generation framework that lets users define target objects and camera motions via an interactive workflow, enabling cinematic video creation with high‑quality, user‑directed results.

Kuaishou Tech

Video generation models such as Sora and Kling (Keling) have shown impressive performance, allowing creators to produce high‑quality videos from text alone. Traditional filmmaking, however, involves a director arranging multiple moving subjects and camera angles within a scene, a capability that current text‑to‑video models lack.

To address this gap, the Keling research team proposes CineMaster, a movie‑grade text‑to‑video generation framework accepted to SIGGRAPH 2025. CineMaster enables users to control both 3D objects and camera motion through an interactive workflow, allowing professional‑level scene layout and motion specification.

Paper title: CineMaster: A 3D‑Aware and Controllable Framework for Cinematic Text‑to‑Video Generation
Paper URL: https://arxiv.org/abs/2502.08639
Project page: https://cinemaster-dev.github.io/

1. Joint Object‑Camera Control

Demo results (originally animated GIFs): a) joint object–camera control, b) object motion control, c) camera motion control. CineMaster generates videos that follow fine‑grained multimodal control signals, supporting large‑scale object and camera movements.

2. CineMaster Framework

The framework follows a two‑stage workflow:

Stage 1: Users interactively adjust 3D bounding boxes and camera positions in a 3D space, exporting camera trajectories and per‑frame depth maps as conditioning signals.
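To make Stage 1 concrete, the sketch below rasterizes one user-placed 3D bounding box into a coarse per-frame depth map under a pinhole camera model. This is a minimal illustration of how box-plus-camera edits could be exported as depth conditioning, not the paper's actual renderer; the function name, corner layout, and fill strategy are all assumptions.

```python
import numpy as np

def render_box_depth(corners_world, K, w2c, hw=(64, 64)):
    """Rasterize one 3D bounding box into a coarse depth map for a frame.

    corners_world: (8, 3) box corners in world coordinates (illustrative layout).
    K: (3, 3) pinhole intrinsics; w2c: (4, 4) world-to-camera extrinsics.
    """
    h, w = hw
    # Transform corners into camera space with the frame's extrinsics.
    pts = np.hstack([corners_world, np.ones((8, 1))]) @ w2c.T  # (8, 4)
    cam = pts[:, :3]
    depth = cam[:, 2]
    # Project corners with the pinhole model: u = fx*x/z + cx, v = fy*y/z + cy.
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    # Fill the box's 2D extent with its nearest depth (a deliberately coarse proxy).
    u0, v0 = np.clip(uv.min(0).astype(int), 0, [w - 1, h - 1])
    u1, v1 = np.clip(uv.max(0).astype(int), 0, [w - 1, h - 1])
    dmap = np.zeros((h, w), dtype=np.float32)
    dmap[v0:v1 + 1, u0:u1 + 1] = depth.min()
    return dmap
```

Repeating this per frame, for each box, along the user-edited camera trajectory yields a depth-map sequence of the kind Stage 1 exports as a conditioning signal.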

Stage 2: A semantic layout ControlNet integrates object motion signals and class labels, while a Camera Adapter incorporates global camera motion, enabling precise control over each target’s movement.
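The two injection paths in Stage 2 can be sketched as a toy block: layout features enter as a zero-initialized residual (the standard ControlNet trick, so control starts as a no-op), while a small adapter embeds the per-frame camera pose additively. All module names and dimensions here are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Toy diffusion block with two conditioning paths (illustrative only):
    a semantic-layout residual (ControlNet-style) and a camera-pose adapter."""

    def __init__(self, dim=64, cam_dim=12):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)       # stand-in for a DiT/UNet block
        self.control_proj = nn.Linear(dim, dim)   # zero-init: control starts inert
        nn.init.zeros_(self.control_proj.weight)
        nn.init.zeros_(self.control_proj.bias)
        self.cam_adapter = nn.Linear(cam_dim, dim)  # embeds per-frame camera pose

    def forward(self, x, control_feat, cam_pose):
        # Global camera motion is injected additively into the features...
        h = x + self.cam_adapter(cam_pose)
        h = self.backbone(h)
        # ...while per-object layout control enters as a residual branch.
        return h + self.control_proj(control_feat)
```

Zero-initializing the control projection means the pretrained video model's behavior is unchanged at the start of fine-tuning, and the layout signal is learned gradually.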

3. Training Data Construction Pipeline

Step 1: Enhance open‑vocabulary object detection (Grounding DINO) with Qwen2‑VL entity descriptions, then run video instance segmentation with SAM 2.

Step 2: Estimate absolute (metric) depth with Depth Anything V2.

Step 3: At the frame where each object's mask is largest, back‑project the masked pixels through the depth map and fit a 3D bounding box to the resulting points.

Step 4: Use SpatialTracker for 3D point tracking and obtain camera trajectories via MonST3R.
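The box-fitting step above amounts to inverting the pinhole projection for every masked pixel and taking the axis-aligned extent of the lifted points. The sketch below shows that core operation under simplified assumptions (camera-space box, no outlier filtering); the function name is hypothetical.

```python
import numpy as np

def mask_depth_to_bbox3d(mask, depth, K):
    """Lift a 2D instance mask plus a depth map to an axis-aligned 3D box
    in camera coordinates (simplified sketch of the dataset pipeline).

    mask: (H, W) boolean instance mask; depth: (H, W) metric depth; K: intrinsics.
    """
    v, u = np.nonzero(mask)          # pixel coordinates inside the mask
    z = depth[v, u]                  # per-pixel metric depth
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Inverse pinhole projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    return pts.min(0), pts.max(0)    # (xyz_min, xyz_max) corners of the box
```

A production pipeline would additionally filter depth outliers and transform the box into a shared world frame using the recovered camera poses.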

4. Comparison Results

Compared with baseline methods, CineMaster uniquely associates motion conditions with specific targets and decouples object and camera motion, producing higher‑quality videos that satisfy textual prompts and control signals.

5. Conclusion

The authors aim to provide powerful 3D‑aware controllable video generation, allowing users to act like professional directors. They designed an interactive 3D workflow, built a multimodal conditional video generation model, and created a data pipeline for extracting 3D control signals from arbitrary videos, offering valuable insights for the research community.

Tags: computer vision, text-to-video, controllable generation, 3D-aware, AI video, CineMaster
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
