DVIS: Decoupled Framework that Sets New SOTA in Video Instance Segmentation
DVIS introduces a decoupled video instance segmentation framework that splits the task into segmentation, tracking, and refinement modules. It achieves state-of-the-art performance across VIS, VPS, and VSS benchmarks with low computational overhead, and it is robust in both online and offline settings.
Abstract
Video segmentation extends image segmentation to simultaneously segment, detect, and track all objects in a video, offering temporally stable and accurate results crucial for editing, autonomous driving, and surveillance.
Background
Transformer‑based models such as DETR have advanced object detection and instance segmentation. VisTR applied Transformers to video instance segmentation (VIS), making Transformer methods mainstream in this field.
Online vs Offline Methods
Online methods use current and past frames for real‑time tasks like autonomous driving, while offline methods can leverage any frame for tasks like video editing.
Limitations of Existing SOTA
Current online SOTA methods (e.g., MinVIS, IDOL) perform image segmentation first and then associate instances frame by frame, failing to exploit video context. Offline SOTA methods (e.g., SeqFormer, Mask2Former‑VIS, VITA, IFC) use tightly coupled end‑to‑end networks but struggle with long videos and heavy occlusions, as shown by tracking errors in Mask2Former‑VIS.
Proposed DVIS Framework
DVIS decouples VIS into three sub‑tasks: segmentation, tracking, and temporal refinement, implemented by three modules – a segmenter, a Referring Tracker, and a Temporal Refiner.
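As a rough sketch of how the three modules compose, the decoupled flow might look like the following (the interfaces are hypothetical placeholders; in DVIS each stage is a neural network):

```python
def dvis_pipeline(frames, segmenter, tracker, refiner, offline=True):
    """Schematic of DVIS's decoupled flow (hypothetical interfaces).

    segmenter: frame -> object queries (frame-independent)
    tracker:   (reference, current queries) -> identity-aligned queries
    refiner:   list of aligned queries -> temporally refined queries
    """
    # 1) Segmenter: per-frame object representations, no temporal context.
    per_frame = [segmenter(f) for f in frames]

    # 2) Referring Tracker: align each frame's queries to the previous
    #    frame's identities (reference denoising).
    aligned = []
    reference = per_frame[0]
    for queries in per_frame:
        reference = tracker(reference, queries)
        aligned.append(reference)

    # 3) Temporal Refiner: needs the whole video, so it only runs in
    #    offline mode; online mode stops after tracking.
    return refiner(aligned) if offline else aligned
```

The decoupling is what makes the dual mode possible: dropping the refiner degrades gracefully to an online tracker rather than breaking the pipeline.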
The segmenter adopts Mask2Former to extract object features per frame. The Referring Tracker treats inter‑frame association as a reference denoising/reconstruction problem, using Referring Cross Attention:
RCA(ID, Q, K, V) = ID + MHA(Q, K, V)

The Temporal Refiner aggregates aligned object queries with 1‑D convolutions and self‑attention to refine masks and tracks.
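A minimal single‑head NumPy sketch of this residual attention (head splitting and learned projections are omitted; the shapes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def referring_cross_attention(identity, q, k, v):
    """RCA(ID, Q, K, V) = ID + MHA(Q, K, V), reduced to one head.

    identity: (N, d) reference queries from the previous frame (ID)
    q:        (N, d) queries
    k, v:     (M, d) keys/values from the current frame's segmenter output
    """
    d = q.shape[-1]
    # Scaled dot-product attention standing in for full multi-head attention.
    attn = softmax(q @ k.T / np.sqrt(d))
    # Residual path: the reference identity is preserved and only "denoised"
    # by what attention reads from the current frame.
    return identity + attn @ v
```

The residual term is the point: treating tracking as reconstruction of the previous frame's identities keeps instance IDs stable even when the current frame's features are noisy.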
DVIS can operate in both online and offline modes, requires only ~5% extra computation over the segmenter, and works for VIS, video semantic segmentation (VSS), and video panoptic segmentation (VPS).
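The refinement step, described above as 1‑D convolutions plus self‑attention over aligned queries, can be sketched on a single object's query track (a toy version with an assumed channel‑shared kernel, not the paper's exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_refine(queries, kernel):
    """Toy sketch of the Temporal Refiner on one object's query track.

    queries: (T, d) identity-aligned queries for one object across T frames
    kernel:  (k,) 1-D convolution weights, shared across channels (assumed)
    """
    T, d = queries.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(queries, ((pad, pad), (0, 0)))
    # Short-range temporal modeling: 1-D convolution over the time axis.
    conv = np.stack([(xp[t:t + k] * kernel[:, None]).sum(axis=0)
                     for t in range(T)])
    # Long-range temporal modeling: self-attention across all T frames.
    attn = softmax(conv @ conv.T / np.sqrt(d))
    return attn @ conv
```

Because each object's track is refined with full-video context, this stage is what distinguishes the offline mode from the online one.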
Results
DVIS achieves SOTA on OVIS, YouTube‑VIS (2019, 2021), and VIPSeg, maintaining top performance since February 2023. Ablation studies on OVIS show the Referring Tracker improves AP by 5.2 on moderately occluded objects and 4.3 on heavily occluded ones, while the Temporal Refiner adds 2.4 AP (light occlusion), 1.8 AP (moderate), and 5.1 AP (heavy).
The combined computational cost of the Referring Tracker and Temporal Refiner is less than 5% of the segmenter's cost. Results across all evaluated datasets and tasks confirm DVIS's advantage.
Conclusion
DVIS presents a flexible, decoupled approach that sets new SOTA across multiple video segmentation tasks and demonstrates strong generality, with potential applications in content creation, user growth, and foundational tools at Kuaishou.