DVIS: Decoupled Framework that Sets New SOTA in Video Instance Segmentation
DVIS introduces a decoupled video instance segmentation framework that splits the task into segmentation, tracking, and refinement modules. It achieves state-of-the-art performance across VIS, VPS, and VSS benchmarks with low computational overhead, and it is robust in both online and offline settings.
Abstract
Video segmentation extends image segmentation to simultaneously segment, detect, and track all objects in a video, offering temporally stable and accurate results crucial for editing, autonomous driving, and surveillance.
Background
Transformer‑based models such as DETR have advanced object detection and instance segmentation. VisTR applied Transformers to video instance segmentation (VIS), making Transformer methods mainstream in this field.
Online vs Offline Methods
Online methods use current and past frames for real‑time tasks like autonomous driving, while offline methods can leverage any frame for tasks like video editing.
Limitations of Existing SOTA
Current online SOTA methods (e.g., MinVIS, IDOL) perform image segmentation first and then associate instances frame by frame, failing to exploit video context. Offline SOTA methods (e.g., SeqFormer, Mask2Former‑VIS, VITA, IFC) use tightly coupled end‑to‑end networks but struggle with long videos and heavy occlusions, as shown by tracking errors in Mask2Former‑VIS.
Proposed DVIS Framework
DVIS decouples VIS into three sub‑tasks: segmentation, tracking, and temporal refinement, implemented by three modules – a segmenter, a Referring Tracker, and a Temporal Refiner.
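As a rough sketch of how the three modules compose, the decoupled flow might look like the following (the interfaces are hypothetical placeholders; in DVIS each stage is a neural network):

```python
def dvis_pipeline(frames, segmenter, tracker, refiner, offline=True):
    """Schematic of DVIS's decoupled flow (hypothetical interfaces).

    segmenter: frame -> object queries (frame-independent)
    tracker:   (reference, current queries) -> identity-aligned queries
    refiner:   list of aligned queries -> temporally refined queries
    """
    # 1) Segmenter: per-frame object representations, no temporal context.
    per_frame = [segmenter(f) for f in frames]

    # 2) Referring Tracker: align each frame's queries to the previous
    #    frame's identities (reference denoising).
    aligned = []
    reference = per_frame[0]
    for queries in per_frame:
        reference = tracker(reference, queries)
        aligned.append(reference)

    # 3) Temporal Refiner: needs the whole video, so it only runs in
    #    offline mode; online mode stops after tracking.
    return refiner(aligned) if offline else aligned
```

The decoupling is what makes the dual mode possible: dropping the refiner degrades gracefully to an online tracker rather than breaking the pipeline.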
The segmenter adopts Mask2Former to extract object features per frame. The Referring Tracker treats inter‑frame association as a reference denoising/reconstruction problem, using Referring Cross Attention:
RCA(ID, Q, K, V) = ID + MHA(Q, K, V)

The Temporal Refiner aggregates aligned object queries with 1‑D convolutions and self‑attention to refine masks and tracks.
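A minimal single‑head NumPy sketch of this residual attention (head splitting and learned projections are omitted; the shapes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def referring_cross_attention(identity, q, k, v):
    """RCA(ID, Q, K, V) = ID + MHA(Q, K, V), reduced to one head.

    identity: (N, d) reference queries from the previous frame (ID)
    q:        (N, d) queries
    k, v:     (M, d) keys/values from the current frame's segmenter output
    """
    d = q.shape[-1]
    # Scaled dot-product attention standing in for full multi-head attention.
    attn = softmax(q @ k.T / np.sqrt(d))
    # Residual path: the reference identity is preserved and only "denoised"
    # by what attention reads from the current frame.
    return identity + attn @ v
```

The residual term is the point: treating tracking as reconstruction of the previous frame's identities keeps instance IDs stable even when the current frame's features are noisy.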
DVIS can operate in both online and offline modes, requires only ~5% extra computation over the segmenter, and works for VIS, video semantic segmentation (VSS), and video panoptic segmentation (VPS).
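The refinement step, described above as 1‑D convolutions plus self‑attention over aligned queries, can be sketched on a single object's query track (a toy version with an assumed channel‑shared kernel, not the paper's exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_refine(queries, kernel):
    """Toy sketch of the Temporal Refiner on one object's query track.

    queries: (T, d) identity-aligned queries for one object across T frames
    kernel:  (k,) 1-D convolution weights, shared across channels (assumed)
    """
    T, d = queries.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(queries, ((pad, pad), (0, 0)))
    # Short-range temporal modeling: 1-D convolution over the time axis.
    conv = np.stack([(xp[t:t + k] * kernel[:, None]).sum(axis=0)
                     for t in range(T)])
    # Long-range temporal modeling: self-attention across all T frames.
    attn = softmax(conv @ conv.T / np.sqrt(d))
    return attn @ conv
```

Because each object's track is refined with full-video context, this stage is what distinguishes the offline mode from the online one.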
Results
DVIS achieves SOTA on OVIS, YouTube‑VIS (2019, 2021), and VIPSeg, maintaining top performance since February 2023. Ablation studies on OVIS show the Referring Tracker improves AP by 5.2 on moderately occluded objects and 4.3 on heavily occluded ones, while the Temporal Refiner adds 2.4 AP (light occlusion), 1.8 AP (moderate), and 5.1 AP (heavy).
The combined computational cost of the Referring Tracker and Temporal Refiner is less than 5% of the segmenter's cost. Results across all evaluated datasets and tasks confirm DVIS's advantage.
Conclusion
DVIS presents a flexible, decoupled approach that sets new SOTA across multiple video segmentation tasks and demonstrates strong generality, with potential applications in content creation, user growth, and foundational tools at Kuaishou.