
How DeViT Revolutionizes Video Inpainting with Deformed Vision Transformers

The article introduces DeViT, a novel Deformed Vision Transformer framework for video inpainting that leverages a deformable patch homography estimator, mask‑pruned attention, and spatio‑temporal weight adaptation, achieving state‑of‑the‑art results on benchmark datasets and highlighting its potential for advanced video editing tools.

Kuaishou Audio & Video Technology

Sep 29, 2022

Background

Video inpainting is essential for tasks such as object removal, damage restoration, and AR integration, yet it remains less explored than image inpainting. The paper “DeViT: Deformed Vision Transformers in Video Inpainting,” accepted at ACM MM 2022, presents a new Transformer‑based approach to the problem.

Method

The proposed DeViT framework consists of three key components:

Deformed Patch Homography Estimator (DePtH): Learns offset vectors without extra supervision, so that patch features can be aligned precisely even under challenging motion and deformation.

Mask‑Pruned Patch Attention (MPPA): Uses mask‑guided pruning to reduce the influence of invalid pixels, which improves attention matching between deformed patches.

Spatio‑Temporal Weight Adapter (STA): Dynamically allocates attention weights between spatial and temporal information according to the type of motion in the video, improving performance across diverse video scenarios.
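To make the last two components more concrete, here is a minimal pure‑Python sketch of the general idea, not the paper's implementation. It assumes pruning is done by dropping reference patches whose share of valid (unmasked) pixels falls below a threshold, and that spatial and temporal outputs are mixed by a sigmoid gate driven by a scalar motion score. The names `mask_pruned_attention`, `blend_spatio_temporal`, `valid_ratio`, `keep_threshold`, and `motion_score` are all hypothetical, as is the direction in which the gate shifts weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mask_pruned_attention(query, keys, values, valid_ratio, keep_threshold=0.5):
    """Attend from one query patch to reference patches, skipping mostly-masked ones.

    valid_ratio[i] is the fraction of known (unmasked) pixels in patch i.
    keep_threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    # Prune: keep only patches that contain enough valid pixels.
    kept = [i for i, r in enumerate(valid_ratio) if r >= keep_threshold]
    if not kept:
        raise ValueError("every reference patch was pruned")
    # Scaled dot-product scores against the surviving patches only.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in kept]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]

def blend_spatio_temporal(spatial_out, temporal_out, motion_score):
    """Mix spatial and temporal attention outputs with a sigmoid gate.

    motion_score is an assumed scalar; in this sketch a larger value shifts
    weight toward the spatial (same-frame) output.
    """
    gate = 1.0 / (1.0 + math.exp(-motion_score))
    return [gate * s + (1.0 - gate) * t for s, t in zip(spatial_out, temporal_out)]

# Toy example: the third reference patch is mostly hole, so it is pruned.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-5.0, -5.0]]
valid_ratio = [0.9, 0.8, 0.1]
out = mask_pruned_attention(query, keys, values, valid_ratio)
mixed = blend_spatio_temporal([2.0, 4.0], [0.0, 0.0], motion_score=0.0)
```

In the toy example the pruned patch would otherwise have matched the query as strongly as the first patch and pulled the output toward its negative values. Dropping it before the softmax keeps the attention weights on patches that carry real content.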

Training Objective

The model is optimized with a pixel‑wise reconstruction loss combined with an adversarial loss to improve visual fidelity.
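As a rough illustration of how such a combined objective is usually assembled, the sketch below adds an L1 reconstruction term to a hinge‑style generator adversarial term. The article only states that a pixel‑wise reconstruction loss and an adversarial loss are combined, so the L1 choice, the hinge form, the weight `adv_weight`, and all function names here are assumptions.

```python
def l1_reconstruction_loss(pred, target):
    """Mean absolute error over all pixels (flat lists of floats)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def generator_adversarial_loss(disc_logits_on_fake):
    """Hinge-style generator term: lower when the discriminator scores fakes higher.

    The hinge form is an assumption; the article does not name the GAN loss.
    """
    return -sum(disc_logits_on_fake) / len(disc_logits_on_fake)

def total_generator_loss(pred, target, disc_logits_on_fake, adv_weight=0.01):
    """Reconstruction term plus a weighted adversarial term.

    adv_weight is a hypothetical balancing factor, not a value from the paper.
    """
    rec = l1_reconstruction_loss(pred, target)
    adv = generator_adversarial_loss(disc_logits_on_fake)
    return rec + adv_weight * adv

# Toy example with two pixels and two discriminator outputs.
loss = total_generator_loss([0.5, 1.0], [0.0, 1.0], [2.0, -1.0], adv_weight=0.5)
```

The reconstruction term anchors the output to the ground truth, while the adversarial term rewards results the discriminator finds realistic; the weight controls how much the second is allowed to trade against the first.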

Experimental Results

DeViT was evaluated on two large public multimedia datasets across four metrics, outperforming current state‑of‑the‑art methods. Experiments categorized videos by motion type (static, slow translation, complex deformation) and demonstrated the effectiveness of DePtH, MPPA, and STA in each scenario.

Conclusion

DeViT provides a comprehensive framework for precise patch alignment and matching in video tasks, integrating DePtH for deformation handling, MPPA for refined attention, and STA for adaptive spatio‑temporal weighting, achieving superior quantitative and qualitative results in video inpainting.
