
How DeViT Revolutionizes Video Inpainting with Deformed Vision Transformers

This article introduces DeViT, a Deformed Vision Transformer framework for video inpainting. It combines a deformed patch homography estimator, mask‑pruned patch attention, and a spatio‑temporal weight adapter, and it achieves state‑of‑the‑art results on benchmark datasets, which points to its potential for advanced video editing tools.

Kuaishou Audio & Video Technology

Background

Video inpainting is essential for tasks such as object removal, damage restoration, and AR integration, yet it remains less explored than image inpainting. The ACM MM 2022 conference accepted a paper titled “DeViT: Deformed Vision Transformers in Video Inpainting,” presenting a new Transformer‑based approach.

Method

The proposed DeViT framework consists of three key components:

Deformed Patch Homography Estimator (DePtH): A patch‑level homography estimator that learns offset vectors without extra supervision, so patch features can be aligned precisely even under challenging motion and deformation.

Mask‑Pruned Patch Attention (MPPA): Uses mask‑guided pruning to reduce the influence of invalid pixels, which improves attention matching between deformed patches.

Spatio‑Temporal Weight Adapter (STA): Dynamically allocates attention weight between spatial and temporal information according to the video's motion type, which improves performance across diverse video scenarios.
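
To make the attention‑related components more concrete, here is a minimal pure‑Python sketch of two ideas: pruning invalid patches out of a softmax attention step, and blending a spatial and a temporal result with a single gate. It is not the paper's implementation. The function names, the per‑patch validity flag, and the scalar gate are all assumptions made for illustration; this summary does not say how DeViT computes its pruning rule or its adaptive weights.

```python
import math


def mask_pruned_attention(query, keys, values, valid):
    """Attend from one query patch to candidate patches, skipping invalid ones.

    query:  list[float], feature of the patch being filled
    keys:   list[list[float]], features of the candidate patches
    values: list[list[float]], content carried by each candidate patch
    valid:  list[bool], False where a candidate lies inside the hole (assumed flag)
    """
    if not any(valid):
        raise ValueError("at least one valid patch is required")
    scale = math.sqrt(len(query))
    # Scaled dot-product score, computed only for valid patches.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if ok else None
        for key, ok in zip(keys, valid)
    ]
    top = max(s for s in scores if s is not None)
    # Softmax over the surviving patches; pruned patches get weight 0.
    exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


def blend_spatial_temporal(spatial_out, temporal_out, gate):
    """Convex blend of two feature vectors; gate=1.0 keeps only the spatial one.

    The scalar gate is a hypothetical stand-in for a learned, motion-dependent weight.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * s + (1.0 - gate) * t for s, t in zip(spatial_out, temporal_out)]


# Toy usage: the third candidate patch lies inside the hole and is pruned.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[10.0], [20.0], [99.0]]
valid = [True, True, False]
output, weights = mask_pruned_attention(query, keys, values, valid)
print(weights)  # the last weight is exactly 0.0
```

In the toy example the third patch would have the highest raw score, yet it contributes nothing because it is pruned before the softmax. A real model would presumably predict the gate from motion cues instead of taking it as an argument.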

Training Objective

The model is optimized with a pixel‑wise reconstruction loss combined with an adversarial loss to improve visual fidelity.
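
As a rough illustration of how such a combined objective is usually assembled, the sketch below adds a pixel‑wise term to a weighted adversarial term. The specific choices are assumptions, not details from the paper: an L1 (mean absolute error) reconstruction term, a hinge‑style generator term, and an `adv_weight` of 0.01. All function names are hypothetical.

```python
def l1_reconstruction_loss(pred, target):
    """Mean absolute error over all pixels, given as flat lists of floats."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def generator_adversarial_loss(disc_scores_on_fake):
    """Hinge-style generator term (assumed form): raise discriminator scores on fakes."""
    if not disc_scores_on_fake:
        raise ValueError("at least one discriminator score is required")
    return -sum(disc_scores_on_fake) / len(disc_scores_on_fake)


def total_generator_loss(pred, target, disc_scores_on_fake, adv_weight=0.01):
    """Reconstruction loss plus a weighted adversarial loss (weight is illustrative)."""
    rec = l1_reconstruction_loss(pred, target)
    adv = generator_adversarial_loss(disc_scores_on_fake)
    return rec + adv_weight * adv


# Toy usage with two pixels and two discriminator scores.
loss = total_generator_loss([0.5, 0.5], [0.0, 1.0], [-2.0, 0.0])
print(loss)
```

The reconstruction term keeps the filled region close to the ground truth pixel by pixel, while the adversarial term rewards outputs the discriminator scores as realistic. The weight between the two is a tuning choice.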

Experimental Results

DeViT was evaluated on two large public multimedia datasets across four metrics, outperforming current state‑of‑the‑art methods. Experiments categorized videos by motion type (static, slow translation, complex deformation) and demonstrated the effectiveness of DePtH, MPPA, and STA in each scenario.

Conclusion

DeViT provides a framework for precise patch alignment and matching in video tasks. It integrates DePtH for handling deformation, MPPA for refined attention, and STA for adaptive spatio‑temporal weighting, and it achieves superior quantitative and qualitative results in video inpainting.
