
Two-Stage Video Restoration Framework for NTIRE 2022 Video Enhancement Challenge

TaoMC2 is a two‑stage pipeline: Stage I augments BasicVSR++ with Peak Quality Frames and a deeper stack of residual blocks, and Stage II refines the result with a SwinIR transformer. Progressive training and transfer learning push PSNR to 33.16 dB, earning the framework two championship titles and one runner‑up finish in the NTIRE 2022 video enhancement challenge.

DaTaobao Tech

In the NTIRE 2022 challenge on video super‑resolution and quality enhancement, the TaoVideo team from Alibaba's Taobao video‑enhancement group won two championship titles (Track 1: video enhancement; Track 2: 2× super‑resolution) and one runner‑up finish (Track 3: 4× super‑resolution).

The challenge provides a benchmark for compressed video enhancement and super‑resolution using diverse scenes (animals, cities, indoor, parks, etc.) with high‑quality 4K source videos. Three tracks are defined: Track 1 focuses on restoring heavily compressed video (e.g., HEVC), Track 2 adds a 2× up‑sampling requirement, and Track 3 further extends to 4× up‑sampling.

Our proposed solution, named TaoMC2, is a two‑stage network. Stage I is based on BasicVSR++ with several modifications: (1) the second‑order compensation is replaced with Peak Quality Frames (PQF), (2) the reconstruction module is deepened from 5 to 55 residual blocks, and (3) progressive training grows the model incrementally through 5, 15, 25, 35, 45, and 55 residual blocks. Stage II employs SwinIR, a state‑of‑the‑art image restoration transformer, to further remove compression artifacts and refine the output of Stage I. The two stages are cascaded: compressed frames → Stage I → intermediate results → Stage II → final enhanced video.
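The progressive‑deepening idea can be sketched in miniature. This is not the authors' code: the real reconstruction module uses 2‑D convolutions, whereas here each residual block is a small linear map so the growth schedule (5 → 15 → … → 55 blocks, reusing already‑trained weights) stays self‑contained.

```python
import numpy as np

def residual_block(x, w):
    """One toy residual block: a nonlinear map plus an identity skip connection."""
    return x + np.tanh(x @ w)

class ReconstructionModule:
    """Illustrative stand-in for the Stage I reconstruction trunk."""

    def __init__(self, dim, n_blocks):
        self.dim = dim
        # Small initial weights keep each new block close to the identity.
        self.weights = [0.01 * np.random.randn(dim, dim) for _ in range(n_blocks)]

    def grow(self, extra):
        """Progressive training: append new near-identity blocks while keeping
        the already-trained ones, following the 5 -> 15 -> ... -> 55 schedule."""
        self.weights += [0.01 * np.random.randn(self.dim, self.dim)
                         for _ in range(extra)]

    def forward(self, x):
        for w in self.weights:
            x = residual_block(x, w)
        return x

# Start at 5 blocks, then grow by 10 blocks five times, ending at 55.
module = ReconstructionModule(dim=8, n_blocks=5)
for _ in range(5):
    module.grow(10)
```

Growing the trunk rather than training a 55‑block model from scratch means each stage starts from a model that already restores frames reasonably well, which is what makes the schedule cheaper than full retraining.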

Training details: Stage I starts from the open‑source BasicVSR++ weights, fine‑tuned with Charbonnier loss for 300 k iterations using Adam and a cosine‑annealing schedule with warm‑up. Progressive training gradually adds residual blocks. A final fine‑tuning with MSE loss for 100 k iterations follows. Stage II is initialized from the SwinIR denoising model, then fine‑tuned on the combined NTIRE LDV dataset (240 qHD sequences) and a self‑collected YouTube 4K dataset (870 videos). Transfer learning and progressive training reduce training time and improve performance.
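The two training ingredients above, Charbonnier loss and cosine annealing with warm‑up, are standard and easy to state concretely. A minimal sketch follows; the warm‑up length and base learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, robust variant of L1,
    mean(sqrt((pred - target)^2 + eps^2))."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def lr_schedule(step, total_steps=300_000, warmup=5_000, base_lr=1e-4):
    """Cosine annealing with linear warm-up.

    The 300 k total iterations match the Stage I fine-tuning budget;
    warmup and base_lr here are assumed values for illustration.
    """
    if step < warmup:
        return base_lr * step / warmup          # linear ramp from 0 to base_lr
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))  # cosine decay to 0
```

The Charbonnier loss behaves like L1 for large errors (robust to outliers such as compression blocking) but is differentiable at zero, which is why it is widely preferred over plain MSE for the bulk of restoration training, with MSE reserved for the final PSNR‑targeted fine‑tuning.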

Experiments were conducted on NVIDIA V100 GPUs (4‑card setup). Quantitative results (PSNR) show that Stage I alone improves the baseline by 0.326 dB, while Stage II adds another 0.11 dB, reaching 33.16 dB on the offline validation set. Test‑time augmentation (8‑fold flips/rotations) and model ensembling further boost PSNR by ~0.13 dB. Ablation studies confirm the benefits of progressive training, transfer learning, and the two‑stage design.
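The 8‑fold test‑time augmentation mentioned above enumerates the four 90° rotations of each frame and their horizontal flips, runs the model on each variant, inverts the transform on the output, and averages. A minimal per‑frame sketch, assuming the model maps an array to an array of the same shape:

```python
import numpy as np

def tta_8fold(model, frame):
    """8-fold test-time augmentation: average the model's outputs over the
    4 rotations of the frame and their horizontally flipped versions."""
    outputs = []
    for k in range(4):                  # 0/90/180/270-degree rotations
        for flip in (False, True):
            aug = np.rot90(frame, k)
            if flip:
                aug = np.fliplr(aug)
            out = model(aug)
            # Undo the transforms in reverse order to realign the output.
            if flip:
                out = np.fliplr(out)
            out = np.rot90(out, -k)
            outputs.append(out)
    return np.mean(outputs, axis=0)
```

Averaging eight realigned predictions cancels orientation‑dependent errors, which is where the reported ~0.13 dB gain (together with model ensembling) comes from; the cost is eight forward passes per frame, acceptable for a challenge submission but rarely for deployment.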

Subjective visual comparisons demonstrate that the method restores fine details, reduces motion blur, and sharpens object edges in compressed videos. The approach achieved two championships and one runner‑up in the NTIRE 2022 challenge, demonstrating its effectiveness for real‑world video enhancement tasks.

References to related work on video super‑resolution, compression artifact reduction, and vision transformers are provided, covering methods such as BasicVSR, BasicVSR++, SwinIR, EDVR, VRT, and others.

Tags: deep learning, video restoration, compression artifact removal, NTIRE 2022, super-resolution, two-stage network
Written by DaTaobao Tech, the official account of DaTaobao Technology.