
Ultrafast Video Attention Prediction with Coupled Knowledge Distillation

The paper presents UVA‑Net, a lightweight video‑attention network trained via coupled knowledge distillation, which matches the accuracy of eleven state‑of‑the‑art models while using only 0.68 MB of storage and achieving up to 10,106 FPS on GPU (404 FPS on CPU), thanks to a MobileNetV2‑based CA‑Res block and a teacher‑student framework that leverages low‑resolution inputs to drastically cut parameters and computational cost.

iQIYI Technical Product Team

This paper introduces UVA-Net, a lightweight network for video attention prediction, together with a coupled knowledge distillation training method. The proposed approach achieves performance comparable to 11 state-of-the-art models while requiring only 0.68 MB of storage. It runs at 10,106 FPS on GPU and 404 FPS on CPU, a 206× speedup over previous models.

The paper addresses two key challenges in video saliency detection: reducing computational and storage requirements while maintaining processing efficiency, and extracting effective spatiotemporal joint features without accuracy degradation. To tackle these issues, the authors propose a lightweight video saliency detection method using coupled knowledge distillation.

The authors introduce a CA-Res block structure based on MobileNetV2, which significantly improves computational efficiency while maintaining accuracy. The coupled knowledge distillation approach uses low-resolution video frames as input to reduce computational load, then employs complex temporal and spatial networks as teacher models to supervise the training of a simpler spatiotemporal student model, dramatically reducing parameter count and storage requirements.
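To make the teacher-student setup concrete, here is a minimal PyTorch sketch of the idea: a MobileNetV2-style inverted-residual block stands in for the paper's CA-Res block, and a small spatiotemporal student trained on low-resolution frames is supervised jointly by a spatial teacher, a temporal teacher, and the ground-truth saliency map. All layer sizes, loss weights, and the use of MSE terms are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvertedResidual(nn.Module):
    """MobileNetV2-style block (expand -> depthwise -> project),
    standing in here for the paper's CA-Res block."""
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.body = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False),                # pointwise expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),                # depthwise conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False),                # pointwise project
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection

def coupled_distillation_loss(student_map, spatial_t, temporal_t, gt,
                              w_s=0.5, w_t=0.5, w_gt=1.0):
    """Pull the student's saliency map toward both teachers and the label.
    The weights are hypothetical placeholders."""
    return (w_s * F.mse_loss(student_map, spatial_t)
            + w_t * F.mse_loss(student_map, temporal_t)
            + w_gt * F.mse_loss(student_map, gt))

# Tiny usage demo on random 32x32 inputs (the low-resolution setting).
# Two stacked RGB frames give the student a 6-channel spatiotemporal input.
student = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1),
                        InvertedResidual(16),
                        nn.Conv2d(16, 1, 1), nn.Sigmoid())
frames = torch.rand(2, 6, 32, 32)
pred = student(frames)
loss = coupled_distillation_loss(pred, torch.rand_like(pred),
                                 torch.rand_like(pred), torch.rand_like(pred))
loss.backward()
```

In real training the teacher outputs would come from frozen, full-size spatial and temporal networks rather than random tensors; only the student's parameters receive gradients, which is what lets the deployed model stay small.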

Experimental results on the AVS1K dataset show that UVA-DVA-64 achieves performance comparable to high-performance models with only 2.73M parameters at 404.3 FPS, while UVA-DVA-32, though slightly less accurate, requires only 0.68M parameters and reaches 10,106 FPS.

The proposed ultrafast video saliency detection algorithm demonstrates accuracy comparable to 11 state-of-the-art methods, effectively addressing the problems of insufficient model generalization and the difficulty of combining temporal and spatial cues. The technology has been applied to iQIYI products such as image-based drama search and intelligent video creation, where saliency ROI detection significantly aids in understanding image and video content.

Tags: Knowledge Distillation, lightweight neural networks, Mobile Video Processing, saliency detection, spatiotemporal feature extraction, UVA-Net, video attention prediction
Written by

iQIYI Technical Product Team

The technical product team of iQIYI