Multimodal Video Quality Assessment Models for Short Video Platforms
The paper presents an integrated multimodal quality assessment system for short‑video platforms. It evaluates cover images, video content, and accompanying text by combining deep‑learning and handcrafted features—ResNet‑50, NetVLAD, TSN, VGGish, and XGBoost—to improve user experience, recommendation accuracy, and operational efficiency. Planned follow‑up work covers feature optimization and modular deployment.
Short video information‑flow products dominate users' fragmented time, reaching over 600 million monthly active devices in 2018. The massive influx of user‑generated content leads to highly variable video quality, making large‑scale, accurate quality assessment essential for improving both user experience and recommendation algorithms.
The main low‑quality issues are summarized as:
1. Cover image quality: blur, black borders, distortion, darkness, lack of subject, or meaningless visuals.
2. Video content quality: meaningless, boring, unclear, screen‑tear, ads, vulgar content, etc.
3. Text quality: overly simple titles, excessive symbols, ungrammatical sentences, click‑bait, mismatch between text and image.
To address these problems, a comprehensive video quality model integrating text, image, content, and audio inputs was built, consisting of three sub‑models:
1. Cover Image Quality Model: combines deep features extracted by a convolutional network with handcrafted image features to evaluate cover quality.
2. Video Content Quality Model: an end‑to‑end multimodal deep model that processes visual frames, optical flow, and audio signals.
3. Text Quality Model: a classification model based on textual structure and semantic features, implemented with XGBoost.
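The text sub‑model can be illustrated with a minimal sketch: structural features of a title (length, symbol counts, click‑bait punctuation) fed into a gradient‑boosted classifier. The paper uses XGBoost; scikit‑learn's `GradientBoostingClassifier` stands in here, and the specific features and toy titles are illustrative assumptions, not the paper's actual feature set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost

def text_features(title: str) -> list:
    """Illustrative structural features; the real model adds semantic ones."""
    symbols = sum((not ch.isalnum()) and (not ch.isspace()) for ch in title)
    return [
        len(title),                           # title length (too short = low quality)
        symbols,                              # raw symbol count
        symbols / max(len(title), 1),         # symbol ratio (excessive punctuation)
        float("!" in title or "?" in title),  # click-bait punctuation flag
    ]

# Toy training data: 1 = acceptable title, 0 = low quality.
titles = [
    "How to cook perfect rice at home",
    "!!!WOW!!! you WON'T believe THIS?!?!",
    "Morning yoga routine for beginners",
    "$$$ click here $$$",
]
labels = np.array([1, 0, 1, 0])

X = np.array([text_features(t) for t in titles])
clf = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
clf.fit(X, labels)
train_pred = clf.predict(X)
```

In production the same pattern scales to richer inputs (grammar checks, text–image consistency scores) without changing the classifier interface.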
The cover image model uses a deep‑and‑wide architecture. Deep features are obtained from ResNet‑50 (mid‑level layers) and further refined with batch‑normalized hidden layers. Wide features include handcrafted low‑level descriptors (edge distribution, color statistics, blur kernels) and high‑level aesthetic features from the Google NIMA model. Feature fusion is performed via Compact Bilinear Pooling (CBP) to capture interactions between deep and wide representations.
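The deep–wide fusion step can be sketched with the standard Tensor Sketch approximation of Compact Bilinear Pooling: each feature vector is hashed into a common dimension via a count sketch, and the element‑wise product of their FFTs approximates a sketch of the full outer product. The dimensions below (2048‑d deep, 64‑d wide, 512‑d output) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d dims: index i of x lands in bucket h[i] with sign s[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear_pool(u, v, d=512, seed=0):
    """Tensor Sketch approximation of the outer product u ⊗ v (CBP fusion)."""
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, d, size=u.shape[0])       # hash buckets for u
    h2 = rng.integers(0, d, size=v.shape[0])       # hash buckets for v
    s1 = rng.choice([-1.0, 1.0], size=u.shape[0])  # random signs for u
    s2 = rng.choice([-1.0, 1.0], size=v.shape[0])  # random signs for v
    su = count_sketch(u, h1, s1, d)
    sv = count_sketch(v, h2, s2, d)
    # Circular convolution of the two sketches == sketch of the outer product.
    return np.real(np.fft.ifft(np.fft.fft(su) * np.fft.fft(sv)))

rng = np.random.default_rng(1)
deep = rng.normal(size=2048)   # e.g. mid-level ResNet-50 features (assumed size)
wide = rng.normal(size=64)     # handcrafted + NIMA aesthetic features (assumed size)
fused = compact_bilinear_pool(deep, wide, d=512)
```

The appeal of CBP here is that it captures pairwise interactions between every deep and wide feature while keeping the fused vector small enough to feed a downstream classifier.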
For video content, the model adopts NetVLAD for frame‑level visual aggregation and Temporal Segment Network (TSN) for motion representation via optical flow. Audio features are extracted with a pretrained VGGish network (128‑dimensional embeddings per frame). A multi‑branch end‑to‑end network combines NetVLAD, TSN, and audio DNN outputs, with self‑attention applied to frame‑level features to weight important segments.
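The self‑attention step over frame‑level features can be sketched as a learned scoring vector that produces a softmax weight per frame, so informative segments dominate the pooled clip‑level representation. The frame count, feature size, and single scoring vector below are simplifying assumptions; the paper does not specify its attention parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(frames, w):
    """Pool T frame embeddings (T, D) into one clip vector via attention."""
    scores = frames @ w             # (T,) importance score per frame
    alpha = softmax(scores)         # attention weights, sum to 1
    return alpha @ frames, alpha    # (D,) weighted clip-level feature

rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 128))  # e.g. 30 per-frame VGGish/NetVLAD vectors
w = rng.normal(size=128)             # learned scoring vector (assumed form)
clip_feat, alpha = attend(frames, w)
```

The same pooling applies to each branch (visual, motion, audio); the weighted outputs are then concatenated for the final quality head.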
Application scenarios include:
Uploader prompts: quality scores guide users to improve cover images before upload.
Corpus entry/exit control: low‑quality videos are filtered out in real time, reducing manual review costs.
Boosting high‑quality video exposure: quality scores are incorporated into recall and ranking models, improving retention metrics.
Future work focuses on three directions:
1. Feature extraction optimization: reduce computational cost and shift toward deeper learned features.
2. Algorithm model optimization: develop shared multimodal representations and end‑to‑end multi‑task learning across text, image, and video.
3. Adaptive business scenarios: build modular quality sub‑models that can be combined according to specific product needs.
iQIYI Technical Product Team