Multimodal Video Classification: Image Feature Improvements and System Insights
The talk presents Alibaba's hierarchical video-category system and a multimodal classification pipeline built on EfficientNet, NeXtVLAD fusion, attention-dropping augmentation, and MoCo contrastive learning. Together these lift cold-start recall PVCTR by 43%, improve program classification by over 20%, and set the stage for larger models and more advanced unsupervised methods.
Speaker: Jiang Xiaosen, Alibaba Entertainment Algorithm Expert (source: Alibaba Entertainment Tech Night Talk #13).
Overview: A hierarchical category system is crucial for video-platform operations and recommendation cold-start. The system, built jointly by the category-construction, operations, and review teams, provides first- and second-level categories that support operational content selection, inventory management, data analysis, and cold-start recommendation.
Business Value of the Category System
Operations – facilitates inventory management, warehousing, efficiency improvement, and quality analysis.
Recommendation – improves cold‑start recall by using second‑level categories to match new videos with user profiles.
Search – enhances relevance by predicting categories from query keywords.
Construction Process and Results
The process iterates between defining coarse standards, training annotators, labeling samples, and continuously refining the standards. The resulting three‑level taxonomy is widely applied, yielding a 43% PVCTR increase for cold‑start recall and notable gains in search accuracy.
Program Classification
Program classification (short-to-long association) identifies the source program of a video fragment, providing fine-grained signals for both operations and recommendation.
Multimodal Video Classification Algorithm
The pipeline consists of four stages:
1. Multimodal embedding – extract features from video frames, text, and audio.
2. NeXtVLAD fusion – aggregate frame-level features into a single video-level feature.
3. Gating network – emphasize useful feature dimensions and suppress irrelevant ones.
4. Classification head – a Mixture-of-Experts layer outputs the final prediction.
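As a rough illustration, the fusion, gating, and MoE stages can be sketched in NumPy. All shapes, cluster counts, and random weights below are arbitrary placeholders, and the NeXtVLAD step is simplified to plain VLAD-style soft assignment (without the grouping and dimension-expansion tricks of the full model):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vlad_aggregate(frames, clusters, assign_w):
    """VLAD-style aggregation: soft-assign each frame to K clusters
    and sum the residuals (frame minus cluster centre) over time."""
    a = softmax(frames @ assign_w)                              # (T, K)
    vlad = np.einsum('tk,tkd->kd', a,
                     frames[:, None, :] - clusters[None])       # (K, D)
    v = vlad.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-8)                       # normalise

def context_gating(x, w, b):
    """Gating network: a sigmoid gate re-weights feature dimensions,
    emphasising useful ones and suppressing irrelevant ones."""
    g = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return x * g

def moe_head(x, expert_w, gate_w):
    """Mixture-of-Experts classifier: per-expert class distributions
    combined by a softmax gate."""
    gates = softmax(x @ gate_w)                       # (E,)
    logits = np.einsum('d,edc->ec', x, expert_w)      # (E, C)
    return gates @ softmax(logits, axis=-1)           # (C,)

T, D, K, E, C = 16, 32, 4, 3, 5                       # toy sizes
frames = rng.normal(size=(T, D))                      # frame-level features
video_vec = vlad_aggregate(frames, rng.normal(size=(K, D)),
                           rng.normal(size=(D, K)))   # (K*D,) video feature
gated = context_gating(video_vec,
                       0.1 * rng.normal(size=(K * D, K * D)),
                       np.zeros(K * D))
probs = moe_head(gated, rng.normal(size=(E, K * D, C)),
                 rng.normal(size=(K * D, E)))
print(probs.shape)
```

The mixed output is a proper class distribution because each expert's softmax sums to 1 and the gate weights sum to 1.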
Feature Improvements
1. Backbone Upgrade – EfficientNet was selected for its superior accuracy‑efficiency trade‑off and strong fine‑grained representation capability.
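For context, EfficientNet's accuracy-efficiency trade-off comes from compound scaling: depth, width, and input resolution grow together under a single coefficient phi. A minimal sketch using the base coefficients reported in the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15, chosen so that alpha * beta^2 * gamma^2 is roughly 2, i.e. FLOPs roughly double per unit of phi):

```python
# Compound scaling coefficients from the EfficientNet paper.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale(phi, base_depth=1.0, base_width=1.0, base_res=224):
    """Scale depth, width, and input resolution together by phi."""
    return (base_depth * alpha ** phi,      # layer-count multiplier
            base_width * beta ** phi,       # channel-count multiplier
            round(base_res * gamma ** phi)) # input resolution in pixels

# phi = 0 recovers the B0 baseline; larger phi gives B1, B2, ...
print(scale(0), scale(1))
```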
2. Data Augmentation – Attention Dropping (erasing high‑activation regions) and Cropping (zooming informative regions) are applied to force the network to learn diverse cues.
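A minimal sketch of the two augmentations, assuming an attention map is already available per image (a random map stands in for a real one here):

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_drop(image, attn, drop_frac=0.3):
    """Attention dropping: erase the hottest region of the attention
    map so the network must learn cues beyond the most salient one."""
    thresh = np.quantile(attn, 1.0 - drop_frac)
    out = image.copy()
    out[attn >= thresh] = 0.0          # zero out most-attended pixels
    return out

def attention_crop(image, attn, keep_frac=0.5):
    """Attention cropping: zoom into the bounding box of the
    most-attended (informative) region."""
    thresh = np.quantile(attn, 1.0 - keep_frac)
    ys, xs = np.where(attn >= thresh)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

img = rng.random((64, 64))             # toy grayscale image
attn = rng.random((64, 64))            # placeholder attention map
dropped = attention_drop(img, attn)
crop = attention_crop(img, attn)
print(dropped.shape, crop.shape)
```

In practice the attention map would come from the backbone's activations rather than random noise, and the crop would be resized back to the input resolution.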
3. Training Methods – Both supervised classification and unsupervised instance‑level contrastive learning (MoCo) are explored. Unsupervised learning avoids the bias toward non‑essential cues (e.g., logos, black borders) that supervised training may capture.
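The MoCo idea can be sketched as a momentum-updated key encoder plus an InfoNCE loss over a queue of negatives; the sketch below uses random unit vectors in place of real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def info_nce(q, k_pos, queue, tau=0.07):
    """InfoNCE loss used by MoCo: each query must match its positive
    key against the negatives stored in the queue."""
    l_pos = (q * k_pos).sum(-1, keepdims=True)        # (B, 1) positives
    l_neg = q @ queue.T                               # (B, Q) negatives
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

def momentum_update(w_q, w_k, m=0.999):
    """Key-encoder weights track the query encoder by EMA."""
    return m * w_k + (1 - m) * w_q

B, D, Q = 8, 16, 64                                   # toy sizes
q = l2norm(rng.normal(size=(B, D)))                   # query embeddings
k_pos = l2norm(q + 0.05 * rng.normal(size=(B, D)))    # perturbed positives
queue = l2norm(rng.normal(size=(Q, D)))               # negative queue
loss = info_nce(q, k_pos, queue)
print(float(loss))
```

In the real setup, `q` and `k_pos` are two augmented views of the same video frame passed through the query and key encoders, and the queue is refreshed with each batch's keys.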
Experimental Results
Targeted data augmentation improves program classification performance by >20% on Youku data.
Better backbones and unsupervised learning yield large gains, especially for large‑scale category sets.
Future work: larger models and more advanced unsupervised methods (e.g., SimCLR) to sustain performance growth.
Discussion & Future Plans
Key topics from the Q&A include multimodal feature fusion, joint vs. separate training of modality‑specific models, sample balancing, hierarchical model updates, and few‑shot learning challenges.
Future directions involve exploring stronger unsupervised algorithms (SimCLR), improving multimodal fusion networks, and enhancing feature extraction to reduce reliance on manual labeling.