
Multimodal Video Classification: Image Feature Improvements and System Insights

The talk presents Alibaba’s hierarchical video‑category system and a multimodal classification pipeline—leveraging EfficientNet, NeXtVLAD fusion, attention‑dropping augmentation, and MoCo contrastive learning—that together boost cold‑start recall by 43%, improve program classification accuracy by more than 20%, and set the stage for larger models and more advanced unsupervised methods.

Youku Technology

Speaker: Jiang Xiaosen, Alibaba Entertainment Algorithm Expert (source: Alibaba Entertainment Tech Night Talk #13).

Overview: A hierarchical category system is crucial for video platform operations and recommendation cold‑start. Built jointly by the category‑construction, operations, and review teams, the system provides first‑ and second‑level categories that support content selection, inventory management, data analysis, and cold‑start recommendation.

Business Value of the Category System

Operations – facilitates inventory management, warehousing, efficiency improvement, and quality analysis.

Recommendation – improves cold‑start recall by using second‑level categories to match new videos with user profiles.

Search – enhances relevance by predicting categories from query keywords.

Construction Process and Results

The process iterates between defining coarse standards, training annotators, labeling samples, and continuously refining the standards. The resulting hierarchical taxonomy is widely applied, yielding a 43% PVCTR increase for cold‑start recall and notable gains in search accuracy.

Program Classification

Program classification (short‑to‑long video association) identifies the source program of a short video clip, providing fine‑grained value for both operations and recommendation.

Multimodal Video Classification Algorithm

The pipeline consists of four stages:

1. Multimodal embedding of video frames, text, and audio.

2. NeXtVLAD‑based fusion that aggregates frame‑level features into a single video‑level feature.

3. Gating network to emphasize useful dimensions and suppress irrelevant ones.

4. Classification head (Mixture‑of‑Experts) that outputs the final prediction.
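The three learned stages above can be sketched in PyTorch. This is a simplified, hypothetical reconstruction—the feature dimensions, cluster/group counts, and expert count are illustrative, not the talk's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeXtVLAD(nn.Module):
    """Simplified NeXtVLAD: expand, split into groups, soft-assign to
    clusters, and aggregate residuals over time into one video-level vector."""
    def __init__(self, dim, clusters=8, groups=4, expansion=2):
        super().__init__()
        self.groups, self.clusters = groups, clusters
        self.gdim = dim * expansion // groups            # per-group feature size
        self.expand = nn.Linear(dim, dim * expansion)
        self.assign = nn.Linear(dim * expansion, groups * clusters)
        self.attn = nn.Linear(dim * expansion, groups)   # per-group attention
        self.centers = nn.Parameter(torch.randn(clusters, self.gdim) * 0.01)

    def forward(self, x):                 # x: (B, T, dim) frame features
        B, T, _ = x.shape
        x = self.expand(x)                # (B, T, dim*expansion)
        a = torch.sigmoid(self.attn(x)).reshape(B, T, self.groups, 1)
        s = F.softmax(self.assign(x).reshape(B, T, self.groups, self.clusters), -1)
        s = s * a                         # attention-weighted soft assignment
        xg = x.reshape(B, T, self.groups, self.gdim)
        # residuals to cluster centers, summed over time and groups
        v = torch.einsum('btgk,btgd->bkd', s, xg) \
            - s.sum((1, 2)).unsqueeze(-1) * self.centers
        return F.normalize(v.flatten(1), dim=1)          # (B, clusters*gdim)

class ContextGating(nn.Module):
    """Gating network: a learned sigmoid mask re-weights feature dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return x * torch.sigmoid(self.fc(x))

class MoEHead(nn.Module):
    """Mixture-of-Experts classifier: per-class softmax gates mix
    sigmoid expert predictions (one gate column acts as a dummy expert)."""
    def __init__(self, dim, num_classes, experts=3):
        super().__init__()
        self.experts, self.classes = experts, num_classes
        self.gate = nn.Linear(dim, num_classes * (experts + 1))
        self.expert = nn.Linear(dim, num_classes * experts)

    def forward(self, x):
        g = F.softmax(self.gate(x).view(-1, self.classes, self.experts + 1),
                      -1)[..., :self.experts]
        e = torch.sigmoid(self.expert(x).view(-1, self.classes, self.experts))
        return (g * e).sum(-1)            # (B, num_classes) probabilities
```

Chained together, frame features of shape `(batch, time, dim)` become a single video-level vector, which is gated and then classified.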

Feature Improvements

1. Backbone Upgrade – EfficientNet was selected for its superior accuracy‑efficiency trade‑off and strong fine‑grained representation capability.

2. Data Augmentation – Attention Dropping (erasing high‑activation regions) and Cropping (zooming informative regions) are applied to force the network to learn diverse cues.
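One way to sketch these two augmentations, assuming an attention map is available from the backbone (function names and thresholds here are hypothetical, not from the talk):

```python
import torch
import torch.nn.functional as F

def attention_drop(img, attn, thresh=0.6):
    """Attention Dropping: zero out regions where the upsampled attention
    map exceeds thresh, forcing the model to find complementary cues.
    img: (C, H, W) image tensor; attn: (h, w) activation map."""
    attn = attn / attn.max().clamp(min=1e-6)
    mask = F.interpolate(attn[None, None], size=img.shape[1:],
                         mode='bilinear', align_corners=False)[0, 0]
    return img * (mask < thresh).float()

def attention_crop(img, attn, thresh=0.3):
    """Attention Cropping: crop the bounding box of high-attention regions
    and resize it back to the original size, zooming informative content."""
    attn = attn / attn.max().clamp(min=1e-6)
    mask = F.interpolate(attn[None, None], size=img.shape[1:],
                         mode='bilinear', align_corners=False)[0, 0] > thresh
    ys, xs = mask.nonzero(as_tuple=True)
    if len(ys) == 0:
        return img
    crop = img[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return F.interpolate(crop[None], size=img.shape[1:],
                         mode='bilinear', align_corners=False)[0]
```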

3. Training Methods – Both supervised classification and unsupervised instance‑level contrastive learning (MoCo) are explored. Unsupervised learning avoids the bias toward non‑essential cues (e.g., logos, black borders) that supervised training may capture.
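The core of MoCo is an InfoNCE loss over a queue of negatives plus a momentum-updated key encoder. A minimal sketch (queue management and the dual-encoder training loop are omitted):

```python
import torch
import torch.nn.functional as F

def moco_loss(q, k, queue, tau=0.07):
    """InfoNCE loss as used in MoCo.
    q: (B, D) query features; k: (B, D) momentum-encoder keys for the
    same instances; queue: (K, D) negatives stored from past batches."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(1, keepdim=True)            # (B, 1) positive logits
    l_neg = q @ F.normalize(queue, dim=1).t()       # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(q), dtype=torch.long)  # positive is at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Key encoder tracks the query encoder as an exponential moving average.
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)
```

Because the positive pair comes from two augmented views of the same instance, the features learned this way need no labels, avoiding the bias toward logos and borders mentioned above.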

Experimental Results

Targeted data augmentation improves program classification performance by >20% on Youku data.

Better backbones and unsupervised learning yield large gains, especially for large‑scale category sets.

Future work: larger models and more advanced unsupervised methods (e.g., SimCLR) to sustain performance growth.

Discussion & Future Plans

Key topics from the Q&A include multimodal feature fusion, joint vs. separate training of modality‑specific models, sample balancing, hierarchical model updates, and few‑shot learning challenges.

Future directions involve exploring stronger unsupervised algorithms (SimCLR), improving multimodal fusion networks, and enhancing feature extraction to reduce reliance on manual labeling.

Thank you for attending.

Tags: Feature Engineering, AI, Multimodal, Unsupervised Learning, Video Classification, EfficientNet