Short Video Content Tagging: Multimodal AI Model Framework and Applications
The framework tags short videos by fusing text, image and audio‑video features. Specialized extraction, classification, generative and retrieval modules propose candidate tags, which a multimodal BERT model then ranks, delivering accurate, business‑specific tags that improve recommendation, search and advertising.
With the rapid growth of user‑generated short videos, platforms need efficient ways to distribute and utilize this massive content. Tags that concisely describe video subjects are crucial for recommendation, search, advertising and other business scenarios.
Tag generation relies on multimodal metadata such as titles, descriptions, uploader profiles, visual frames and audio tracks. Effective algorithms must fuse these heterogeneous signals to capture the full semantics of a short video.
Key challenges include the lack of objective evaluation standards, varying annotation guidelines across business lines, the abstract nature of many tags, the need to understand previously unseen content, and constantly evolving labeling rules.
The overall solution ingests multimodal inputs, extracts features with various pre‑trained models, and then combines several recall modules: a text‑based extraction model, a classification model for high‑quality tag categories, a generative multimodal model, as well as similar‑video retrieval and face‑recognition modules. A ranking model assigns confidence scores and produces the final tag set tailored to each business line.
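The recall‑then‑rank flow above can be sketched as a small orchestration loop. All module and function names here are illustrative stand‑ins, not iQIYI's actual APIs; the toy score simply rewards tags recalled by multiple modules.

```python
def recall_candidates(video, recall_modules):
    """Union tag candidates from every recall module, keeping provenance."""
    candidates = {}
    for name, module in recall_modules.items():
        for tag in module(video):
            candidates.setdefault(tag, set()).add(name)
    return candidates

def rank_tags(candidates, score_fn, threshold=0.5):
    """Score each candidate and keep those above a confidence threshold."""
    scored = {t: score_fn(t, sources) for t, sources in candidates.items()}
    return sorted(
        (t for t, s in scored.items() if s >= threshold),
        key=lambda t: -scored[t],
    )

# Toy usage: two recall modules and a score favoring multi-module agreement.
modules = {
    "extraction": lambda v: ["cooking", "pasta"],
    "classification": lambda v: ["cooking", "food-review"],
}
tags = rank_tags(recall_candidates("video-001", modules),
                 score_fn=lambda t, srcs: len(srcs) / len(modules))
# "cooking" is recalled by both modules, so it ranks first
```

In the real system the ranking model is a multimodal BERT, but the contract is the same: recall modules over‑generate, and a single scorer arbitrates.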
Modality Layer – Text Model: A domain‑adapted ALBERT model pre‑trained on massive short‑video text and fine‑tuned with Masked Language Modeling and a shortened Sentence‑Order Prediction task suitable for short titles and descriptions.
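The Masked Language Modeling objective used for this domain adaptation can be illustrated with a minimal masking routine. This is a simplified sketch (no 80/10/10 replacement split, no subword handling), and the tokenized title is a made‑up example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Whole-token MLM masking: hide a random subset of tokens and record
    the originals the model must reconstruct at those positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # training label for this position
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("funny cat jumps off the sofa".split())
```

The Sentence‑Order Prediction counterpart is built analogously: positive pairs keep two text segments in order, negatives swap them.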
Modality Layer – Image Model: Convolutional networks such as ResNet‑50, Inception‑V3, Xception, EfficientNet and BigTransfer are employed. EfficientNet’s compound scaling of width, depth and resolution is highlighted for optimal performance.
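EfficientNet's compound scaling can be shown with a few lines of arithmetic. The coefficients below (α=1.2, β=1.1, γ=1.15, satisfying α·β²·γ² ≈ 2) are the ones reported in the original EfficientNet paper; the baseline depth/width/resolution values are illustrative.

```python
# Compound scaling: grow depth, width and resolution jointly with one
# coefficient phi, instead of scaling any single dimension alone.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # alpha * beta**2 * gamma**2 ~= 1.92

def compound_scale(phi, base_depth, base_width, base_resolution):
    """Return (layers, channels, input resolution) for scaling factor phi."""
    return (round(base_depth * ALPHA ** phi),
            round(base_width * BETA ** phi),
            round(base_resolution * GAMMA ** phi))

# Scaling an illustrative B0-like baseline one step (phi=1):
scaled = compound_scale(1, 18, 32, 224)  # -> (22, 35, 258)
```

The point the section highlights is that balancing all three dimensions, rather than only depth (as ResNet) or only width, gives better accuracy per FLOP.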
Modality Layer – Audio‑Video Model: The MixNeXtVLAD architecture, evolved from NetVLAD and NeXtVLAD, processes video frame features (via image models) and audio features extracted by VGGish. Knowledge distillation with multiple student branches and SE‑Context Gating enhances multimodal feature fusion.
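The VLAD‑family pooling at the heart of MixNeXtVLAD can be sketched in NumPy. In the real model the cluster centers and assignment weights are learned end‑to‑end (and NeXtVLAD adds grouping to cut parameters); here the centers are fixed random stand‑ins, just to show how variable‑length frame features collapse into one fixed‑size descriptor.

```python
import numpy as np

def netvlad_aggregate(frames, centers, alpha=10.0):
    """NetVLAD-style pooling: softly assign each frame feature to K cluster
    centers and accumulate residuals, yielding a fixed-size (K*D) video
    descriptor regardless of the number of frames."""
    # Soft assignment: softmax over negative squared distances to centers.
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * d2
    a = np.exp(logits - logits.max(1, keepdims=True))
    a /= a.sum(1, keepdims=True)
    # Accumulate soft residuals per cluster, then flatten and L2-normalize.
    vlad = (a[:, :, None] * (frames[:, None, :] - centers[None, :, :])).sum(0)
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 8))   # 30 frame features of dimension 8
centers = rng.normal(size=(4, 8))   # 4 clusters -> descriptor of size 32
desc = netvlad_aggregate(frames, centers)
```

Audio features from VGGish are pooled the same way and concatenated with the visual descriptor before the SE‑Context Gating fusion.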
Recall Layer – Multimodal Fusion: Tag generation is treated as a machine‑translation task using a Transformer encoder‑decoder. The encoder consumes multimodal embeddings, while the decoder generates tag sequences.
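Treating tagging as translation means the decoder emits tags one step at a time, conditioned on the encoder's multimodal representation. A minimal greedy decoding loop looks like this; `toy_decoder` is a hypothetical stand‑in for the real Transformer decoder, which would return logits conditioned on the encoded video.

```python
def greedy_decode(step_scores, bos="<s>", eos="</s>", max_len=8):
    """Greedy decoding: repeatedly pick the highest-scoring next tag until
    the decoder emits the end-of-sequence token."""
    prefix = [bos]
    while len(prefix) < max_len:
        scores = step_scores(prefix)          # {token: score} for next step
        nxt = max(scores, key=scores.get)
        if nxt == eos:
            break
        prefix.append(nxt)
    return prefix[1:]

def toy_decoder(prefix):
    """Deterministic stand-in: emits two tags, then stops."""
    nxt = {1: "cooking", 2: "pasta"}.get(len(prefix))
    return {nxt: 1.0, "</s>": 0.5} if nxt else {"</s>": 1.0}

tags = greedy_decode(toy_decoder)  # -> ["cooking", "pasta"]
```

Unlike the extraction model, this generative path can produce tags that never appear verbatim in the title or description.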
Recall Layer – Extraction: A BERT‑BiLSTM‑CRF pipeline extracts tags directly from text, leveraging mature entity‑recognition techniques.
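The CRF layer on top of the BERT‑BiLSTM emissions is decoded with the Viterbi algorithm. The sketch below runs it on hand‑made emission and transition scores for a BIO scheme; in the pipeline both score tables come from the trained model, and the transition matrix is what enforces constraints such as "I‑TAG cannot follow O".

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the tag path maximizing summed emission + transition scores.
    emissions: (n_tokens, n_tags); transitions[i, j]: score of tag i -> j."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(0)
        score = total.max(0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tags: 0=O, 1=B-TAG, 2=I-TAG. The transition table forbids O -> I-TAG.
trans = np.array([[0., 0., -1e4],
                  [0., 0., 1.],
                  [0., 0., 1.]])
emis = np.array([[2., 1., 0.],    # "watch"
                 [0., 2., 1.],    # "funny"
                 [0., 0., 2.],    # "cats"
                 [2., 0., 1.]])   # "today"
path = viterbi(emis, trans)       # -> [0, 1, 2, 0]: extracts "funny cats"
```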
Recall Layer – Hierarchical Classification: Dense‑HMCN classifies around 3,000 high‑quality tags in a hierarchical structure, supplemented by LightGBM‑derived sparse features.
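HMCN‑family models produce one local prediction per hierarchy level plus one global prediction over the flattened tag set, and blend the two. A minimal sketch of that fusion, with made‑up probabilities for a toy 2‑level hierarchy:

```python
import numpy as np

def hmcn_combine(local_levels, global_probs, beta=0.5):
    """HMCN-style output fusion: concatenate per-level local predictions and
    blend them with the global prediction over the full flat hierarchy."""
    local = np.concatenate(local_levels)
    assert local.shape == global_probs.shape
    return beta * global_probs + (1 - beta) * local

# Toy hierarchy: 2 top-level categories followed by 3 leaf tags.
local = [np.array([0.9, 0.1]),            # level-1 head
         np.array([0.8, 0.15, 0.05])]     # level-2 head
global_p = np.array([0.7, 0.3, 0.6, 0.3, 0.1])
final = hmcn_combine(local, global_p)     # -> [0.8, 0.2, 0.7, 0.225, 0.075]
```

The local heads keep predictions consistent within each level of the tag tree, while the global head captures cross‑level dependencies; the LightGBM sparse features mentioned above feed into these heads as extra inputs.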
Additional recall methods include face‑recognition, similar‑video tag retrieval, and knowledge‑graph‑based expansion to enrich tag candidates.
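The similar‑video retrieval path can be sketched as nearest‑neighbor tag propagation over precomputed embeddings. The embedding dimensions and tag lists below are toy values; a production index would use an ANN library rather than a brute‑force matrix product.

```python
import numpy as np

def retrieve_tags(query_emb, index_embs, index_tags, top_k=2):
    """Similar-video recall: take the tags of the top_k indexed videos
    closest to the query by cosine similarity, deduplicated in rank order."""
    q = query_emb / np.linalg.norm(query_emb)
    m = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    nearest = np.argsort(-(m @ q))[:top_k]
    tags = []
    for i in nearest:
        for t in index_tags[i]:
            if t not in tags:
                tags.append(t)
    return tags

index_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
index_tags = [["cat"], ["music"], ["cat", "dance"]]
tags = retrieve_tags(np.array([0.9, 0.1]), index_embs, index_tags)
# -> ["cat", "dance"]: the two visually nearest videos contribute their tags
```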
Ranking Layer: A BERT‑based multimodal scoring model combines early fusion (feeding each modality as separate sentences) and late fusion (merging tag candidates, weights and a [CLS] token) to assign a final confidence score to each tag.
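The late‑fusion side of the ranker can be illustrated as a weighted combination of the [CLS]‑head score with per‑recall‑module confidences. The weights here are hand‑set stand‑ins for learned parameters, and the logistic squashing is an assumption about the scoring head, not iQIYI's exact formulation.

```python
import math

def late_fusion_score(cls_score, module_scores, module_weights):
    """Blend the multimodal [CLS] score with the confidences of the recall
    modules that proposed this tag; squash to a (0, 1) confidence."""
    z = cls_score + sum(module_weights[m] * s
                        for m, s in module_scores.items())
    return 1.0 / (1.0 + math.exp(-z))

weights = {"extraction": 1.0, "classification": 1.5}
one_module = late_fusion_score(0.2, {"extraction": 0.9}, weights)
two_modules = late_fusion_score(
    0.2, {"extraction": 0.9, "classification": 0.8}, weights)
# A tag recalled by more modules with high confidence scores higher.
```

Tags whose final score clears a per‑business‑line threshold form the delivered tag set.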
The system brings tangible value to short‑video platforms by streamlining video production, enhancing personalized recommendation, and enabling intelligent operations such as region‑specific distribution and automated content understanding.
iQIYI Technical Product Team