
Content Tagging Technology for Short Videos: Challenges and Model Evolution at iQIYI

This article examines the challenges of short‑video content tagging and describes iQIYI's multi‑stage evolution from simple text‑only models to sophisticated multimodal architectures that fuse cover images, BERT embeddings, and video frames to improve tag generation accuracy.

DataFunTalk

With the rapid growth of short‑video platforms, efficiently distributing massive video streams requires accurate content tagging, which is crucial for user profiling, recall, and ranking in recommendation systems. Tags are divided into predefined type tags and open‑set content tags generated from video semantics.

Challenges of Content Tagging

Short videos carry titles, cover images, and the video content itself; extracting reliable tags therefore demands multimodal fusion. The open-set nature of content tags makes label selection difficult, and human annotators agree on only about 22% of labels.

Algorithmic Evolution

(1) Text Model

Initially, only video titles were used. Candidate tags were generated via CRF-based extraction, lexical association rules (synonyms, aliases, entities, and higher-level concepts), and high-frequency pseudo-type tags; an attention-based semantic-similarity model then ranked the candidates. This extractive approach struggled with abstract tags and very short titles.
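The candidate-generation-plus-ranking pipeline can be sketched as follows. This is a toy illustration, not iQIYI's code: the alias/hypernym tables, the random embeddings, and the dot-product attention scorer are all stand-ins for the learned lexical rules and ranking model described above.

```python
import numpy as np

# Hypothetical toy vocabulary; the production model learns real embeddings.
rng = np.random.default_rng(0)
EMB = {w: rng.normal(size=8) for w in
       ["funny", "cat", "pet", "animal", "video"]}
ALIASES = {"kitty": "cat"}               # lexical rule: aliases
HYPERNYMS = {"cat": ["pet", "animal"]}   # lexical rule: higher-level concepts

def candidates(title_words):
    """Expand title words into candidate tags via alias/hypernym rules."""
    cands = set()
    for w in title_words:
        w = ALIASES.get(w, w)
        if w in EMB:
            cands.add(w)
            cands.update(HYPERNYMS.get(w, []))
    return sorted(cands)

def attention_score(title_words, tag):
    """Attention-weighted similarity between title tokens and a tag vector."""
    tvecs = np.array([EMB[ALIASES.get(w, w)] for w in title_words
                      if ALIASES.get(w, w) in EMB])
    tag_v = EMB[tag]
    logits = tvecs @ tag_v                      # dot-product attention
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    context = weights @ tvecs                   # attended title representation
    return float(context @ tag_v /
                 (np.linalg.norm(context) * np.linalg.norm(tag_v)))

title = ["funny", "kitty", "video"]
ranked = sorted(candidates(title), key=lambda t: -attention_score(title, t))
```

Note how "kitty" never appears as a tag: the alias rule normalizes it to "cat", whose hypernyms then enter the candidate pool, which is exactly why abstract tags with no surface match in the title are hard for a purely extractive approach.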

To address these limits, a generative Transformer‑based model was introduced, first generating tags and falling back to extraction when necessary. Improvements included replacing the attention mechanism with self‑attention, adding contextual features, and integrating BERT sentence embeddings.
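The generate-first, extract-as-fallback control flow might look like the sketch below. The confidence threshold, the stub generator, and the stub extractor are assumptions for illustration; the real system uses the Transformer generator and CRF extractor described above.

```python
def tag_video(title, generate, extract, min_conf=0.5):
    """Generative-first tagging with extractive fallback (illustrative only).

    `generate` returns (tag, confidence) pairs; if none clears the
    confidence threshold, fall back to the extractive model.
    """
    tags = [(t, c) for t, c in generate(title) if c >= min_conf]
    if tags:
        return [t for t, _ in tags]
    return extract(title)  # fallback: extraction from the title

# Stubs standing in for the Transformer generator / CRF extractor.
gen_ok  = lambda s: [("travel vlog", 0.9)]
gen_bad = lambda s: [("noise", 0.1)]
ext     = lambda s: [w for w in s.split() if len(w) > 4]
```

The appeal of this arrangement is that the generative path can emit abstract tags absent from the title, while the extractive path guarantees some output when generation is unreliable.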

(2) Cover‑Image Fusion Model

Image features were extracted with a fine‑tuned Xception network, using high‑frequency abstract tags as classification targets. The resulting image vector was fused into the Transformer at the encoder input, the encoder output, or the decoder's initial state, with each fusion point mapped through its own feed‑forward network.
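A minimal sketch of the fusion mechanics, assuming a 2048-dimensional Xception pooling feature and a 64-dimensional Transformer width (both dimensions are assumptions; the weights here are random, not trained):

```python
import numpy as np

rng = np.random.default_rng(1)
D_IMG, D_MODEL = 2048, 64   # Xception pool size / Transformer width (assumed)

def ffn(d_in, d_out):
    """Build one projection head; the model uses a separate FFN per fusion point."""
    W, b = rng.normal(size=(d_in, d_out)) * 0.01, np.zeros(d_out)
    return lambda x: np.maximum(x @ W + b, 0.0)   # ReLU projection

# Separate feed-forward networks for each fusion point.
to_enc_in, to_enc_out, to_dec_init = (ffn(D_IMG, D_MODEL) for _ in range(3))

img = rng.normal(size=D_IMG)              # Xception cover-image feature
enc_in = rng.normal(size=(10, D_MODEL))   # title token embeddings

enc_in_fused = enc_in + to_enc_in(img)    # add projected image to encoder input
dec_init = to_dec_init(img)               # image-conditioned decoder start state
# Encoder-output fusion is analogous: enc_out + to_enc_out(img).
```

Using a separate FFN per injection point lets each location learn its own view of the same image vector rather than sharing one projection.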

(3) BERT Vector Fusion Model

To improve general‑domain understanding, BERT sentence embeddings (the second‑to‑last layer, averaged over tokens) were incorporated: after a non‑linear projection, they were added to the encoder inputs/outputs and to the decoder's initial inputs.
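The "second-to-last layer, averaged" recipe can be made concrete. The sketch below simulates BERT-base output shapes (13 hidden states of width 768: embeddings plus 12 layers) with random arrays; in real code these would come from a BERT implementation configured to return all hidden states, and the projection width of 64 is an assumption.

```python
import numpy as np

# Simulated BERT-base outputs: 13 hidden states (embedding layer + 12
# Transformer layers), each of shape (seq_len, hidden). A real pipeline
# would obtain these from a BERT model with hidden-state output enabled.
rng = np.random.default_rng(2)
hidden_states = [rng.normal(size=(12, 768)) for _ in range(13)]

# Sentence embedding: average the tokens of the second-to-last layer.
# The last layer is often too specialized toward the pretraining objective,
# which is a common motivation for backing off one layer.
sent_vec = hidden_states[-2].mean(axis=0)

def project(v, d_model=64):
    """Non-linear projection before adding to encoder/decoder streams."""
    W = rng.normal(size=(v.shape[0], d_model)) * 0.01
    return np.tanh(v @ W)

fused = project(sent_vec)   # ready to add to encoder inputs/outputs
```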

(4) Video‑Frame Fusion Model

Key frames were sampled from each video and encoded with Xception to obtain frame vectors. Early fusion concatenated text, BERT, image, and frame features before self‑attention; deep fusion applied cross‑attention between text and video features. The decoder then used an enhanced multi‑head self‑attention to combine the early‑ and deep‑fused representations.
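The two fusion styles differ in where the modalities meet. Below is a single-head, numpy-only sketch; all dimensions and the random features are assumptions, and the real model uses learned multi-head projections.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 64
text   = rng.normal(size=(10, D))   # title token states
bertv  = rng.normal(size=(1, D))    # projected BERT sentence vector
img    = rng.normal(size=(1, D))    # projected cover-image vector
frames = rng.normal(size=(8, D))    # projected key-frame vectors

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Early fusion: concatenate all modalities into one sequence, then let
# self-attention mix them jointly.
early = np.concatenate([text, bertv, img, frames], axis=0)   # (20, D)
early_attn = softmax(early @ early.T / np.sqrt(D)) @ early

# Deep fusion: text tokens act as queries that cross-attend to the
# video-frame keys/values, aligning each word with relevant frames.
cross = softmax(text @ frames.T / np.sqrt(D)) @ frames       # (10, D)
```

Early fusion gives every modality a chance to attend to every other, at the cost of a longer sequence; deep fusion keeps the text sequence length fixed and injects video evidence token by token, which is why the decoder benefits from combining both representations.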

Applications of Content Tags

The generated tags are used widely at iQIYI: in short‑video production (replacing manual labeling, with over 90% precision for 60% of tags), in personalized recommendation (improving recall and explainability), and in video search (semantic matching, query expansion, and term weighting).

Future Directions

Ongoing work aims to improve annotation quality, incorporate richer priors such as entity relations, and integrate additional modalities, including OCR text, audio, and fine‑grained video semantics, to further boost model performance.


Tags: transformer, short video, BERT, multimodal learning, iQIYI, content tagging
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
