
Content Tagging Technology for Short Videos: Challenges and Multi‑Modal Model Evolution at iQIYI

iQIYI’s short‑video tagging system tackles multimodal fusion, open‑set and abstract tags by evolving from a text‑only model through cover‑image, BERT‑vector, and video‑frame fusion architectures, enabling automated labeling, personalized recommendation, and semantic search while planning to add OCR, audio, and knowledge‑graph enhancements.

iQIYI Technical Product Team

Introduction

With the rise of short videos, massive numbers of videos are uploaded every day, and distributing them intelligently and efficiently is a key challenge for platforms. Content tagging is an important technique for content understanding and is widely used across the recommendation pipeline, including user profiling, recall, and ranking. Tags fall into two categories: type tags (pre‑defined hierarchical categories) and content tags (open‑set keywords generated from the video content itself). This article describes iQIYI’s content‑tagging technology for short videos.

Challenges of Content Tagging

A short video consists of title text, a cover image, and the video frames themselves, so accurate tag extraction requires multimodal fusion. The main difficulties are: (1) fusing heterogeneous modalities, (2) handling an open set of tags, and (3) low inter‑annotator agreement (only 22.1% consistency). Moreover, many tags (over 40%) are “abstract tags” that never appear in the title; for example, a video titled “Mother falls ill…” receives the tags “inspirational” and “positive energy”.

Algorithm Evolution

iQIYI’s tag model evolved through four stages: text‑only model, cover‑image fusion model, BERT‑vector fusion model, and video‑frame fusion model. Each stage is described below.

(1) Text Model

The text model uses only the video title and description. Initially a candidate‑generation + ranking framework was employed. Candidate tags come from CRF extraction, manually defined association rules (synonyms, aliases, entity linking, hypernyms) and high‑frequency tags not present in the text (pseudo‑type tags). Ranking uses an attention‑based semantic similarity model that encodes the title into a vector and computes similarity with candidate vectors.
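The ranking step above can be sketched as an attention‑based similarity model: the title tokens are pooled into a single vector, with attention weights conditioned on each candidate tag, and candidates are ranked by cosine similarity. The embeddings and vocabulary below are toy placeholders, not iQIYI’s trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 64

# Toy embedding table standing in for learned word/tag vectors (hypothetical).
vocab = {w: rng.normal(size=EMB) for w in
         ["mother", "falls", "ill", "inspirational", "comedy"]}

def attention_pool(tokens, query):
    """Pool token vectors into one title vector, weighting each token by
    its dot-product relevance to the candidate-tag query vector."""
    H = np.stack([vocab[t] for t in tokens])   # (T, EMB)
    scores = H @ query                         # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over tokens
    return weights @ H                         # (EMB,)

def rank_candidates(title_tokens, candidates):
    """Score each candidate tag by cosine similarity between the tag
    vector and the tag-conditioned, attention-pooled title vector."""
    scored = []
    for tag in candidates:
        q = vocab[tag]
        v = attention_pool(title_tokens, q)
        sim = (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
        scored.append((tag, float(sim)))
    return sorted(scored, key=lambda x: -x[1])

ranking = rank_candidates(["mother", "falls", "ill"],
                          ["inspirational", "comedy"])
print(ranking)
```

Conditioning the pooling on the candidate lets different tags attend to different parts of the title, which is the point of the attention mechanism here.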

The pure text model performs poorly on abstract tags and on videos with short, uninformative titles.

To better handle abstract tags, a Transformer‑based generative model was introduced and combined with the extractive approach. Self‑attention replaces the earlier attention mechanism, and additional contextual features (channel, etc.) are incorporated.
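A minimal sketch of such a generative tagger, assuming a standard encoder–decoder Transformer where a channel embedding is added to every title position; all sizes and the model layout are illustrative, not iQIYI’s actual architecture:

```python
import torch
import torch.nn as nn

class TagGenerator(nn.Module):
    """Generative tagger sketch: a Transformer encoder-decoder reads title
    tokens (plus a channel-context embedding) and emits tag tokens."""
    def __init__(self, vocab=1000, n_channels=10, d=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.channel = nn.Embedding(n_channels, d)  # contextual feature
        self.transformer = nn.Transformer(
            d_model=d, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, title_ids, channel_id, tag_ids):
        # Broadcast the channel embedding onto every title position.
        src = self.tok(title_ids) + self.channel(channel_id).unsqueeze(1)
        tgt = self.tok(tag_ids)
        h = self.transformer(src, tgt)
        return self.out(h)  # (batch, tag_len, vocab) next-token logits

model = TagGenerator()
logits = model(torch.randint(0, 1000, (2, 12)),   # title token ids
               torch.tensor([3, 7]),              # channel ids
               torch.randint(0, 1000, (2, 4)))    # tag token ids so far
print(logits.shape)
```

Because generation is not limited to words present in the title, this path can produce abstract tags such as “inspirational”.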

(2) Cover‑Image Fusion Model

Image features are extracted by fine‑tuning pretrained ImageNet models. Experiments showed Xception performed best. High‑frequency abstract tags are used as image classification labels; the penultimate layer of Xception provides the cover‑image vector.
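The pattern described above, classifying covers into high‑frequency abstract tags and then reusing the penultimate layer as the image vector, can be sketched as follows. The tiny CNN here is only a stand‑in for the fine‑tuned Xception backbone; the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class CoverEncoder(nn.Module):
    """Stand-in for the fine-tuned Xception model: train a classifier over
    high-frequency abstract tags, then expose the penultimate layer's
    activation as the cover-image vector used for fusion."""
    def __init__(self, n_tags=50, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),   # penultimate layer
        )
        self.head = nn.Linear(feat_dim, n_tags)   # abstract-tag classifier

    def forward(self, x):
        feat = self.backbone(x)        # cover-image vector for fusion
        return feat, self.head(feat)   # (features, tag logits)

enc = CoverEncoder()
feat, tag_logits = enc(torch.randn(2, 3, 224, 224))
print(feat.shape, tag_logits.shape)
```

Training the classification head on abstract tags pushes the penultimate features toward exactly the semantics the fusion model needs.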

Three fusion strategies were explored: adding image features to the encoder input, encoder output, or decoder initial input. Each path is mapped through a feed‑forward network before integration.
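The three injection points can be illustrated concretely. Each path gets its own feed‑forward projection of the image vector before it is added; tensor shapes and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

d = 64

def ffn(in_dim, out_dim):
    """Small feed-forward projection applied before each fusion point."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                         nn.Linear(out_dim, out_dim))

# One projection per fusion path (illustrative dimensions).
proj_enc_in, proj_enc_out, proj_dec_in = ffn(128, d), ffn(128, d), ffn(128, d)

img_vec = torch.randn(2, 128)    # cover-image vector from the CNN
src_emb = torch.randn(2, 12, d)  # title token embeddings
enc_out = torch.randn(2, 12, d)  # encoder hidden states

# Strategy 1: add the projected image vector to every encoder input position.
src_fused = src_emb + proj_enc_in(img_vec).unsqueeze(1)
# Strategy 2: add it to the encoder output instead.
enc_fused = enc_out + proj_enc_out(img_vec).unsqueeze(1)
# Strategy 3: feed it to the decoder as its initial input token.
dec_start = proj_dec_in(img_vec).unsqueeze(1)  # (batch, 1, d)

print(src_fused.shape, enc_fused.shape, dec_start.shape)
```

The strategies differ in how early the visual signal influences generation: at the input, after text encoding, or only when decoding starts.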

(3) BERT Vector Fusion Model

Because the text data is biased toward entertainment, a general‑purpose pretrained BERT model was incorporated to enhance semantic understanding. BERT sentence embeddings (second‑to‑last layer averaged) are added to the encoder and decoder via nonlinear projections.
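A sketch of the embedding step: average the second‑to‑last layer over non‑padding tokens, then project nonlinearly before adding the vector into the encoder and decoder. The hidden states below are simulated random tensors standing in for the per‑layer outputs a BERT model returns with `output_hidden_states=True`.

```python
import torch
import torch.nn as nn

def sentence_embedding(hidden_states, attention_mask):
    """Average the second-to-last BERT layer over non-padding tokens.
    `hidden_states` mimics the per-layer tuple (embeddings + 12 layers)
    returned by a BERT-base model with output_hidden_states=True."""
    layer = hidden_states[-2]                    # (B, T, H)
    mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
    return (layer * mask).sum(1) / mask.sum(1)   # masked mean -> (B, H)

B, T, H = 2, 10, 768
hidden_states = tuple(torch.randn(B, T, H) for _ in range(13))  # simulated
attention_mask = torch.ones(B, T, dtype=torch.long)

sent_vec = sentence_embedding(hidden_states, attention_mask)

# Separate nonlinear projections before adding into encoder and decoder.
proj_enc = nn.Sequential(nn.Linear(H, 64), nn.Tanh())
proj_dec = nn.Sequential(nn.Linear(H, 64), nn.Tanh())
print(sent_vec.shape, proj_enc(sent_vec).shape)
```

Using the second‑to‑last layer rather than the final one is a common heuristic, since the last layer is tuned toward BERT’s pretraining objectives rather than general sentence semantics.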

(4) Video‑Frame Fusion Model

Key frames are sampled from each video and encoded with Xception to obtain frame vectors. Early fusion concatenates text, BERT, image, and frame vectors followed by self‑attention. Deep fusion uses cross‑attention between text and video features. The decoder employs enhanced multi‑head self‑attention to combine early and deep fused representations.
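The two fusion paths can be sketched with standard attention modules: early fusion concatenates all modality sequences and runs self‑attention, while deep fusion lets text queries attend over video‑frame keys and values. All tensors below are random placeholders with assumed shapes.

```python
import torch
import torch.nn as nn

d, B = 64, 2
text = torch.randn(B, 12, d)     # title token states
frames = torch.randn(B, 8, d)    # key-frame vectors (e.g., from Xception)
bert_vec = torch.randn(B, 1, d)  # projected BERT sentence vector
img_vec = torch.randn(B, 1, d)   # projected cover-image vector

# Early fusion: concatenate all modality sequences, then self-attention.
self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
early = torch.cat([text, bert_vec, img_vec, frames], dim=1)  # (B, 22, d)
early_fused, _ = self_attn(early, early, early)

# Deep fusion: text queries cross-attend over video-frame keys/values.
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
deep_fused, _ = cross_attn(text, frames, frames)             # (B, 12, d)

print(early_fused.shape, deep_fused.shape)
```

The decoder would then combine both representations, in the article’s design via an enhanced multi‑head self‑attention over the early‑ and deep‑fused streams.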

Applications of Content Tags

iQIYI uses the tags for short‑video production (automating labeling, improving efficiency), personalized recommendation (fine‑grained user interest modeling, recall, ranking), and video search (semantic matching, query expansion, term weighting).

Future Directions

Further work includes incorporating entity and relation knowledge extracted from titles, adding more modalities such as OCR text, on‑screen person recognition, and audio, and exploring newer model architectures to boost precision.

