
Content Tagging Technology for Short Videos at iQIYI: Challenges and Model Evolution

This article describes iQIYI's short‑video content tagging system, outlining the challenges of extracting type and abstract tags from multimodal data, detailing the evolution from text‑only models to image‑fusion, BERT‑enhanced, and video‑frame models, and discussing their applications and future directions.

Qunar Tech Salon

With the rapid growth of short‑video platforms, efficiently distributing massive video content has become a critical problem. iQIYI's content tagging technology, a key component of content understanding, is widely used in recommendation pipelines such as user profiling, recall, and ranking.

Tags are divided into two categories: "type tags" that follow a predefined hierarchical taxonomy, and "content tags" that are generated as open‑ended keywords or phrases describing the video. The article introduces the difficulties of extracting accurate tags from multimodal inputs, noting low inter‑annotator agreement (22.1%) and the prevalence of abstract tags that do not appear in titles.

The tagging model has undergone four major iterations:

Text model: the initial approach used candidate generation plus ranking, with candidates drawn from CRF extraction and from synonym, alias, entity, and upper-concept associations, then ranked by a simple attention-based semantic-similarity model. This approach struggled with abstract tags and short titles.
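The candidate-generation-plus-ranking pipeline can be sketched as follows. Everything here is an illustrative assumption rather than iQIYI's implementation: real candidates would come from a CRF tagger and knowledge-base expansion, and the ranker would use learned attention instead of the toy cosine similarity used here.

```python
import hashlib

import numpy as np


def embed(text, dim=8):
    """Toy deterministic embedding (stand-in for a learned text encoder)."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)


def rank_candidates(title, candidates):
    """Rank candidate tags by semantic similarity to the title.

    The article describes attention-based semantic similarity; plain
    cosine similarity of unit vectors is used here as the simplest proxy.
    """
    t = embed(title)
    scored = [(c, float(embed(c) @ t)) for c in candidates]
    return sorted(scored, key=lambda kv: -kv[1])


# In the real system this candidate list would come from CRF extraction
# plus synonym/alias/entity/upper-concept expansion.
ranked = rank_candidates("funny cat compilation",
                         ["cat", "pets", "comedy", "cooking"])
print([c for c, _ in ranked])
```

Because the ranker can only score candidates that surface from the title, this design cannot produce abstract tags that never appear in the text, which motivated the multimodal iterations below.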

Cover‑image fusion model: incorporated visual features extracted from cover images by a fine‑tuned Xception network, exploring three fusion strategies: adding the image feature to the encoder input, to the encoder output, or to the decoder's initial input.
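The three fusion points can be illustrated with plain numpy tensors that track shapes only. The dimensions are assumptions, and `img_feat` stands in for the pooled, projected Xception feature of the cover image.

```python
import numpy as np

seq_len, d_model = 12, 64            # title token sequence (illustrative)
tokens = np.zeros((seq_len, d_model))
img_feat = np.zeros((d_model,))      # projected cover-image feature

# (1) Fuse at the encoder input: prepend the image as a pseudo-token.
enc_in = np.vstack([img_feat[None, :], tokens])            # (13, 64)

# (2) Fuse at the encoder output: concatenate the image feature with
# every encoded token state (identity encoder used as a stand-in).
enc_out = tokens
fused_out = np.concatenate(
    [enc_out, np.tile(img_feat, (seq_len, 1))], axis=1)    # (12, 128)

# (3) Fuse at the decoder initial input: use the image feature to
# initialize the decoder's starting state.
dec_init = img_feat                                        # (64,)

print(enc_in.shape, fused_out.shape, dec_init.shape)
```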

BERT‑vector fusion model: integrated BERT sentence embeddings (average‑pooled from the second‑to‑last layer) into both the encoder and the decoder to improve general‑domain semantic understanding.
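The pooling step described above reduces to taking the second-to-last transformer layer and mean-pooling over non-padding tokens. The random tensor below stands in for real BERT hidden states of shape `(num_layers + 1, seq_len, hidden_size)`; the shapes are assumptions.

```python
import numpy as np

# Stand-in for BERT hidden states: 12 layers + embedding layer,
# 10 tokens, hidden size 768.
hidden_states = np.random.default_rng(0).standard_normal((13, 10, 768))
attention_mask = np.array([1] * 8 + [0, 0])     # last two tokens are padding

layer = hidden_states[-2]                       # second-to-last layer
mask = attention_mask[:, None]                  # (10, 1)
sent_vec = (layer * mask).sum(axis=0) / mask.sum()   # masked mean pool

print(sent_vec.shape)   # (768,)
```

Masking matters here: averaging over padding positions would dilute the sentence vector for short titles, which are common in this domain.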

Video‑frame fusion model: added key frames extracted from the video and encoded with Xception; the frame features are concatenated with the text, BERT, and cover‑image features, processed through early self‑attention and deep cross‑attention mechanisms, and decoded by an enhanced multi‑head decoder.
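The concatenate-then-attend step can be sketched in numpy. The single unprojected attention head and all dimensions below are simplifications assumed for illustration; the article describes multi-head attention and a deeper cross-attention stack.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
text   = rng.standard_normal((12, d))   # title token features
bert   = rng.standard_normal((1, d))    # pooled BERT sentence vector
cover  = rng.standard_normal((1, d))    # Xception cover-image feature
frames = rng.standard_normal((8, d))    # Xception key-frame features

# All modalities are concatenated into one sequence before attention.
x = np.concatenate([text, bert, cover, frames], axis=0)   # (22, 64)


def self_attention(x):
    """Single-head scaled dot-product self-attention, no Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)   # rows are softmax distributions
    return w @ x


fused = self_attention(x)
print(fused.shape)   # (22, 64)
```

Because every position attends over all modalities, frame features can reinforce abstract concepts that the title alone never mentions.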

These multimodal models have been applied in three main scenarios at iQIYI:

Short‑video production : high‑precision automatic tags replace manual annotation, achieving over 90% accuracy for more than 60% of tags and guiding content creation.

Personalized recommendation : fine‑grained content tags serve as user interest signals, improving recall explainability and ranking accuracy.

Search : abstract tags enhance semantic matching with queries, enable query expansion, and support long‑query processing.

Future work aims to further improve annotation quality, incorporate additional modalities such as OCR, audio, and richer entity knowledge, and explore newer model architectures to boost performance on emerging short‑video content.

Tags: transformer, short video, BERT, multimodal learning, iQIYI, content tagging
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
