
iQIYI Multimodal Technology: Datasets, Applications, and Future Directions

iQIYI applies multimodal AI, combining audio, visual, and textual cues, to advance video understanding. The company has released the world's largest celebrity video dataset (iQIYI-VID) and built applications such as actor-focused playback, AI Radar, emoji generation, and rapid automated editing, with future research aimed at emoji captioning, cross-modal retrieval, visual question answering, and broader uses in health care and education.

iQIYI Technical Product Team

Online video watching has become a daily habit for many people, and viewers now expect not only abundant content but also a high-quality viewing experience. iQIYI is leveraging multimodal technology—integrating audio, visual, and textual cues—to improve video understanding and user services.

In 2016, iQIYI won the international EmotiW video facial-expression recognition competition by jointly using facial and audio modalities, a result that sparked deeper research into multimodal methods.

To promote multimodal research, iQIYI released the world's largest celebrity video dataset, iQIYI-VID, in 2018, followed by an expanded version, iQIYI-VID-2019, in 2019. The 2019 edition covers 10,000 celebrities with 200 hours of video across 200,000 clips, adding 5,000 new stars and more diverse short-video scenarios. Dataset links: iQIYI-VID-2018, iQIYI-VID-2019.
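The dataset's task is to identify which celebrity appears in each clip. A common baseline for this kind of clip-level identification, sketched below with hypothetical names (the embeddings, quality scores, and gallery are stand-ins, not the dataset's official API), aggregates per-frame face embeddings into one clip descriptor and matches it against per-celebrity centroids; it illustrates the general approach, not iQIYI's pipeline.

```python
# Hedged sketch: clip-level celebrity identification on iQIYI-VID-style data.
# Frame-level face embeddings are aggregated into a single clip descriptor,
# then matched against per-celebrity mean embeddings.
import numpy as np

def clip_embedding(frame_embs: np.ndarray, quality: np.ndarray) -> np.ndarray:
    """Quality-weighted average of per-frame face embeddings (F x D)."""
    w = quality / (quality.sum() + 1e-8)          # normalize frame weights
    emb = (frame_embs * w[:, None]).sum(axis=0)   # weighted mean over frames
    return emb / (np.linalg.norm(emb) + 1e-8)     # L2-normalize the result

def identify(clip_emb: np.ndarray, gallery: dict[str, np.ndarray]) -> str:
    """Nearest-centroid match: gallery maps celebrity id -> mean embedding."""
    names = list(gallery)
    mat = np.stack([gallery[n] for n in names])   # (N_celebs, D)
    scores = mat @ clip_emb                       # cosine sim (all normalized)
    return names[int(scores.argmax())]

# Toy usage with random features standing in for a face-recognition backbone.
rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 512))               # 30 frames, 512-d embeddings
qual = rng.uniform(0.2, 1.0, size=30)             # per-frame detection quality
gal = {"celeb_a": clip_embedding(rng.normal(size=(5, 512)), np.ones(5)),
       "celeb_b": clip_embedding(rng.normal(size=(5, 512)), np.ones(5))}
print(identify(clip_embedding(frames, qual), gal))
```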

iQIYI has applied multimodal AI to several products:

只看TA ("Watch Only Them"): Lets users watch only the scenes featuring a selected actor or pair, using facial, body, and scene recognition in the mobile app (see the interval-merging sketch after this list).

AI Radar (TV) : Enables remote‑control‑based character identification on TV, combining face, scene, and audio analysis.

逗芽表情 (Douya Emoji): An AI-driven emoji-generation mini-program that extracts facial expressions from videos and pairs them with fitting captions.

Starworks : An intelligent video‑editing system that automatically searches for relevant clips, assembles them according to a script, and produces short videos in under a minute, using multimodal cues such as face, speech, subtitles, and music beats.
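To make the 只看TA idea concrete, here is a minimal sketch of how timestamped recognition hits for a chosen actor could be merged into playable segments. The gap and minimum-length thresholds are illustrative assumptions, not iQIYI's values.

```python
# Hedged sketch of "watch only this actor" playback: per-second face-recognition
# hits for the chosen actor are merged into continuous segments, and short
# blips are dropped. Thresholds are illustrative, not iQIYI's.

def actor_segments(hit_times: list[float], max_gap: float = 2.0,
                   min_len: float = 3.0) -> list[tuple[float, float]]:
    """Merge timestamps (seconds) where the actor was recognized into
    playback intervals [start, end]."""
    if not hit_times:
        return []
    hits = sorted(hit_times)
    segments, start, prev = [], hits[0], hits[0]
    for t in hits[1:]:
        if t - prev > max_gap:                 # gap too large: close segment
            segments.append((start, prev))
            start = t
        prev = t
    segments.append((start, prev))
    # Keep only segments long enough to be worth jumping to.
    return [(s, e) for s, e in segments if e - s >= min_len]

# Example: recognition hits around two scenes yield two playable intervals.
print(actor_segments([10, 11, 12, 13, 30, 31, 32, 33, 34, 35]))
# -> [(10, 13), (30, 35)]
```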

The underlying technologies include face and expression recognition, scene and clothing detection, speech-to-text, OCR, NLP for subtitle analysis, audio analysis, beat detection, shot segmentation, visual effects, and video filtering. iQIYI maintains a celebrity face library of over one million identities and a library of more than 20,000 cartoon characters, both continuously updated through AI monitoring of trends and view counts.
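As one concrete example of these building blocks, the sketch below shows beat detection with librosa, the kind of timing signal a Starworks-style editor could snap cuts to. The audio file path is a placeholder.

```python
# Hedged sketch: beat detection for music-aware cutting. librosa's beat
# tracker estimates beat times; an editor could then align shot changes
# to those times. "soundtrack.mp3" is a placeholder path.
import librosa

y, sr = librosa.load("soundtrack.mp3")                  # samples + sample rate
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

print("estimated tempo (BPM):", tempo)
print("first cut points (s):", [round(float(t), 2) for t in beat_times[:5]])
```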

Dr. Lu Xiangju, scientist and head of the PersonAI team, notes that perfect emotion recognition and visual semantics remain challenging because machines lack genuine feelings. Current approaches combine multimodal feature fusion (end‑to‑end black‑box models) with single‑modality tagging followed by high‑level semantic abstraction.
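A minimal sketch of the first approach, end-to-end feature fusion, is shown below: each modality gets its own encoder, the features are concatenated, and a joint head produces the prediction, so the fusion itself is learned. Dimensions and layer sizes are illustrative assumptions, not iQIYI's architecture.

```python
# Hedged sketch of end-to-end multimodal feature fusion: per-modality
# encoders feed a joint classifier over concatenated features.
import torch
import torch.nn as nn

class MultimodalFusionNet(nn.Module):
    def __init__(self, face_dim=512, audio_dim=128, text_dim=300, n_classes=7):
        super().__init__()
        # One small projection head per modality.
        self.face = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU())
        self.audio = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.text = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU())
        # Joint classifier over the concatenated modality features.
        self.head = nn.Linear(256 * 3, n_classes)

    def forward(self, face, audio, text):
        fused = torch.cat([self.face(face), self.audio(audio),
                           self.text(text)], dim=-1)
        return self.head(fused)                 # e.g. emotion/tag logits

model = MultimodalFusionNet()
logits = model(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 7])
```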

Future directions highlighted include automatic emoji caption generation, cross‑modal retrieval, visual question answering, and broader applications such as classroom monitoring and health care, emphasizing that multimodal AI will increasingly break single‑modality limits to better match human perception.
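Of these directions, cross-modal retrieval is easy to illustrate: if text and video are embedded into a shared space, a text query can rank clips by cosine similarity. The sketch below uses random vectors as stand-ins for real text and video encoders.

```python
# Hedged sketch of cross-modal retrieval: a text embedding ranks video-clip
# embeddings in a shared space. Encoders are faked with random vectors.
import numpy as np

def rank_clips(text_emb: np.ndarray, clip_embs: np.ndarray) -> np.ndarray:
    """Return clip indices sorted by similarity to the text query.
    Inputs are assumed L2-normalized, so dot product = cosine similarity."""
    return np.argsort(clip_embs @ text_emb)[::-1]

def l2(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
query = l2(rng.normal(size=64))          # stand-in for a text encoder output
clips = l2(rng.normal(size=(100, 64)))   # stand-ins for video encoder outputs
print(rank_clips(query, clips)[:5])      # top-5 clip indices for the query
```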

Tags: multimodal AI, computer vision, datasets, machine learning, iQIYI, video analysis
Written by iQIYI Technical Product Team

The technical product team of iQIYI