iQIYI Deep Semantic Representation Learning Framework for Video Recommendation and Search
Based on academic and industry experience, iQIYI has designed a deep semantic representation learning framework that integrates multimodal side information and deep models such as Transformers and graph neural networks, improving recall, ranking, deduplication, diversity and semantic matching across recommendation and search scenarios.
iQIYI’s technical product team describes how the framework is applied across iQIYI services—including short- and long-form video recommendation, live streaming, and search—enhancing recall, ranking, deduplication, diversity control, and semantic matching, and thereby lifting user consumption time and search relevance.
Background: Distributed representation traces back to linguists and early neural language models, leading to word2vec and the widespread adoption of embeddings for text, images, audio, and video. An embedding maps a high-dimensional sparse one-hot vector to a low-dimensional dense semantic vector, enabling similarity measurement across heterogeneous entities.
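The core idea can be sketched in a few lines: an embedding table replaces the one-hot-times-matrix product with a row lookup, and similarity between dense vectors becomes a cosine. This is a toy NumPy sketch, not iQIYI's code; the vocabulary size and dimension are illustrative:

```python
import numpy as np

# Toy embedding table: each of vocab_size ids gets a dense dim-vector.
rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 64          # hypothetical sizes
table = rng.standard_normal((vocab_size, dim)).astype(np.float32)

def embed(item_id: int) -> np.ndarray:
    # Mathematically equivalent to multiplying a one-hot row vector
    # by the table, but done as a cheap row lookup.
    return table[item_id]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity between dense vectors of heterogeneous entities.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = embed(42)
assert v.shape == (dim,)
assert abs(cosine(v, v) - 1.0) < 1e-6  # an item is maximally similar to itself
```

In practice the table is learned end-to-end rather than random, but the lookup-then-compare pattern is the same.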
Challenges:
- Entity and relationship diversity: iQIYI data spans texts, images, videos, users, circles, queries, and multiple interaction types.
- Rich side information: items carry multimodal content and extensive meta attributes that traditional shallow models fail to exploit.
- Varied business scenarios: different tasks (recall, ranking, deduplication, diversity, semantic matching, clustering) require distinct embedding types.
Framework Overview: The solution consists of four layers—Data, Feature, Strategy, and Application. The Data layer collects user behavior and constructs graphs; the Feature layer extracts and fuses multimodal representations (text, image, audio, video); the Strategy layer provides diverse deep semantic models and evaluation methods; the Application layer serves embedding, nearest-neighbor, and similarity services to downstream systems.
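For the Application layer's nearest-neighbor service, a minimal brute-force version looks like the following sketch. A production system would use an approximate index such as Faiss; the item count, dimension, and function names here are illustrative:

```python
import numpy as np

# Hypothetical item embedding matrix, L2-normalized so that a dot
# product equals cosine similarity.
rng = np.random.default_rng(1)
item_vecs = rng.standard_normal((1000, 32)).astype(np.float32)
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    q = query / np.linalg.norm(query)
    scores = item_vecs @ q             # cosine similarity to every item
    return np.argsort(-scores)[:k]     # ids of the k most similar items

ids = top_k(item_vecs[7], k=5)
assert ids[0] == 7                     # a query item retrieves itself first
```

Brute force is O(items) per query; an ANN index trades a little recall for sublinear lookup, which is what makes embedding-based retrieval serviceable at iQIYI's scale.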
Feature Extraction & Fusion:
- Text: pretrained language models (e.g., BERT, ALBERT) are fine-tuned to produce token-, sentence-, paragraph-, and document-level representations, with topic modeling and WME/CPTW for higher-level aggregation.
- Image: EfficientNet pretrained on ImageNet, plus self-supervised learning (Selfie), yields robust visual embeddings.
- Audio/Video: VGGish extracts 128-dimensional audio features; key-frame image embeddings are aggregated into video-level semantics.
- Fusion timing: early, late, and hybrid fusion are explored, with hybrid fusion generally delivering the best cross-modal interaction.
- Fusion methods: element-wise operations, bilinear pooling (MFB, MFH), and attention-based mechanisms (BAN) are employed.
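As a rough illustration of the fusion options above, the sketch below contrasts a simple element-wise (Hadamard) fusion with an attention-weighted fusion. It assumes all modality embeddings have already been projected to a shared dimension, and the attention weights are computed from a query vector rather than learned, so this is a shape-level sketch, not any of the cited methods (MFB/MFH/BAN):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def elementwise_fusion(text, image, audio):
    # Simplest fusion: Hadamard product of same-dimension modality vectors.
    return text * image * audio

def attention_fusion(modalities, query):
    # Weight each modality by its relevance to a query vector, then
    # take the weighted sum (a crude stand-in for learned attention).
    stacked = np.stack(modalities)        # (num_modalities, dim)
    weights = softmax(stacked @ query)    # (num_modalities,)
    return weights @ stacked              # (dim,)

rng = np.random.default_rng(2)
t, i, a = (rng.standard_normal(16) for _ in range(3))
fused = attention_fusion([t, i, a], query=t)
assert fused.shape == (16,)
```

Bilinear pooling methods go further by modeling pairwise feature interactions between modalities, which is why they tend to outperform plain element-wise fusion on cross-modal tasks.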
Deep Semantic Models:
- Content-based models: image-classification embeddings (ImageNet) and task-specific supervised embeddings (e.g., tag classification) provide cold-start-friendly representations.
- Matching-based models: Siamese or multi-tower networks (DSSM, CDML) combine content and behavior signals to learn joint query-item embeddings.
- Sequence-based models: shallow skip-gram and RNN models are replaced with Transformers (SASRec, BERT4Rec, XLNet-style) to capture long-range dependencies in user behavior.
- Graph-based models: traditional graph embeddings (DeepWalk, node2vec) are enhanced with side information, multimodal features, and advanced GNNs/GCNs (PinSAGE, ClusterGCN, ProNE, HGT) to model heterogeneous, high-order relationships.
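A matching-based (DSSM-style) two-tower scorer can be sketched as a forward pass: each tower maps its input features into a shared space, and the score is the dot product of L2-normalized embeddings. The weights below are random stand-ins for a trained model; in training, an objective such as in-batch softmax would push positive query-item pairs together:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out = 32, 16
Wq = rng.standard_normal((d_in, d_out)) * 0.1   # query-tower weights (untrained)
Wi = rng.standard_normal((d_in, d_out)) * 0.1   # item-tower weights (untrained)

def tower(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    h = np.maximum(x @ W, 0.0)                  # one ReLU layer per tower
    return h / (np.linalg.norm(h) + 1e-8)       # L2-normalize the embedding

def score(query_feats: np.ndarray, item_feats: np.ndarray) -> float:
    # Dot product of unit vectors: a cosine-style relevance score.
    return float(tower(query_feats, Wq) @ tower(item_feats, Wi))

q, item = rng.standard_normal(d_in), rng.standard_normal(d_in)
s = score(q, item)
assert -1.0 <= s <= 1.0
```

The practical payoff of the two-tower shape is that item embeddings can be precomputed and indexed, so online serving reduces to one query-tower pass plus a nearest-neighbor lookup.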
Optimization & Future Work:
- Video-level pretraining: train large-scale video-language models (e.g., UniViLM) on caption data.
- Knowledge-graph integration: incorporate entity and relation priors (KEPLER, KGCN) to enrich textual and recommendation embeddings.
- Business expansion: extend the framework to more iQIYI scenarios such as intelligent creation, short-video, live-stream, and comic recommendation.
References: The article cites foundational works on word embeddings, Transformers, graph neural networks, multimodal pooling, and recent advances in recommendation and search modeling.
DataFunTalk