Deep Semantic Relevance and Multi‑Modal Video Search at Alibaba Entertainment
This presentation details Alibaba Entertainment's video search system, covering its business scope, user‑value metrics, a layered algorithm framework, relevance challenges, multi‑modal retrieval, deep semantic relevance techniques, model selection, asymmetric twin‑tower deployment, multi‑stage knowledge distillation, and case studies of the system's effect in production.
Speaker Runchen, Senior Algorithm Expert at Alibaba Entertainment, introduces the video search business that provides one‑stop search and recommendation across platforms such as Youku, OTT, PC, and ticketing services, handling billions of OGC and UGC videos.
Two user‑value dimensions are discussed: the tool attribute (accuracy, coverage, playability, experience metrics) and the distribution attribute (view count, watch time, revenue), which together form a multi‑objective ranking problem.
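The two value dimensions feed a multi‑objective ranking problem. A minimal sketch of one way to blend them is below; the linear combination, weights, and candidate names are illustrative assumptions, not the production objective.

```python
# Blend a tool-attribute score (relevance/playability) with a
# distribution-attribute score (expected watch time, views).
# Weights are hypothetical.
def multi_objective_score(relevance: float, engagement: float,
                          w_rel: float = 0.7, w_eng: float = 0.3) -> float:
    return w_rel * relevance + w_eng * engagement

# With these weights, an exact match can outrank a merely popular clip.
candidates = {
    "exact_match": multi_objective_score(0.95, 0.40),
    "popular_clip": multi_objective_score(0.60, 0.90),
}
best = max(candidates, key=candidates.get)
```

In practice the blend is learned rather than hand-weighted, but the shape of the problem is the same: one score must trade accuracy against distribution value.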
The search algorithm framework consists of six layers: a data layer, a basic‑technology layer, an intent layer, a recall layer, a relevance layer, and a ranking layer, each of which is briefly described.
Relevance challenges for video search include heterogeneous content understanding, entity knowledge matching, and deep semantic computation.
Four relevance feature groups are presented: basic textual features, knowledge features, posterior interaction features, and semantic features (e.g., DSSM, BERT).
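The four feature groups can be pictured as one feature vector assembled per query–document pair. The sketch below is a hedged illustration; the extractor logic and field names are hypothetical, and real features come from the upstream layers.

```python
# Toy extractors for the four relevance feature groups; field names
# ("entity", "sem_score", ...) are assumptions for illustration.
def term_overlap(query: str, title: str) -> float:
    q, t = set(query.split()), set(title.split())
    return len(q & t) / max(len(q), 1)

def build_relevance_features(query: str, doc: dict) -> dict:
    return {
        # basic textual feature: lexical overlap between query and title
        "text_term_overlap": term_overlap(query, doc["title"]),
        # knowledge feature: a known entity of the doc appears in the query
        "kg_entity_match": float(doc["entity"] in query),
        # posterior interaction feature: observed click-through rate
        "ctr_posterior": doc["clicks"] / max(doc["impressions"], 1),
        # semantic feature: precomputed DSSM/BERT similarity
        "semantic_cosine": doc["sem_score"],
    }

doc = {"title": "iron man 3 official trailer", "entity": "iron man",
       "clicks": 50, "impressions": 200, "sem_score": 0.82}
feats = build_relevance_features("iron man trailer", doc)
```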
Multi‑modal video search is motivated by the insufficiency of pure text matching; a three‑stage solution (CV‑based content understanding → multimodal recall → multimodal relevance ranking) is described, with examples of OCR, face recognition, and audio‑visual feature extraction.
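A toy sketch of the multimodal recall stage: per‑modality embeddings (e.g. text from titles/OCR, visual from CV models) are concatenated and L2‑normalised so one nearest‑neighbour search covers both signals. The concatenation‑based fusion is an assumption; the talk does not specify the exact combination.

```python
import numpy as np

def fuse(text_emb: np.ndarray, visual_emb: np.ndarray) -> np.ndarray:
    # Concatenate modalities, then L2-normalise so dot product = cosine.
    v = np.concatenate([text_emb, visual_emb])
    return v / np.linalg.norm(v)

def recall_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2):
    sims = doc_matrix @ query_vec          # cosine similarity on unit vectors
    return np.argsort(-sims)[:k].tolist()

query = fuse(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
docs = np.stack([
    fuse(np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # matches both modalities
    fuse(np.array([0.0, 1.0]), np.array([1.0, 0.0])),  # matches neither
    fuse(np.array([1.0, 0.0]), np.array([1.0, 0.0])),  # matches text only
])
```

At production scale the same idea runs through an approximate nearest‑neighbour index rather than a dense matrix product.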
Deep semantic relevance is explored through a three‑stage model pipeline: transfer (pre‑train BERT on Alibaba logs), adapt (multi‑task fine‑tuning for recall and ranking), and distill (knowledge distillation). Model choices include symmetric and asymmetric twin‑tower architectures, with multi‑stage distillation to close the performance gap.
Model selection weighs the trade‑off between interaction‑type BERT (high accuracy, but feasible only offline) and twin‑tower models (efficient enough for online serving), and introduces an asymmetric twin‑tower in which the document side stores multiple precomputed embeddings while the query side runs a lightweight BERT.
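The asymmetric twin‑tower scoring step can be sketched as follows: the document tower precomputes several embeddings per video offline, and the lightweight online query tower emits a single embedding. Max‑pooling over the document embeddings is an assumed aggregation here; attention‑weighted pooling is another common choice.

```python
import numpy as np

def asymmetric_score(query_emb: np.ndarray, doc_embs: np.ndarray) -> float:
    # One online query embedding against several offline document
    # embeddings; take the best match (max-pooling is an assumption).
    return float(np.max(doc_embs @ query_emb))

q = np.array([1.0, 0.0])                  # online query embedding
doc = np.array([[0.0, 1.0],               # e.g. embedding of the title field
                [0.8, 0.6]])              # e.g. embedding of the description
```

Storing multiple embeddings per document shifts capacity to the offline side, which is what lets the online query encoder stay small.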
Multi‑stage distillation uses a large unlabeled transfer set and a manually annotated target set, applying soft‑label supervision, embedding MSE loss, and layer‑wise learning rates to preserve information during dimensionality reduction.
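The distillation objective described above can be sketched as two terms plus a learning‑rate schedule. The squared‑error form of the soft‑label term, the 50/50 loss weighting, and the decay factor are assumptions for illustration.

```python
import numpy as np

def distill_loss(student_score, teacher_score,
                 student_emb, teacher_emb, proj, alpha=0.5):
    # Soft-label supervision: pull the student's relevance score toward
    # the teacher's.
    soft = (student_score - teacher_score) ** 2
    # Embedding MSE through a projection (the student is lower-dimensional),
    # preserving information during dimensionality reduction.
    emb_mse = np.mean((proj @ student_emb - teacher_emb) ** 2)
    return alpha * soft + (1 - alpha) * emb_mse

def layerwise_lrs(n_layers, base_lr=1e-4, decay=0.5):
    # Layer-wise learning rates: layers closer to the input decay more,
    # protecting pretrained knowledge while the top layers adapt.
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]
```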
Knowledge‑enhanced semantic matching combines KG sub‑graph embeddings with text embeddings via attention, improving alignment for queries that require both entity knowledge and semantic understanding.
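A small sketch of that fusion step: the final representation is an attention‑weighted mix of a text embedding and a KG sub‑graph embedding, with weights conditioned on the query. The softmax‑over‑dot‑products attention is an assumption about the mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_kg_text(text_emb, kg_emb, query_emb):
    # Query-dependent attention weights over the two sources.
    scores = np.array([text_emb @ query_emb, kg_emb @ query_emb])
    w = softmax(scores)
    return w[0] * text_emb + w[1] * kg_emb

fused = fuse_kg_text(np.array([1.0, 0.0]),   # text embedding
                     np.array([0.0, 1.0]),   # KG sub-graph embedding
                     np.array([1.0, 0.0]))   # query aligned with the text side
```

Because the weights depend on the query, an entity‑heavy query can lean on the KG side while a paraphrase‑style query leans on the text side.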
Case studies demonstrate improved relevance ordering, higher user satisfaction, and better recall of semantically related videos after the new system was deployed.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.