
Multimodal Content Understanding Techniques in Search Systems

This talk presents Tencent's multimodal content understanding framework for search, covering hierarchical content features, large‑scale ranking, fine‑grained image semantic vectors, video and document analysis, quality detection, duplicate removal, and future directions in AI‑driven search.

DataFunTalk

The presentation introduces a comprehensive multimodal content understanding system used in general search, highlighting two main components: content features and index selection, which together enable fine‑grained modeling from character to page level for ranking.

It details page‑level understanding (semantic segmentation, core element extraction, visual‑based methods), image understanding (quality assessment, multimodal semantic matching), paragraph‑level NLP tasks (topic modeling, segmentation), sentence‑level tasks (language detection, fluency, similarity), and character‑level typo detection using BERT‑based sequence labeling.
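The character‑level step can be illustrated with a small post‑processing helper. This is a sketch, not the talk's actual scheme: it assumes the BERT‑based sequence labeler emits one label per character ("T" for a suspected typo, "O" for clean text — label names are hypothetical), and converts that label sequence into contiguous typo spans.

```python
def typo_spans(chars, labels):
    """Collect contiguous runs of characters tagged as typos.

    `labels` is the per-character output of a sequence-labeling model
    (e.g. a BERT token classifier): "T" marks a suspected typo, "O" is
    clean text. Returns (start, end, text) spans, end-exclusive.
    """
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "T" and start is None:
            start = i                                  # span opens
        elif lab != "T" and start is not None:
            spans.append((start, i, "".join(chars[start:i])))
            start = None                               # span closes
    if start is not None:                              # span runs to the end
        spans.append((start, len(chars), "".join(chars[start:])))
    return spans
```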

The talk then explains the four‑layer image‑text understanding pipeline: low‑level content parsing (KIE, layout analysis, page type detection), quality and authority estimation, image‑text matching (using BERT, later ViT and large Chinese image‑text datasets), and attribute extraction (domain, region, site authority).
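The image‑text matching layer can be sketched as a dual‑encoder scoring step. The talk does not give the exact formulation, so this is a generic contrastive‑matching sketch: cosine similarity between a text embedding and candidate image embeddings, scaled by a temperature and normalized with softmax (the temperature value is illustrative).

```python
import math

def match_scores(text_vec, image_vecs, temperature=0.07):
    """Score candidate images against a text embedding: cosine
    similarity scaled by a temperature, softmax-normalized over the
    candidates -- the usual dual-encoder contrastive-matching setup."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    logits = [cos(text_vec, v) / temperature for v in image_vecs]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```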

For large‑scale ranking over a web corpus on the order of a trillion pages, the system employs pre‑processing signatures, LTR and LR models with hundreds of features (page rank, user rank, site rank), multi‑stage index selection (VIP, secondary, tertiary), and a two‑stage recall strategy that balances lightweight and heavyweight features.
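The two‑stage recall idea reduces to a simple pattern: rank the full candidate set with a cheap scorer, keep only the top slice, then rerank that slice with the expensive scorer. A minimal sketch (function names and cutoffs are illustrative, not the production system's):

```python
def two_stage_recall(docs, cheap_score, heavy_score, k_cheap=1000, k_final=10):
    """Two-stage recall: rank everything with lightweight features,
    then spend heavyweight features only on the survivors."""
    shortlist = sorted(docs, key=cheap_score, reverse=True)[:k_cheap]
    return sorted(shortlist, key=heavy_score, reverse=True)[:k_final]
```

The cost trade‑off is the point: `heavy_score` runs on `k_cheap` candidates instead of the whole corpus, so its per‑document cost can be orders of magnitude higher than `cheap_score`'s.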

Fine‑grained image semantic vectors are used for duplicate, instance, and semantic retrieval; challenges include scale‑sensitive vector search and diverse retrieval needs, addressed by multi‑label embeddings, quantization, and a combination of metric learning and classification models (MobileNet, Open Images pre‑training, asymmetric loss).
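The asymmetric loss mentioned for the multi‑label classification model can be sketched in plain Python. This follows the published ASL formulation (down‑weight easy negatives harder than positives, and clip negatives inside a probability margin); the hyperparameter values here are illustrative defaults, not the ones used in the talk.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss for multi-label classification: positives use a
    mild focal term (gamma_pos), negatives a strong one (gamma_neg),
    and negatives with probability below `margin` contribute nothing."""
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total -= (1 - p) ** gamma_pos * math.log(max(p, 1e-8))
        else:
            pm = max(p - margin, 0.0)      # probability shifting for negatives
            total -= pm ** gamma_neg * math.log(max(1 - pm, 1e-8))
    return total
```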

Multimodal quality detection combines a large R‑CNN model for image regions with paragraph tokenization for text, and employs the UNITER model (matching loss, masked region loss, masked token loss) trained on ~70 million samples, achieving a ~12% AUC improvement.
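Of UNITER's three objectives, the masked token loss is the easiest to make concrete: cross‑entropy on the true token, computed only at the positions that were masked out of the input. A minimal sketch (the real model operates on batched tensors; this scores one sequence):

```python
def masked_token_loss(log_probs, token_ids, mask):
    """Masked-token objective: average negative log-likelihood of the
    ground-truth token, taken only over masked positions.

    log_probs[i][v] is the model's log-probability of vocab item v at
    position i; mask[i] is True where the input token was masked.
    """
    losses = [-log_probs[i][token_ids[i]]
              for i in range(len(token_ids)) if mask[i]]
    return sum(losses) / len(losses) if losses else 0.0
```

The masked region loss is the same idea applied to image‑region features, and the matching loss is a binary image‑text alignment classifier; the three are summed during pretraining.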

Document domain authority recognition uses a dual‑tower architecture with TextCNN for queries and RoBERTa + CNN + Attention for authors, followed by joint training and online hard‑negative mining.
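The online hard‑negative mining step can be sketched independently of the towers themselves. Assuming an in‑batch score matrix where `scores[i][j]` is the dual‑tower similarity of query i against author j (diagonal entries being the positive pairs — this batch layout is an assumption, not stated in the talk), the hardest negative for each query is simply the highest‑scoring non‑matching author:

```python
def hard_negatives(scores):
    """Online hard-negative mining for a dual-tower model: for each
    query i, return the index of the non-matching column with the
    highest similarity score -- the example the model most confuses."""
    picks = []
    for i, row in enumerate(scores):
        neg = max((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        picks.append(neg)
    return picks
```

Training then pushes each query's positive score above its hardest negative, which concentrates gradient on the confusable pairs.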

Duplicate detection at massive scale follows a two‑stage paradigm: lightweight feature computation for fast filtering, followed by heavyweight embedding‑based similarity ranking, reducing storage cost and improving performance.
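The lightweight stage of this paradigm is commonly a signature scheme such as SimHash, where Hamming distance between 64‑bit signatures approximates document similarity. The talk does not name the specific signature used, so this SimHash sketch is illustrative:

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash: hash each token, vote per bit (+1/-1), and keep
    the sign -- similar token sets yield signatures with small Hamming
    distance, and the signature is cheap enough to precompute per page."""
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for b in range(bits):
            v[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if v[b] > 0)

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def near_duplicates(docs, threshold=3):
    """Stage one: emit candidate pairs whose signatures differ in at
    most `threshold` bits; only these pairs proceed to the heavyweight
    embedding-based similarity ranking."""
    sigs = {name: simhash(text.split()) for name, text in docs.items()}
    names = sorted(sigs)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if hamming(sigs[a], sigs[b]) <= threshold]
```

The storage win comes from comparing 64‑bit integers instead of full embeddings for the vast majority of pairs.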

The future outlook emphasizes unsupervised large‑scale training and cross‑modal knowledge interaction to further enhance multimodal search capabilities.

A brief Q&A addresses semantic page segmentation methods, including CSS‑based parsing, link density heuristics, and vision‑based models such as LayoutLM.
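The link‑density heuristic from the Q&A is simple to state concretely: measure what fraction of a block's characters sit inside anchor tags, and treat high‑density blocks as navigation or boilerplate. A minimal sketch (the 0.33 threshold is illustrative):

```python
def link_density(block_text, anchor_texts):
    """Share of a block's characters that belong to link (<a>) text.
    Navigation and boilerplate blocks score high; body content low."""
    total = len(block_text)
    if total == 0:
        return 1.0                       # empty block: treat as boilerplate
    linked = sum(len(t) for t in anchor_texts)
    return min(linked / total, 1.0)

def is_content_block(block_text, anchor_texts, threshold=0.33):
    """Classify a block as main content when its link density is low."""
    return link_density(block_text, anchor_texts) < threshold
```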

AI · Multimodal Search · Content Understanding · Large-scale Indexing · Image Embedding
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
