Advances in Video Multimodal Retrieval: Video‑Text Semantic Search and Video‑Video Same‑Source Search
This article presents Ant Group's multimodal research on video retrieval, detailing video‑text semantic search and video‑video same‑source search. It introduces a large Chinese pre‑training dataset, along with novel pre‑training, hard‑sample mining, and fine‑grained modeling techniques, as well as an efficient end‑to‑end copyright detection framework.
The presentation shares the Ant Group multimodal team's research achievements over the past year in video multimodal retrieval, focusing on two main directions: improving video‑text semantic retrieval and enabling efficient video‑video same‑source retrieval.
Overview – Video multimodal retrieval is widely used internally and includes video‑text semantic search and video‑video same‑source search.
Video‑Text Semantic Retrieval – Three key improvements are described:
Video‑text pre‑training on a newly constructed Chinese dataset (CNVid‑3.5M) that boosts R@sum by 24.5%.
Hard‑sample mining (HSCL) that adds ~8.1% R@sum improvement, with both curriculum‑learning and adaptive methods (DMAE, NegNCE).
Fine‑grained modeling (S3, TPM‑CL) that introduces token‑importance prediction and ordered‑pair loss, yielding an additional ~2.8% gain.
The CNVid‑3.5M dataset contains 3.5 million high‑quality video‑text pairs after filtering low‑quality pairs using CLIP similarity thresholds.
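The CLIP‑based filtering step can be sketched as a cosine‑similarity cut‑off over precomputed video and caption embeddings. The threshold value below is a hypothetical placeholder; the actual cut‑off used to build CNVid‑3.5M is not stated in this summary.

```python
import numpy as np

def filter_pairs(video_embs, text_embs, threshold=0.3):
    """Keep only video-text pairs whose CLIP-style cosine similarity
    exceeds a threshold. `threshold=0.3` is an illustrative value,
    not the cut-off actually used for CNVid-3.5M."""
    # L2-normalize both embedding sets so the dot product is cosine similarity
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = np.sum(v * t, axis=1)      # per-pair cosine similarity
    keep = sims >= threshold          # boolean mask of pairs to retain
    return keep, sims
```

In practice the same idea scales to millions of pairs by batching the embedding computation and streaming the mask over the raw data.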
Hard‑Sample Mining Details – Two strategies are explored: manually scheduled curriculum learning based on contrastive similarity, and self‑adaptive methods (DMAE for expanding hard negatives and NegNCE for focusing on difficult negatives), together delivering ~5% R@sum improvement on both Chinese and English benchmarks.
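The idea behind NegNCE, focusing the loss on difficult negatives, can be illustrated by augmenting a standard InfoNCE term with an extra penalty on negatives that score above the positive. This is a minimal sketch of the concept, not the paper's exact formulation; the weighting scheme and `alpha` are assumptions.

```python
import numpy as np

def hard_negative_loss(sim, alpha=1.0):
    """Illustrative hard-negative-weighted contrastive loss.
    sim: [B, B] similarity matrix where diagonal entries are positives.
    Adds an extra penalty for negatives scoring above the positive
    (the actual NegNCE formulation may differ)."""
    B = sim.shape[0]
    # standard InfoNCE term: cross-entropy of each row against its diagonal
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    info_nce = -log_prob[np.arange(B), np.arange(B)].mean()
    # extra penalty: only negatives whose similarity exceeds the positive's
    pos = sim[np.arange(B), np.arange(B)][:, None]
    mask = ~np.eye(B, dtype=bool)
    hard = np.maximum(sim - pos, 0.0) * mask
    penalty = hard.sum() / B
    return info_nce + alpha * penalty
```

When all negatives score below the positives, the penalty vanishes and the loss reduces to plain InfoNCE; hard negatives shift gradient mass onto exactly the pairs the model confuses.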
Fine‑Grained Modeling Details – The S3 framework combines Mask Significant Semantic Modeling (MSSM) and Local‑Vision‑Word Matching (LVWM) to force the model to rely on important tokens and visual regions, achieving consistent gains across ResNet‑50 and PVT backbones.
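The core mechanism of MSSM, masking the most important tokens rather than random ones so the model must recover them from cross‑modal context, can be sketched as follows. The importance scores here are assumed inputs (e.g. from attention weights or TF‑IDF); the actual scoring function in S3 is not detailed in this summary.

```python
import numpy as np

def mask_significant_tokens(token_ids, importance, mask_ratio=0.3, mask_id=0):
    """Illustrative significant-semantic masking: mask the top-k most
    important tokens instead of a random subset. `importance` is an
    assumed per-token score; `mask_ratio` and `mask_id` are placeholders."""
    n = len(token_ids)
    k = max(1, int(n * mask_ratio))
    # indices of the k highest-importance tokens
    top = np.argsort(importance)[::-1][:k]
    masked = np.array(token_ids, dtype=int).copy()
    masked[top] = mask_id
    return masked, sorted(top.tolist())
```

Compared with uniform random masking, this forces the reconstruction objective onto semantically load‑bearing words, which is the behavior the S3 framework aims to induce.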
Video‑Video Same‑Source Retrieval – A proprietary end‑to‑end segment‑matching and localization method (SPD) is introduced, reducing storage by 85% and accelerating inference 18× while improving F1 by 2.78% compared to uniform‑frame baselines.
The system extracts key frames, builds a frame‑level feature library, and matches query frames against the library; a YOLO‑based pattern detector then runs on the resulting similarity matrix to locate infringing segments. Jointly training the key‑frame extractor and the SPD module further reduces storage and improves accuracy.
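The matching step above rests on a simple observation: a copied segment appears as a high‑similarity diagonal streak in the query‑vs‑reference similarity matrix. The sketch below builds that matrix and, as a stand‑in for the YOLO‑based detector (an assumption; the real SPD module learns this detection), finds the longest diagonal run above a threshold.

```python
import numpy as np

def frame_similarity_matrix(query_feats, ref_feats):
    """Cosine similarity matrix between query and reference key-frame
    features. Same-source segments show up as diagonal streaks."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return q @ r.T

def find_diagonal_segment(sim, threshold=0.8):
    """Toy stand-in for the SPD detector: return the longest run of
    consecutive above-threshold matches along any diagonal, as
    (query_start, ref_start, length), or None if nothing matches."""
    best = None
    Q, R = sim.shape
    for offset in range(-Q + 1, R):
        run = start = 0
        for i in range(Q):
            j = i + offset
            if 0 <= j < R and sim[i, j] >= threshold:
                if run == 0:
                    start = i
                run += 1
                if best is None or run > best[2]:
                    best = (start, start + offset, run)
            else:
                run = 0
    return best
```

The learned detector replaces this brute‑force diagonal scan with a single forward pass over the similarity matrix, which is where the reported inference speed‑up comes from.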
Summary – The work demonstrates that large‑scale video‑text pre‑training, hard‑sample mining, fine‑grained token‑level modeling, and efficient key‑frame‑based same‑source detection together substantially advance video multimodal retrieval performance and cost efficiency.
Q&A – The session addresses practical questions about key‑frame annotation, model open‑source plans, integration of multimodal embeddings into recommendation pipelines, storage media, real‑time inference, and the vector database (internal "Qianxun" similar to Faiss) used for large‑scale retrieval.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.