Multimodal AI-Powered Video Content Moderation System Using Chinese CLIP and Vector Search
This article describes a multimodal AI video moderation system built on Alibaba's Chinese-CLIP model and a hybrid RedisSearch/ElasticSearch vector-database setup. The system supports both real-time violation detection and historical recall, and combines fine-tuned black-market ad detection, FP16 quantization, and OpenVINO acceleration to boost inference speed and cut storage costs.
This article details the implementation of a video content moderation system that leverages multimodal AI technology to detect prohibited content in videos. The system addresses two key business requirements: real-time detection of violations in incoming videos and historical data recall when new violation standards are established.
Business Background: The moderation system needs to identify various types of prohibited content including defamatory videos about leaders, advertisements for illegal gambling/pornography websites, and malicious misinformation about political events. The initial approach using simple image feature matching faced challenges with limited generalization capability and high storage costs (approximately 3GB daily for 200k videos with 8 screenshots each).
Solution - Multimodal Detection: The system employs Alibaba's Chinese-CLIP model, which aligns image and text feature representations in a shared embedding space. This enables both image-to-image and image-to-text matching, significantly improving recall rates. The model architecture consists of a RoBERTa-based text encoder (12 layers, 102M parameters) and a Vision Transformer visual encoder (12 layers, 86M parameters). Training aligns paired image-text embeddings with a contrastive loss computed over their cosine similarities: matched pairs are pulled together while mismatched pairs in the same batch are pushed apart.
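The alignment objective can be sketched as follows. This is a minimal numpy illustration of a CLIP-style symmetric contrastive loss over cosine similarities, not the actual Chinese-CLIP training code; the temperature value is an assumption.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    Matched image/text pairs sit on the diagonal of the similarity matrix;
    the loss pulls each image toward its own caption and pushes it away
    from every other caption in the batch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) scaled cosine similarities
    labels = np.arange(len(logits))              # diagonal entries are positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions, as in CLIP
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_aligned = clip_contrastive_loss(img, img)                    # perfect pairs
loss_random = clip_contrastive_loss(img, rng.normal(size=(4, 8))) # unrelated pairs
```

Perfectly aligned pairs drive the loss toward its minimum, while unrelated pairs leave it near the uniform baseline, which is what makes the shared embedding space usable for cross-modal retrieval.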
Implementation Pipeline: 1) Build a violation sample database with both images and text descriptions; 2) For incoming videos, extract keyframe features and perform KNN retrieval against the violation database; 3) Apply threshold-based classification for direct violation detection or suspected violation flagging; 4) Store features in historical vector database for future recall.
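Steps 2 and 3 of the pipeline can be sketched in a few lines. The thresholds below are hypothetical (the article applies threshold-based classification but does not publish its cut-off values), and brute-force numpy retrieval stands in for the production vector database.

```python
import numpy as np

# Hypothetical thresholds; the real system's values are not given in the article.
VIOLATION_THRESHOLD = 0.90   # best match at/above this: direct violation
SUSPECT_THRESHOLD = 0.75     # at/above this: flag as suspected for review

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def moderate_keyframe(frame_emb, violation_db, k=3):
    """Classify one keyframe via KNN retrieval against the violation sample DB.

    frame_emb:    (D,) feature from the Chinese-CLIP visual encoder
    violation_db: (N, D) features of known violation samples
    Returns "violation", "suspected", or "pass".
    """
    sims = normalize(violation_db) @ normalize(frame_emb)  # cosine similarities
    best = np.sort(sims)[-k:][::-1][0]                     # top of the k nearest
    if best >= VIOLATION_THRESHOLD:
        return "violation"
    if best >= SUSPECT_THRESHOLD:
        return "suspected"
    return "pass"

rng = np.random.default_rng(1)
db = rng.normal(size=(100, 64))              # stand-in violation sample features
hit = db[7] + 0.01 * rng.normal(size=64)     # near-duplicate of a known sample
miss = rng.normal(size=64)                   # unrelated content
```

A near-duplicate of a known violation sample scores near 1.0 and is blocked outright, while unrelated content falls below the suspect threshold and passes.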
Model Fine-tuning: For black market advertising detection, the model was fine-tuned using 1K images (positive, negative, and suspected samples) augmented to 12K. The fine-tuned model achieved significant improvements: image-to-text recall improved from 67.79% to 98.88% (a 45.86% relative gain), and text-to-image recall improved from 37.68% to 88.53% (+50.85 percentage points).
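The article states only that 1K labeled images were expanded to 12K training samples; the augmentation recipe itself is not described. As a minimal sketch, assuming simple geometric and photometric transforms, expanding each image into several variants looks like this:

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented variants of one (H, W, C) image array.

    The specific transforms (flips, crop, brightness jitter) are assumptions;
    the article does not detail how the 1K -> 12K expansion was done.
    """
    variants = [image]
    variants.append(image[:, ::-1, :])                       # horizontal flip
    variants.append(image[::-1, :, :])                       # vertical flip
    h, w = image.shape[:2]
    top, left = rng.integers(0, h // 4), rng.integers(0, w // 4)
    variants.append(image[top:top + 3 * h // 4,
                          left:left + 3 * w // 4, :])        # random crop
    for gain in (0.8, 1.2):                                  # brightness jitter
        variants.append(np.clip(image * gain, 0, 255))
    return variants

rng = np.random.default_rng(0)
dataset = [rng.integers(0, 256, size=(32, 32, 3)).astype(np.float32)
           for _ in range(10)]
# Each image yields 6 variants, so 10 source images become 60 samples
augmented = [v for img in dataset for v in augment(img, rng)]
```

Augmenting a small labeled set this way is a standard route to a fine-tuning corpus large enough to avoid overfitting on only 1K originals.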
Vector Search Architecture: The system uses different vector databases for different scenarios: RedisSearch for real-time detection (small dataset, high speed requirement) and ElasticSearch for historical recall (large dataset, lower speed requirement). This hybrid approach optimizes both performance and cost.
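The routing logic behind this hybrid design can be sketched with in-memory stand-ins. The `BruteForceIndex` below is a hypothetical placeholder, not the RedisSearch or ElasticSearch client API; the point is the split between a small hot set for real-time checks and a large historical store for recall.

```python
import numpy as np

class BruteForceIndex:
    """In-memory stand-in for a vector index. In production, 'realtime'
    would be backed by RedisSearch and 'historical' by ElasticSearch."""
    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, doc_id, vec):
        self.ids.append(doc_id)
        self.vecs.append(vec / np.linalg.norm(vec))   # store unit vectors

    def knn(self, query, k=5):
        q = query / np.linalg.norm(query)
        sims = np.stack(self.vecs) @ q                # cosine similarities
        order = np.argsort(sims)[::-1][:k]
        return [(self.ids[i], float(sims[i])) for i in order]

class ModerationStore:
    """Routes writes so real-time detection queries a small, fast violation
    set while historical recall scans the full archive when new standards land."""
    def __init__(self):
        self.realtime = BruteForceIndex()     # RedisSearch role: small, low-latency
        self.historical = BruteForceIndex()   # ElasticSearch role: large, cheaper

    def ingest(self, doc_id, vec, is_violation_sample=False):
        self.historical.add(doc_id, vec)      # every frame lands in history
        if is_violation_sample:
            self.realtime.add(doc_id, vec)    # hot set stays small

store = ModerationStore()
rng = np.random.default_rng(2)
for i in range(50):
    store.ingest(f"frame-{i}", rng.normal(size=32), is_violation_sample=(i < 5))
probe = store.realtime.vecs[0] * 2.0          # same direction as a known violation
top_id, top_sim = store.realtime.knn(probe, k=1)[0]
```

Keeping the real-time index restricted to violation samples is what lets it stay fast, while the historical store grows without bounding query latency for the live path.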
Performance Optimization: Model inference was accelerated using OpenVINO on Intel CPU, yielding a roughly 3.3x throughput gain (a 226.95% improvement, from 368.72ms to 112.78ms per image), while FP16 quantization halved feature storage.
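The OpenVINO compilation step itself is environment-specific and not shown here, but the FP16 storage saving is easy to demonstrate: casting stored feature vectors from FP32 to FP16 halves the bytes while leaving cosine-similarity retrieval essentially unchanged. The 512-dim feature size below is an assumption for illustration.

```python
import numpy as np

# Stand-in FP32 feature vectors as produced by the visual encoder
rng = np.random.default_rng(3)
feats32 = rng.normal(size=(1000, 512)).astype(np.float32)

# FP16 quantization before storage: exactly half the bytes per vector
feats16 = feats32.astype(np.float16)
bytes32 = feats32.nbytes    # 1000 * 512 * 4 bytes
bytes16 = feats16.nbytes    # 1000 * 512 * 2 bytes

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval quality is essentially unaffected: cosine similarities computed
# from the FP16 copies track the FP32 originals very closely.
q = rng.normal(size=512).astype(np.float32)
drift = max(abs(cos(q, feats32[i]) - cos(q, feats16[i].astype(np.float32)))
            for i in range(100))
```

Because KNN retrieval only compares relative similarities, the tiny per-element rounding error from FP16 storage has no practical effect on which neighbors are returned.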
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.