Multimodal AI-Powered Video Content Moderation System Using Chinese CLIP and Vector Search
This article describes a multimodal AI video moderation system built on Alibaba's Chinese-CLIP model and a hybrid RedisSearch/ElasticSearch vector-database setup. The system supports both real-time violation detection and historical recall, and combines fine-tuned black-market ad detection, FP16 quantization, and OpenVINO acceleration to boost inference speed and cut storage costs.
This article details the implementation of a video content moderation system that leverages multimodal AI technology to detect prohibited content in videos. The system addresses two key business requirements: real-time detection of violations in incoming videos and historical data recall when new violation standards are established.
Business Background: The moderation system needs to identify various types of prohibited content including defamatory videos about leaders, advertisements for illegal gambling/pornography websites, and malicious misinformation about political events. The initial approach using simple image feature matching faced challenges with limited generalization capability and high storage costs (approximately 3GB daily for 200k videos with 8 screenshots each).
Solution - Multimodal Detection: The system employs Alibaba's Chinese-CLIP model, which aligns image and text feature representations in a shared embedding space. This enables both image-to-image and image-to-text matching, significantly improving recall rates. The model architecture consists of a RoBERTa-based text encoder (12 layers, 102M parameters) and a Vision Transformer visual encoder (12 layers, 86M parameters). Training aligns paired image-text embeddings with a contrastive loss computed over their cosine similarities: matched pairs are pulled together while mismatched pairs in the same batch are pushed apart.
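The alignment objective can be sketched as follows. This is a minimal numpy illustration of a CLIP-style symmetric contrastive loss over cosine similarities, not the actual Chinese-CLIP training code; the temperature value is an assumption.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    Matched image/text pairs sit on the diagonal of the similarity matrix;
    the loss pulls each image toward its own caption and pushes it away
    from every other caption in the batch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) scaled cosine similarities
    labels = np.arange(len(logits))              # diagonal entries are positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions, as in CLIP
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_aligned = clip_contrastive_loss(img, img)                    # perfect pairs
loss_random = clip_contrastive_loss(img, rng.normal(size=(4, 8))) # unrelated pairs
```

Perfectly aligned pairs drive the loss toward its minimum, while unrelated pairs leave it near the uniform baseline, which is what makes the shared embedding space usable for cross-modal retrieval.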
Implementation Pipeline: 1) Build a violation sample database with both images and text descriptions; 2) For incoming videos, extract keyframe features and perform KNN retrieval against the violation database; 3) Apply threshold-based classification for direct violation detection or suspected violation flagging; 4) Store features in historical vector database for future recall.
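Steps 2 and 3 of the pipeline can be sketched in a few lines. The thresholds below are hypothetical (the article applies threshold-based classification but does not publish its cut-off values), and brute-force numpy retrieval stands in for the production vector database.

```python
import numpy as np

# Hypothetical thresholds; the real system's values are not given in the article.
VIOLATION_THRESHOLD = 0.90   # best match at/above this: direct violation
SUSPECT_THRESHOLD = 0.75     # at/above this: flag as suspected for review

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def moderate_keyframe(frame_emb, violation_db, k=3):
    """Classify one keyframe via KNN retrieval against the violation sample DB.

    frame_emb:    (D,) feature from the Chinese-CLIP visual encoder
    violation_db: (N, D) features of known violation samples
    Returns "violation", "suspected", or "pass".
    """
    sims = normalize(violation_db) @ normalize(frame_emb)  # cosine similarities
    best = np.sort(sims)[-k:][::-1][0]                     # top of the k nearest
    if best >= VIOLATION_THRESHOLD:
        return "violation"
    if best >= SUSPECT_THRESHOLD:
        return "suspected"
    return "pass"

rng = np.random.default_rng(1)
db = rng.normal(size=(100, 64))              # stand-in violation sample features
hit = db[7] + 0.01 * rng.normal(size=64)     # near-duplicate of a known sample
miss = rng.normal(size=64)                   # unrelated content
```

A near-duplicate of a known violation sample scores near 1.0 and is blocked outright, while unrelated content falls below the suspect threshold and passes.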
Model Fine-tuning: For black market advertising detection, the model was fine-tuned using 1K images (positive, negative, and suspected samples) augmented to 12K. The fine-tuned model achieved significant improvements: image-to-text recall improved from 67.79% to 98.88% (a 45.86% relative gain), and text-to-image recall improved from 37.68% to 88.53% (+50.85 percentage points).
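The article states only that 1K labeled images were expanded to 12K training samples; the augmentation recipe itself is not described. As a minimal sketch, assuming simple geometric and photometric transforms, expanding each image into several variants looks like this:

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented variants of one (H, W, C) image array.

    The specific transforms (flips, crop, brightness jitter) are assumptions;
    the article does not detail how the 1K -> 12K expansion was done.
    """
    variants = [image]
    variants.append(image[:, ::-1, :])                       # horizontal flip
    variants.append(image[::-1, :, :])                       # vertical flip
    h, w = image.shape[:2]
    top, left = rng.integers(0, h // 4), rng.integers(0, w // 4)
    variants.append(image[top:top + 3 * h // 4,
                          left:left + 3 * w // 4, :])        # random crop
    for gain in (0.8, 1.2):                                  # brightness jitter
        variants.append(np.clip(image * gain, 0, 255))
    return variants

rng = np.random.default_rng(0)
dataset = [rng.integers(0, 256, size=(32, 32, 3)).astype(np.float32)
           for _ in range(10)]
# Each image yields 6 variants, so 10 source images become 60 samples
augmented = [v for img in dataset for v in augment(img, rng)]
```

Augmenting a small labeled set this way is a standard route to a fine-tuning corpus large enough to avoid overfitting on only 1K originals.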
Vector Search Architecture: The system uses different vector databases for different scenarios: RedisSearch for real-time detection (small dataset, high speed requirement) and ElasticSearch for historical recall (large dataset, lower speed requirement). This hybrid approach optimizes both performance and cost.
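The routing logic behind this hybrid design can be sketched with in-memory stand-ins. The `BruteForceIndex` below is a hypothetical placeholder, not the RedisSearch or ElasticSearch client API; the point is the split between a small hot set for real-time checks and a large historical store for recall.

```python
import numpy as np

class BruteForceIndex:
    """In-memory stand-in for a vector index. In production, 'realtime'
    would be backed by RedisSearch and 'historical' by ElasticSearch."""
    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, doc_id, vec):
        self.ids.append(doc_id)
        self.vecs.append(vec / np.linalg.norm(vec))   # store unit vectors

    def knn(self, query, k=5):
        q = query / np.linalg.norm(query)
        sims = np.stack(self.vecs) @ q                # cosine similarities
        order = np.argsort(sims)[::-1][:k]
        return [(self.ids[i], float(sims[i])) for i in order]

class ModerationStore:
    """Routes writes so real-time detection queries a small, fast violation
    set while historical recall scans the full archive when new standards land."""
    def __init__(self):
        self.realtime = BruteForceIndex()     # RedisSearch role: small, low-latency
        self.historical = BruteForceIndex()   # ElasticSearch role: large, cheaper

    def ingest(self, doc_id, vec, is_violation_sample=False):
        self.historical.add(doc_id, vec)      # every frame lands in history
        if is_violation_sample:
            self.realtime.add(doc_id, vec)    # hot set stays small

store = ModerationStore()
rng = np.random.default_rng(2)
for i in range(50):
    store.ingest(f"frame-{i}", rng.normal(size=32), is_violation_sample=(i < 5))
probe = store.realtime.vecs[0] * 2.0          # same direction as a known violation
top_id, top_sim = store.realtime.knn(probe, k=1)[0]
```

Keeping the real-time index restricted to violation samples is what lets it stay fast, while the historical store grows without bounding query latency for the live path.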
Performance Optimization: Model inference was accelerated using OpenVINO on Intel CPU, yielding a roughly 3.3x throughput gain (a 226.95% improvement, from 368.72ms to 112.78ms per image), while FP16 quantization halved feature storage.
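The OpenVINO compilation step itself is environment-specific and not shown here, but the FP16 storage saving is easy to demonstrate: casting stored feature vectors from FP32 to FP16 halves the bytes while leaving cosine-similarity retrieval essentially unchanged. The 512-dim feature size below is an assumption for illustration.

```python
import numpy as np

# Stand-in FP32 feature vectors as produced by the visual encoder
rng = np.random.default_rng(3)
feats32 = rng.normal(size=(1000, 512)).astype(np.float32)

# FP16 quantization before storage: exactly half the bytes per vector
feats16 = feats32.astype(np.float16)
bytes32 = feats32.nbytes    # 1000 * 512 * 4 bytes
bytes16 = feats16.nbytes    # 1000 * 512 * 2 bytes

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval quality is essentially unaffected: cosine similarities computed
# from the FP16 copies track the FP32 originals very closely.
q = rng.normal(size=512).astype(np.float32)
drift = max(abs(cos(q, feats32[i]) - cos(q, feats16[i].astype(np.float32)))
            for i in range(100))
```

Because KNN retrieval only compares relative similarities, the tiny per-element rounding error from FP16 storage has no practical effect on which neighbors are returned.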
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.