Multimodal Soft‑Porn Detection for Short Videos: Models, Challenges, and Lessons Learned
The article describes iQIYI's multimodal soft‑porn detection system for short videos, covering challenges like subjective definitions and class imbalance, and detailing text (Convolutional Bi‑LSTM), image (Xception‑CBAM), video (NeXtVLAD) models, integration strategies, key takeaways, and future improvements.
With the explosive growth of user-generated content, short-video platforms occupy an ever-larger share of users' time. As a leading Chinese video media service, iQIYI bears a social responsibility to guide users toward healthy content, so detecting and blocking low-quality or soft-pornographic material is a critical task. This article introduces iQIYI's multimodal soft-porn detection technology from a technical perspective.
Challenges
• Standardization difficulty: Human definitions of soft porn are highly subjective; various scenes (e.g., revealing clothing, kissing, suggestive poses) lead to inconsistent labeling standards.
• Severe class imbalance: Soft‑porn videos constitute a tiny fraction of the overall video pool, making model training difficult.
• Multimodal learning: Determining whether a video is soft porn requires integrating textual metadata, cover images, video frames, and audio, which calls for a multi‑task, multimodal feature‑fusion model.
1. Text Soft‑Porn Detection
Common text-classification baselines such as FastText, TextCNN, and TextRNN are discussed. iQIYI adopts a Convolutional Bi-LSTM architecture: several convolutional layers first reduce the dimensionality of token embeddings, and a bidirectional LSTM then captures sequential semantics in both directions. This combines the local pattern modeling strength of CNNs with the global sequence modeling of LSTMs.
Figure 1. Text Soft‑Porn Classification Model – Convolutional Bi‑LSTM
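To make the Convolutional Bi-LSTM idea concrete, here is a minimal NumPy sketch of the forward pass. It is an illustration, not iQIYI's implementation: a plain tanh recurrence stands in for the LSTM cells, and every weight shape is invented for the example.

```python
import numpy as np

def conv1d_relu(x, w):
    """1-D convolution over token embeddings with ReLU.
    x: (seq_len, emb_dim), w: (kernel, emb_dim, out_dim)."""
    k, _, out_dim = w.shape
    out = np.zeros((x.shape[0] - k + 1, out_dim))
    for i in range(out.shape[0]):
        out[i] = np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

def recur(x, wx, wh):
    """Simple tanh recurrence (illustrative stand-in for an LSTM cell)."""
    h = np.zeros(wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh)
    return h

def conv_bilstm_features(x, wc, wx, wh):
    """Convolve token embeddings, then read the result in both directions."""
    c = conv1d_relu(x, wc)
    return np.concatenate([recur(c, wx, wh), recur(c[::-1], wx, wh)])

# Toy example: 12 tokens with 16-dim embeddings -> a 12-dim text feature.
np.random.seed(0)
feat = conv_bilstm_features(
    np.random.randn(12, 16),
    np.random.randn(3, 16, 8) * 0.1,   # conv kernel width 3, 8 filters
    np.random.randn(8, 6) * 0.1,       # input-to-hidden weights
    np.random.randn(6, 6) * 0.1,       # hidden-to-hidden weights
)
print(feat.shape)  # (12,)
```

The convolution shortens and compresses the token sequence before the recurrent pass, which is what makes the combination cheaper than running an LSTM over raw embeddings.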
2. Cover‑Image Soft‑Porn Detection
The image pipeline fine‑tunes a pre‑trained ImageNet model (Xception) to extract frame‑level features. To preserve spatial and channel information, a CBAM (Convolutional Block Attention Module) is inserted between Xception block‑14 and global average pooling, yielding an Xception‑CBAM model.
Figure 2. Xception Model Structure
Figure 3. Xception‑CBAM Cover‑Image Model
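As a rough illustration of what CBAM adds on top of the backbone features, the sketch below implements its two attention steps in NumPy for a single (H, W, C) feature map. It simplifies the spatial branch to a 1x1 mixing of the pooled maps (the CBAM paper uses a 7x7 convolution), and all weights are invented for the example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(x, w1, w2):
    """Reweight channels via a shared MLP over avg- and max-pooled stats.
    x: (H, W, C) feature map; w1/w2: bottleneck MLP weights."""
    avg, mx = x.mean(axis=(0, 1)), x.max(axis=(0, 1))
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2
    return x * sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(x, w):
    """Reweight spatial positions from per-position avg/max channel stats.
    Simplified: a 1x1 mixing instead of CBAM's 7x7 convolution."""
    stats = np.concatenate(
        [x.mean(axis=2, keepdims=True), x.max(axis=2, keepdims=True)], axis=2)
    return x * sigmoid(stats @ w)  # (H, W, 2) @ (2, 1) -> (H, W, 1)

# Toy 4x4 feature map with 8 channels: channel attention, then spatial.
np.random.seed(1)
x = np.random.rand(4, 4, 8)
y = spatial_attention(
    channel_attention(x, np.random.randn(8, 2), np.random.randn(2, 8)),
    np.random.randn(2, 1))
print(y.shape)  # (4, 4, 8)
```

Both steps only rescale the existing features (each sigmoid gate lies in (0, 1)), which is why the module can be dropped between block-14 and global average pooling without changing tensor shapes.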
3. Video‑Content Soft‑Porn Detection
The video pipeline consists of three modules: feature extraction, multimodal feature fusion, and classification. For each video, ten RGB key frames are sampled and processed by the fine‑tuned Xception‑CBAM to obtain frame‑level features. Various fusion strategies (Bi‑LSTM, Bi‑LSTM + Attention, NetVLAD, NeXtVLAD) were evaluated, with NeXtVLAD ultimately selected.
Figure 4. NetVLAD Feature Fusion
Figure 5. Difference Between NetVLAD and NeXtVLAD
Figure 6. Final Video Soft‑Porn Model Based on NeXtVLAD
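To make the aggregation step concrete, here is a minimal NumPy sketch of NetVLAD-style pooling: each frame feature is softly assigned to learned cluster centers, and the assignment-weighted residuals are summed and intra-normalized. NeXtVLAD additionally splits features into lower-dimensional groups before this step to cut parameters; that refinement is omitted here, and all shapes and weights are illustrative.

```python
import numpy as np

def netvlad(frames, centers, w, b):
    """NetVLAD-style aggregation of frame-level features.
    frames: (N, D), centers: (K, D), w: (D, K), b: (K,). Returns (K*D,)."""
    logits = frames @ w + b                      # soft-assignment logits (N, K)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # soft assignments (N, K)
    resid = frames[:, None, :] - centers[None, :, :]      # residuals (N, K, D)
    v = (a[:, :, None] * resid).sum(axis=0)               # per-cluster sums (K, D)
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-8  # intra-normalization
    return v.ravel()

# Ten sampled frames with 16-dim features, four clusters -> 64-dim video feature.
np.random.seed(2)
video_feat = netvlad(
    np.random.randn(10, 16), np.random.randn(4, 16),
    np.random.randn(16, 4), np.zeros(4))
print(video_feat.shape)  # (64,)
```

The output length is fixed at K*D regardless of how many frames are sampled, which is what makes VLAD-style pooling convenient for variable-length video.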
4. Integrated Model
Two integration strategies were explored:
• Pipeline: Sequentially apply text → image → video classifiers.
• End‑to‑end multi‑task learning: A single model jointly predicts four binary tasks (text, image, video, feed soft‑porn).
Figure 7. Pipeline Integration Structure
Figure 8. End‑to‑End Multi‑Task Integration Structure
The table below compares the two approaches:
| | Pipeline | End-to-End |
|---|---|---|
| Sample annotation | Annotations for the text, image, and video models can be decoupled. | Each sample must be annotated for all three modalities, increasing cost. |
| Feature interaction | Features are independent across modalities. | Features interact across modalities, improving detection performance. |
| Model deployment | Three sub-models; higher resource consumption but flexible. | Single model; lower deployment cost but less flexible. |
| Soft-porn recognition | Sequential; an error in any sub-model propagates. | All modalities are evaluated jointly. |
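Under the end-to-end scheme, the four binary heads can share a backbone and be trained with a weighted sum of per-task binary cross-entropy losses. The sketch below shows only that combination step; the task names and unit weights are illustrative, not iQIYI's actual configuration.

```python
import numpy as np

def multitask_bce(logits, labels, weights):
    """Weighted sum of per-task binary cross-entropy losses.
    logits, labels, weights: dicts keyed by task name."""
    eps = 1e-8
    total = 0.0
    for task, z in logits.items():
        p = 1.0 / (1.0 + np.exp(-z))  # sigmoid probability for this head
        y = labels[task]
        total += weights[task] * -(y * np.log(p + eps)
                                   + (1 - y) * np.log(1 - p + eps))
    return total

# Four binary heads: text, image, video, and the overall feed soft-porn label.
logits = {"text": 2.0, "image": -1.5, "video": 0.4, "feed": 1.1}
labels = {"text": 1, "image": 0, "video": 1, "feed": 1}
loss = multitask_bce(logits, labels, {t: 1.0 for t in logits})
print(loss > 0)  # True
```

Because the heads share features, a gradient from any one task updates the common backbone, which is the feature-interaction benefit the table above attributes to the end-to-end design.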
Key Takeaways
• Early data quality control is crucial; continuous QA improves annotation reliability.
• Data augmentation and label correction significantly boost performance for the scarce soft-porn class.
• Class imbalance is mitigated with focal loss and higher weighting for positive samples.
• Model architecture improvements are often outweighed by better data and feature engineering.
• Understanding data distribution and conducting thorough bad-case analysis guide effective model upgrades.
• Documenting experiments (purpose, conclusion, next steps) accelerates iteration.
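The focal-loss-plus-positive-weighting recipe mentioned above can be sketched as follows; the alpha and gamma values are common illustrative defaults, not the tuned production settings.

```python
import numpy as np

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Alpha-weighted binary focal loss.
    p: predicted positive-class probabilities; y: labels in {0, 1}.
    alpha > 0.5 up-weights the scarce positive (soft-porn) class;
    gamma > 0 down-weights easy, well-classified samples."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    pt = np.where(y == 1, p, 1.0 - p)         # probability of the true class
    a = np.where(y == 1, alpha, 1.0 - alpha)  # class weighting
    return float(np.mean(-a * (1.0 - pt) ** gamma * np.log(pt)))

# A confident correct positive contributes far less than an uncertain one.
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.55]), np.array([1]))
print(easy < hard)  # True
```

With gamma set to 0 and alpha to 0.5, the expression reduces to ordinary (halved) binary cross-entropy, so the two knobs can be tuned independently: alpha against class imbalance, gamma against easy-sample domination.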
Future Optimizations
• Upgrade the text model beyond the current Convolutional Bi-LSTM and explore richer token embeddings.
• Leverage video metadata (uploader, duration, description, category) for additional signals.
• Expand the multimodal multi-task model with more video-labeled samples and a more effective feature-fusion module.
References
Joulin A. et al., "Bag of Tricks for Efficient Text Classification," EACL 2017.
Kim Y., "Convolutional Neural Networks for Sentence Classification," EMNLP 2014.
Liu P. et al., "Recurrent Neural Network for Text Classification with Multi-Task Learning," IJCAI 2016.
He K. et al., "Deep Residual Learning for Image Recognition," CVPR 2016.
Chollet F., "Xception: Deep Learning with Depthwise Separable Convolutions," CVPR 2017.
Hu J. et al., "Squeeze-and-Excitation Networks," CVPR 2018.
Woo S. et al., "CBAM: Convolutional Block Attention Module," ECCV 2018.
Arandjelović R. et al., "NetVLAD: CNN Architecture for Weakly Supervised Place Recognition," CVPR 2016.
Lin R. et al., "NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification," ECCV Workshops 2018.
iQIYI Technical Product Team