Multimodal Soft‑Porn Detection for Short Videos: Models, Challenges, and Lessons Learned
The article describes iQIYI's multimodal soft‑porn detection system for short videos, covering challenges like subjective definitions and class imbalance, and detailing text (Convolutional Bi‑LSTM), image (Xception‑CBAM), video (NeXtVLAD) models, integration strategies, key takeaways, and future improvements.
With the explosive growth of user-generated content, short-video platforms occupy an ever-larger share of users' time. As a leading Chinese video media service, iQIYI bears a social responsibility to guide users toward healthy content, so detecting and blocking low-quality or soft-pornographic material is a critical task. This article introduces iQIYI's multimodal soft-porn detection technology from a technical perspective.
Challenges
• Standardization difficulty: Human definitions of soft porn are highly subjective; various scenes (e.g., revealing clothing, kissing, suggestive poses) lead to inconsistent labeling standards.
• Severe class imbalance: Soft‑porn videos constitute a tiny fraction of the overall video pool, making model training difficult.
• Multimodal learning: Determining whether a video is soft porn requires integrating textual metadata, cover images, video frames, and audio, which calls for a multi‑task, multimodal feature‑fusion model.
1. Text Soft‑Porn Detection
Common text-classification baselines such as FastText, TextCNN, and TextRNN are discussed. iQIYI adopts a Convolutional Bi-LSTM architecture: several convolutional layers first reduce the dimensionality of token embeddings, and a bidirectional LSTM then captures sequential semantics in both directions. This combines the local pattern modeling strength of CNNs with the global sequence modeling of LSTMs.
Figure 1. Text Soft‑Porn Classification Model – Convolutional Bi‑LSTM
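To make the Convolutional Bi-LSTM idea concrete, here is a minimal NumPy sketch of the forward pass. It is an illustration, not iQIYI's implementation: a plain tanh recurrence stands in for the LSTM cells, and every weight shape is invented for the example.

```python
import numpy as np

def conv1d_relu(x, w):
    """1-D convolution over token embeddings with ReLU.
    x: (seq_len, emb_dim), w: (kernel, emb_dim, out_dim)."""
    k, _, out_dim = w.shape
    out = np.zeros((x.shape[0] - k + 1, out_dim))
    for i in range(out.shape[0]):
        out[i] = np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

def recur(x, wx, wh):
    """Simple tanh recurrence (illustrative stand-in for an LSTM cell)."""
    h = np.zeros(wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh)
    return h

def conv_bilstm_features(x, wc, wx, wh):
    """Convolve token embeddings, then read the result in both directions."""
    c = conv1d_relu(x, wc)
    return np.concatenate([recur(c, wx, wh), recur(c[::-1], wx, wh)])

# Toy example: 12 tokens with 16-dim embeddings -> a 12-dim text feature.
np.random.seed(0)
feat = conv_bilstm_features(
    np.random.randn(12, 16),
    np.random.randn(3, 16, 8) * 0.1,   # conv kernel width 3, 8 filters
    np.random.randn(8, 6) * 0.1,       # input-to-hidden weights
    np.random.randn(6, 6) * 0.1,       # hidden-to-hidden weights
)
print(feat.shape)  # (12,)
```

The convolution shortens and compresses the token sequence before the recurrent pass, which is what makes the combination cheaper than running an LSTM over raw embeddings.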
2. Cover‑Image Soft‑Porn Detection
The image pipeline fine‑tunes a pre‑trained ImageNet model (Xception) to extract frame‑level features. To preserve spatial and channel information, a CBAM (Convolutional Block Attention Module) is inserted between Xception block‑14 and global average pooling, yielding an Xception‑CBAM model.
Figure 2. Xception Model Structure
Figure 3. Xception‑CBAM Cover‑Image Model
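As a rough illustration of what CBAM adds on top of the backbone features, the sketch below implements its two attention steps in NumPy for a single (H, W, C) feature map. It simplifies the spatial branch to a 1x1 mixing of the pooled maps (the CBAM paper uses a 7x7 convolution), and all weights are invented for the example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(x, w1, w2):
    """Reweight channels via a shared MLP over avg- and max-pooled stats.
    x: (H, W, C) feature map; w1/w2: bottleneck MLP weights."""
    avg, mx = x.mean(axis=(0, 1)), x.max(axis=(0, 1))
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2
    return x * sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(x, w):
    """Reweight spatial positions from per-position avg/max channel stats.
    Simplified: a 1x1 mixing instead of CBAM's 7x7 convolution."""
    stats = np.concatenate(
        [x.mean(axis=2, keepdims=True), x.max(axis=2, keepdims=True)], axis=2)
    return x * sigmoid(stats @ w)  # (H, W, 2) @ (2, 1) -> (H, W, 1)

# Toy 4x4 feature map with 8 channels: channel attention, then spatial.
np.random.seed(1)
x = np.random.rand(4, 4, 8)
y = spatial_attention(
    channel_attention(x, np.random.randn(8, 2), np.random.randn(2, 8)),
    np.random.randn(2, 1))
print(y.shape)  # (4, 4, 8)
```

Both steps only rescale the existing features (each sigmoid gate lies in (0, 1)), which is why the module can be dropped between block-14 and global average pooling without changing tensor shapes.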
3. Video‑Content Soft‑Porn Detection
The video pipeline consists of three modules: feature extraction, multimodal feature fusion, and classification. For each video, ten RGB key frames are sampled and processed by the fine‑tuned Xception‑CBAM to obtain frame‑level features. Various fusion strategies (Bi‑LSTM, Bi‑LSTM + Attention, NetVLAD, NeXtVLAD) were evaluated, with NeXtVLAD ultimately selected.
Figure 4. NetVLAD Feature Fusion
Figure 5. Difference Between NetVLAD and NeXtVLAD
Figure 6. Final Video Soft‑Porn Model Based on NeXtVLAD
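To make the aggregation step concrete, here is a minimal NumPy sketch of NetVLAD-style pooling: each frame feature is softly assigned to learned cluster centers, and the assignment-weighted residuals are summed and intra-normalized. NeXtVLAD additionally splits features into lower-dimensional groups before this step to cut parameters; that refinement is omitted here, and all shapes and weights are illustrative.

```python
import numpy as np

def netvlad(frames, centers, w, b):
    """NetVLAD-style aggregation of frame-level features.
    frames: (N, D), centers: (K, D), w: (D, K), b: (K,). Returns (K*D,)."""
    logits = frames @ w + b                      # soft-assignment logits (N, K)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # soft assignments (N, K)
    resid = frames[:, None, :] - centers[None, :, :]      # residuals (N, K, D)
    v = (a[:, :, None] * resid).sum(axis=0)               # per-cluster sums (K, D)
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-8  # intra-normalization
    return v.ravel()

# Ten sampled frames with 16-dim features, four clusters -> 64-dim video feature.
np.random.seed(2)
video_feat = netvlad(
    np.random.randn(10, 16), np.random.randn(4, 16),
    np.random.randn(16, 4), np.zeros(4))
print(video_feat.shape)  # (64,)
```

The output length is fixed at K*D regardless of how many frames are sampled, which is what makes VLAD-style pooling convenient for variable-length video.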
4. Integrated Model
Two integration strategies were explored:
• Pipeline: Sequentially apply text → image → video classifiers.
• End‑to‑end multi‑task learning: A single model jointly predicts four binary tasks (text, image, video, feed soft‑porn).
Figure 7. Pipeline Integration Structure
Figure 8. End‑to‑End Multi‑Task Integration Structure
The table below compares the two approaches:
| | Pipeline | End-to-End |
|---|---|---|
| Sample annotation | Annotations for the text, image, and video models can be decoupled. | Each sample must be annotated for all three modalities, increasing cost. |
| Feature interaction | Features are independent across modalities. | Features interact across modalities, improving detection performance. |
| Model deployment | Three sub-models; higher resource consumption but flexible. | Single model; lower deployment cost but less flexible. |
| Soft-porn recognition | Sequential; an error in any sub-model propagates. | All modalities are evaluated jointly. |
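Under the end-to-end scheme, the four binary heads can share a backbone and be trained with a weighted sum of per-task binary cross-entropy losses. The sketch below shows only that combination step; the task names and unit weights are illustrative, not iQIYI's actual configuration.

```python
import numpy as np

def multitask_bce(logits, labels, weights):
    """Weighted sum of per-task binary cross-entropy losses.
    logits, labels, weights: dicts keyed by task name."""
    eps = 1e-8
    total = 0.0
    for task, z in logits.items():
        p = 1.0 / (1.0 + np.exp(-z))  # sigmoid probability for this head
        y = labels[task]
        total += weights[task] * -(y * np.log(p + eps)
                                   + (1 - y) * np.log(1 - p + eps))
    return total

# Four binary heads: text, image, video, and the overall feed soft-porn label.
logits = {"text": 2.0, "image": -1.5, "video": 0.4, "feed": 1.1}
labels = {"text": 1, "image": 0, "video": 1, "feed": 1}
loss = multitask_bce(logits, labels, {t: 1.0 for t in logits})
print(loss > 0)  # True
```

Because the heads share features, a gradient from any one task updates the common backbone, which is the feature-interaction benefit the table above attributes to the end-to-end design.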
Key Takeaways
• Early data quality control is crucial; continuous QA improves annotation reliability.
• Data augmentation and label correction significantly boost performance for the scarce soft-porn class.
• Class imbalance is mitigated with focal loss and higher weighting for positive samples.
• Model architecture improvements are often outweighed by better data and feature engineering.
• Understanding data distribution and conducting thorough bad-case analysis guide effective model upgrades.
• Documenting experiments (purpose, conclusion, next steps) accelerates iteration.
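The focal-loss-plus-positive-weighting recipe mentioned above can be sketched as follows; the alpha and gamma values are common illustrative defaults, not the tuned production settings.

```python
import numpy as np

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Alpha-weighted binary focal loss.
    p: predicted positive-class probabilities; y: labels in {0, 1}.
    alpha > 0.5 up-weights the scarce positive (soft-porn) class;
    gamma > 0 down-weights easy, well-classified samples."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    pt = np.where(y == 1, p, 1.0 - p)         # probability of the true class
    a = np.where(y == 1, alpha, 1.0 - alpha)  # class weighting
    return float(np.mean(-a * (1.0 - pt) ** gamma * np.log(pt)))

# A confident correct positive contributes far less than an uncertain one.
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.55]), np.array([1]))
print(easy < hard)  # True
```

With gamma set to 0 and alpha to 0.5, the expression reduces to ordinary (halved) binary cross-entropy, so the two knobs can be tuned independently: alpha against class imbalance, gamma against easy-sample domination.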
Future Optimizations
• Upgrade the text model beyond the current Convolutional Bi-LSTM and explore richer token embeddings.
• Leverage video metadata (uploader, duration, description, category) for additional signals.
• Expand the multimodal multi-task model with more video-labeled samples and a more effective feature-fusion module.
References
Joulin A. et al., "Bag of Tricks for Efficient Text Classification," EACL 2017.
Kim Y., "Convolutional Neural Networks for Sentence Classification," EMNLP 2014.
Liu P. et al., "Recurrent Neural Network for Text Classification with Multi-Task Learning," IJCAI 2016.
He K. et al., "Deep Residual Learning for Image Recognition," CVPR 2016.
Chollet F., "Xception: Deep Learning with Depthwise Separable Convolutions," CVPR 2017.
Hu J. et al., "Squeeze-and-Excitation Networks," CVPR 2018.
Woo S. et al., "CBAM: Convolutional Block Attention Module," ECCV 2018.
Arandjelović R. et al., "NetVLAD: CNN Architecture for Weakly Supervised Place Recognition," CVPR 2016.
Lin R. et al., "NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification," ECCV Workshops 2018.
iQIYI Technical Product Team