
Residual Dense Network with Feature Fusion for Multimodal Video Person Identification (iQIYI-VID-2019)

The authors introduce a feature‑fusion pipeline and a Residual Dense Net that leverages multi‑frame face embeddings to identify persons in iQIYI‑VID‑2019 videos, achieving 0.9035 mAP (second place) with only ≈0.5 GFLOPs and processing the full test set in minutes.

iQIYI Technical Product Team

Abstract: This work proposes a novel feature-fusion strategy and a residual fully-connected network (Residual Dense Net) for video person identification. Using only face features extracted from the iQIYI-VID-2019 dataset, the method achieves an mAP of 0.9035, ranking second in the ACM MM Multi-Media Challenge. The model is lightweight (≈0.5 GFLOPs) and processes the whole test set in 2–5 minutes.

Introduction: Video dominates online consumption, and person identification in video is challenging due to variations in pose, face clarity, clothing, and scene. The iQIYI-VID-2019 dataset is the largest multimodal person-identification dataset to date, containing more than 200 hours of video, over 200,000 clips, and 10,034 identities. Official pre-extracted features (head, body, audio, face) are provided; face features are obtained by SSH face detection followed by ArcFace embedding.

The official baseline fuses multimodal features using NetVLAD and self‑attention but does not exploit temporal information between frames.

Proposed Approach: The team leverages multi‑frame information fusion to enhance robustness and designs a Residual Dense Net that adds shortcut connections to a fully‑connected architecture, reducing over‑fitting and improving feature utilization.

Algorithm Overview: The processing pipeline (illustrated in Figure 1 of the original document) includes feature extraction, multi‑frame fusion, residual dense network training, and ensemble inference.

1. Pre-experiment: The original training and validation sets were merged and re-split 9:1. Validation accuracy was then measured for each feature type in isolation: head ≈60%, body ≈35%, audio ≈25%, face ≈81–88%. Consequently, face features were selected as the primary modality, with body features used only as a fallback.

Three observations support this choice:

Face features are the most accurate, but are missing in some videos.

When the face is missing, head features, which largely overlap with face features, provide little additional signal as an auxiliary modality.

Audio features are heavily affected by background noise and voice-over changes.
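The 9:1 re-split described above can be sketched as follows. The video count and random seed are illustrative assumptions; the official iQIYI-VID-2019 data loaders are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def resplit(num_videos, train_frac=0.9, rng=rng):
    """Shuffle all video indices, then re-split into train/val (9:1)."""
    idx = rng.permutation(num_videos)
    cut = int(num_videos * train_frac)
    return idx[:cut], idx[cut:]

# Hypothetical corpus of 1,000 videos -> 900 train / 100 validation
train_idx, val_idx = resplit(1000)
```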

2. Model Design: A residual fully‑connected model is built by inserting shortcut connections between the first three layers of a deep FC network, together with dropout and batch‑normalization. This architecture captures both shallow and deep information while mitigating over‑fitting.
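A minimal numpy sketch of one such residual fully-connected block is shown below. The 512-dimensional input (a typical ArcFace embedding size), the weight initialization, and the omission of dropout and batch normalization are all simplifying assumptions, not the authors' exact architecture; the point is the shortcut connection that lets shallow features pass through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512  # assumed face-embedding dimension

W1 = rng.standard_normal((D, D)) * 0.02
W2 = rng.standard_normal((D, D)) * 0.02

def relu(x):
    return np.maximum(x, 0.0)

def residual_fc_block(x):
    """y = ReLU(x @ W1) @ W2 + x — the identity shortcut combines
    shallow and deep information and eases optimization."""
    return relu(x @ W1) @ W2 + x

x = rng.standard_normal((4, D))   # a batch of 4 face embeddings
y = residual_fc_block(x)          # same shape as the input
```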

3. Data Processing: Face features are divided into three subsets based on quality scores (all, >20, >40). Each subset undergoes data augmentation: frames of each video are shuffled and merged with 1–4 other frames to create new fused features, which are then combined with the original features. This reduces intra‑class variance caused by different scenes, makeup, lighting, and increases the number of frames for low‑sample classes.
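The fusion augmentation can be sketched like this: each frame embedding is averaged with 1–4 other randomly chosen frames of the same video, and the fused features are appended to the originals. The input array and simple-average fusion rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_frames(frames, rng=rng):
    """frames: (num_frames, dim) face features of one video."""
    n, d = frames.shape
    fused = []
    for i in range(n):
        k = int(rng.integers(1, 5))               # merge with 1-4 other frames
        others = rng.choice(np.delete(np.arange(n), i),
                            size=min(k, n - 1), replace=False)
        group = np.vstack([frames[i:i + 1], frames[others]])
        fused.append(group.mean(axis=0))          # assumed: average fusion
    # the new fused features are combined with the originals
    return np.vstack([frames, np.array(fused)])

frames = rng.standard_normal((6, 512))
aug = fuse_frames(frames)   # 6 originals + 6 fused features
```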

4. Training Procedure: Three separate models are trained on the three subsets. Training uses the Adam optimizer, cross‑entropy loss, an initial learning rate of 0.01 decayed by 0.7 each epoch, for a total of eight epochs.
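The stated schedule, 0.01 multiplied by 0.7 after each of the eight epochs, works out as:

```python
# Learning-rate schedule: base lr 0.01, decayed by a factor of 0.7 per epoch.
def learning_rate(epoch, base_lr=0.01, gamma=0.7):
    return base_lr * gamma ** epoch

schedule = [learning_rate(e) for e in range(8)]
# epoch 0: 0.01, epoch 1: 0.007, ..., epoch 7: ~0.00082
```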

5. Testing Procedure: Test data are augmented in the same way as training data. For each video, the top 70% of frames ranked by face-quality score are selected, class predictions are averaged across those frames, and the top-100 classes per video are output.
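A sketch of this test-time procedure, assuming hypothetical per-frame probability and quality arrays (the challenge has 10,034 identity classes and asks for the top 100 per video):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_video(probs, quality, keep_frac=0.7, top_k=100):
    """probs: (frames, classes) per-frame scores; quality: (frames,)."""
    n = probs.shape[0]
    keep = max(1, int(round(n * keep_frac)))
    best = np.argsort(quality)[::-1][:keep]       # top 70% by quality score
    mean_probs = probs[best].mean(axis=0)         # average across kept frames
    return np.argsort(mean_probs)[::-1][:top_k]   # top-k class ids

probs = rng.random((10, 10034))    # 10 frames x 10,034 identity scores
quality = rng.random(10)
top_classes = predict_video(probs, quality)
```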

Results and Prospects: Single‑model mAP = 0.875. Ensemble of the three subset models = 0.889. Adding feature fusion and data augmentation = 0.9019. Fusion of the residual model with a standard FC network = 0.9035, the best score among teams using only official features. The proposed feature‑residual classification network achieves high accuracy with modest computational cost and can be readily applied to other modalities.

Future directions include incorporating textual metadata (comments, titles) as auxiliary features, inserting attention modules with sigmoid‑activated branches, applying label smoothing or MixUp to alleviate class imbalance, and further optimizing feature extraction pipelines.

Tags: multimodal learning, feature fusion, person recognition, iQIYI-VID-2019, residual dense network, video identification