Watchdog Team's TOP1 Solution for the iQIYI & ACMMM2019 Multimodal Video Person Recognition Challenge
The Watchdog team won TOP1 in the iQIYI & ACMMM2019 multimodal video person recognition challenge using pre‑extracted multimodal features, a 2048‑dim classifier with BCE loss, re‑ranking, DALI‑accelerated re‑detection, a fine‑tuned InsightFace model, and multi‑model ensembling, achieving ~91 % test accuracy.
The iQIYI & ACMMM2019 multimodal video person recognition challenge has concluded, and the Watchdog team achieved the overall TOP1 rank. This article summarizes the team's experience and technical solutions.
Initial pure‑face pipeline: The team first tried the classic pipeline of extracting frames, detecting and aligning faces, extracting face features, and classifying with an MLP. However, the competition required the submission to run inside a Docker container in under 8 hours, making the full pipeline infeasible even after reducing the frame rate.
Official pre‑extracted features: The organizers provided four types of features for each video segment:
face: n × 512 (n = number of faces)
head: n × 512
audio: 512
body: n × 512
Using these features, the team only needed to design a classifier.
Dataset split: The original training and validation sets were merged and re‑split into 10 folds (each validation fold = 10 % of the data). Experiments were first run on the validation folds; only when a clear improvement was observed did the team submit results to the test server.
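The split described above can be sketched in plain Python. This is a minimal illustration, not the team's actual code; the `video_ids` list is a placeholder for the merged train + validation segment IDs.

```python
import random

def make_folds(ids, n_splits=10, seed=0):
    """Shuffle ids and deal them into n_splits roughly equal folds."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_splits] for i in range(n_splits)]

# Placeholder IDs standing in for the merged train + validation videos.
video_ids = [f"vid_{i}" for i in range(100)]
folds = make_folds(video_ids)

# Fold 0 serves as the 10 % validation split; the rest is training data.
val = set(folds[0])
train = [v for v in video_ids if v not in val]
```

Each fold in turn can play the validation role, which also yields the per-fold models that are ensembled later.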
Feature classification :
Pure face scheme: A three‑layer fully‑connected network (each layer 1024 units) with BatchNorm and ReLU, followed by a softmax loss, was trained on the 512‑dim face features. This achieved roughly 87 % validation score.
Multimodal feature fusion: A separate head‑feature classifier reached 52 % validation score; a weighted ensemble of the two gave about a 1 % boost. The team then concatenated the 512‑dim face feature with the averaged head, audio, and body features (each 512‑dim) into a 2048‑dim vector and modified the network as follows:
All FC layers expanded to 2048 units.
ReLU replaced by PReLU.
Softmax loss replaced by BCE loss over the 10 034 classes (inspired by the Humpback Whale Identification competition).
Input normalization and extensive data augmentation (random drop, random noise, etc.).
Shortcut connections and attention modules were also tried but discarded due to negligible impact. These changes raised the validation score to 91 %.
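The fused model might look like the following sketch: the depth, layer order, and use of BatchNorm are assumptions extrapolated from the pure-face network, with only the changes the text names (2048 units, PReLU, BCE loss) taken from the source.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """2048-unit FC stack with PReLU, trained with BCE over 10 034 classes."""
    def __init__(self, in_dim=2048, hidden=2048, n_classes=10034):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.PReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.PReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.PReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Face (512-dim) concatenated with mean-pooled head/audio/body (512-dim each).
face, head, audio, body = (torch.randn(4, 512) for _ in range(4))
x = torch.cat([face, head, audio, body], dim=1)

model = FusionClassifier()
logits = model(x)
# BCE treats each of the 10 034 classes as an independent binary target.
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(4, 10034))
```

Framing the problem as 10 034 independent binary decisions (BCE) rather than one softmax lets the model express "none of the above" for distractor videos, which is the usual rationale borrowed from the Humpback Whale competition.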
Re‑ranking: Inspired by re‑ranking ideas, the team added an exponential boost to the top‑k (k ≤ 10) class scores of each video:

s(j) ← s(j) + exp(−rank(j)), for rank(j) ≤ 10

This yielded 92.2 % on validation and 87.3 % on the test set.
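The boost above is straightforward to implement; here is a pure-Python sketch with ranks starting at 1 for the highest-scoring class (an assumption, since the text does not say whether ranks start at 0 or 1).

```python
import math

def rerank(scores, k=10):
    """Add exp(-rank) to the top-k class scores, rank starting at 1."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    boosted = list(scores)
    for rank, j in enumerate(order[:k], start=1):
        boosted[j] += math.exp(-rank)
    return boosted

scores = [0.1, 0.9, 0.5, 0.3]
new = rerank(scores, k=2)
# Only the two highest scores receive a boost; the others are unchanged.
```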
Re‑detection and alignment: To meet the runtime limit, the team replaced the default DataLoader with NVIDIA DALI, reaching ~380 fps image loading (over 2.7 million images in the test set). They took the provided detection boxes, enlarged them, cropped the corresponding regions, resized them to 192 × 192, and applied RetinaFace for detection and alignment (~180 fps). Videos with a large number of faces were down‑sampled to a cap of 30 faces per video, reducing the processed faces from ~2 million to ~1.5 million with minimal impact on accuracy. Re‑detection reduced the total face count by ~3.37 % but improved overall face quality.
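The per-video cap can be implemented as a uniform subsample, for instance as below; the text only states the threshold of 30, so uniform (rather than confidence-based) selection is an assumption.

```python
def downsample_faces(faces, max_faces=30):
    """Uniformly subsample face crops from videos with many detections."""
    n = len(faces)
    if n <= max_faces:
        return list(faces)
    step = n / max_faces  # fractional stride keeps the sample spread evenly
    return [faces[int(i * step)] for i in range(max_faces)]

# A video with 200 detected faces is reduced to 30 evenly spaced ones.
kept = downsample_faces(list(range(200)), max_faces=30)
```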
Feature extraction & classification (second stage): The newly aligned faces were used to fine‑tune InsightFace for feature extraction. The resulting features were concatenated with the previous 2048‑dim vector to form a 2560‑dim representation, which was fed to the classifier. This achieved 93.2 % validation score and ~89 % test score.
Model fusion: Multiple models trained on the cross‑validation splits were ensembled. A 3‑model ensemble reached 90.62 % on the test set, while a 6‑model ensemble achieved 91.13 %. Fusion of the 2560‑dim and 2048‑dim results gave a final score of 91.14 %, securing the TOP1 position.
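Score-level fusion of this kind is typically a (weighted) average of each model's per-class scores for a video; the uniform weights below are an illustration, as the text does not specify the weighting.

```python
def ensemble(score_lists, weights=None):
    """Weighted average of per-model score vectors for one video."""
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    n = len(score_lists[0])
    return [sum(w * s[j] for w, s in zip(weights, score_lists))
            for j in range(n)]

# Three models' scores for two classes, fused with uniform weights.
fused = ensemble([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```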
Postscript: The team notes several avenues for future work, such as more advanced multimodal fusion methods (e.g., CentralNet, LMF) and leveraging detection confidence scores as weighted contributions. They also reflect on time‑management challenges during the competition.
iQIYI Technical Product Team