Experience Report of the 2018 iQIYI Multimodal Video Person Identification Challenge (WitcheR Team)
The WitcheR team won the 2018 iQIYI multimodal video person identification challenge with a fast pipeline combining a custom face-and-keypoint detector, ArcFace-trained face embeddings, scene classification, and a three-layer MLP with several training tricks, reaching a final mAP of 88.6%. The experience demonstrates the value of rapid idea validation, and the code has been open-sourced for future challenges.
In 2018 iQIYI partnered with the China Conference on Pattern Recognition and Computer Vision (PRCV2018) to launch the first multimodal video person identification challenge. Among 397 participating teams from top universities and companies worldwide, the WitcheR team won the championship.
The team consisted of three members: Jiankang Deng (IBUG), JackYu, and the author. Collaboration was informal, with no strict division of labor: ideas were exchanged freely, and the author consolidated the implementation and validation.
The challenge provided the largest video person dataset to date, iQIYI-VID, containing 500,000 video clips (1–30 s each) of 5,000 celebrities (≈4,934 identities after cleaning). The dataset is both large and clean, making it an excellent benchmark for model performance.
Potential cues for video person retrieval included face recognition, head detection, person re-identification (ReID), scene classification, voice, and pose models. The team ultimately used only face recognition and scene classification: head-detection training data were lacking, ReID models generalized poorly, the team was unfamiliar with voice and pose models, and a scene model could compensate when faces were missing.
Initial experiments evaluated the frame sampling rate, the choice of detector (MTCNN vs. a custom detector), methods for aggregating face features into a video feature, and ways to improve retrieval once video features were obtained.
The team deployed a custom one‑stage detector that simultaneously detects faces and keypoints for alignment, achieving roughly a 1‑point mAP gain over MTCNN on the Phase‑1 validation set. Face recognition models were trained on MS1M‑ArcFace (emore) combined with Glint‑Asia data, using the ArcFace loss recently accepted as an oral paper at CVPR 2019.
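The core idea of the ArcFace loss mentioned above is an additive angular margin applied to the ground-truth class logit before softmax cross-entropy. A minimal sketch (not the team's training code; `s` and `m` use the paper's typical defaults):

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weight, labels, s=64.0, m=0.5):
    """ArcFace-style logits: add an angular margin m to the angle between
    each embedding and its ground-truth class center, then scale by s.
    The result feeds into an ordinary softmax cross-entropy loss."""
    # Cosine similarity between L2-normalized embeddings and class weights.
    cos = F.linear(F.normalize(embeddings), F.normalize(weight)).clamp(-1.0, 1.0)
    theta = torch.acos(cos)
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    # The margin is applied only to the ground-truth class.
    logits = torch.where(target, torch.cos(theta + m), cos)
    return s * logits
```

Training then minimizes `F.cross_entropy(arcface_logits(emb, W, y), y)`; at test time only the normalized embeddings are used for retrieval.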
Video features were extracted by sampling one frame in every three (~8 FPS) and averaging the per-frame features. Additional experiments included removing blurry faces based on feature norm (improved results), flip augmentation (no gain), color jitter (degraded results), and pose-based grouping (no gain). The first submission achieved a test mAP of 79.8, briefly ranking first while fewer than ten teams had submitted.
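The aggregation step above can be sketched as follows. The percentile threshold for the norm-based blur filter is an assumption for illustration; the report does not state the actual cutoff:

```python
import numpy as np

def aggregate_video_feature(frame_feats, norm_percentile=20):
    """Average per-frame face embeddings into a single video embedding.
    Frames whose feature norm falls in the lowest percentile are dropped
    first: low-norm embeddings tend to come from blurry or low-quality
    faces (the norm-based filtering the team found helpful)."""
    feats = np.asarray(frame_feats, dtype=np.float32)   # (T, 512)
    norms = np.linalg.norm(feats, axis=1)
    keep = norms >= np.percentile(norms, norm_percentile)
    if not keep.any():                                  # degenerate clip
        keep[:] = True
    video_feat = feats[keep].mean(axis=0)
    return video_feat / np.linalg.norm(video_feat)      # unit-normalize
```

The unit-normalized video embedding can then be compared by cosine similarity for retrieval, or fed to the MLP classifier described next.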
To further boost performance, a simple three‑layer MLP was trained on the 512‑dimensional video embeddings, raising mAP to 82.9. Incorporating seven tricks—three‑layer depth, 1024‑unit width, PReLU activation, shortcut connections, batch normalization (without dropout), large batch size (4096 per GPU), and a fixed‑gamma BN before the softmax—raised mAP to 86.4.
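A possible shape for that MLP, assembling the listed tricks (1024-unit width, PReLU, a shortcut connection, batch norm without dropout, and a frozen-gamma BN before the classifier); exact layer placement is a guess, since only the tricks themselves are described:

```python
import torch
import torch.nn as nn

class VideoMLP(nn.Module):
    """Three-layer MLP over 512-d video embeddings with the tricks
    described in the text. num_classes defaults to the ~4,934 cleaned
    identities of the dataset."""
    def __init__(self, in_dim=512, width=1024, num_classes=4934):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, width)
        self.bn1 = nn.BatchNorm1d(width)
        self.act1 = nn.PReLU(width)
        self.fc2 = nn.Linear(width, width)
        self.bn2 = nn.BatchNorm1d(width)
        self.act2 = nn.PReLU(width)
        self.bn_out = nn.BatchNorm1d(width)
        nn.init.ones_(self.bn_out.weight)
        self.bn_out.weight.requires_grad_(False)  # fixed-gamma BN
        self.fc3 = nn.Linear(width, num_classes)

    def forward(self, x):
        h = self.act1(self.bn1(self.fc1(x)))
        h = h + self.act2(self.bn2(self.fc2(h)))  # shortcut connection
        return self.fc3(self.bn_out(h))
```

The large-batch trick (4096 per GPU) applies to how this model is trained rather than to its architecture, so it does not appear in the sketch.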
Model fusion was then applied: four face recognition models trained with different random seeds were each passed through the same MLP, and their predictions were weighted, resulting in an mAP of 88.2.
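The fusion step is a weighted average of per-model class-probability matrices. A minimal sketch (equal weights by default; the team's actual weights are not public):

```python
import numpy as np

def fuse_predictions(prob_list, weights=None):
    """Weighted average of per-model prediction matrices, each of shape
    (num_videos, num_ids). Returns the fused (num_videos, num_ids) matrix."""
    probs = np.stack(prob_list)                       # (M, N, C)
    if weights is None:
        weights = np.ones(len(prob_list))
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                   # normalize weights
    return np.tensordot(w, probs, axes=1)             # (N, C)
```

Because the four face models differ only in random seed, their errors are partly decorrelated, which is what makes the averaged prediction stronger than any single model.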
For videos lacking detectable faces, a scene classification model based on a ResNet‑152 pretrained on ImageNet‑11k + Place365 was fine‑tuned on extracted frames, pushing the final mAP to 88.6.
The results are summarized in the accompanying figure (see original source).
During the final days of the competition, the second‑place team surged close to the top score, creating tension. Nevertheless, the WitcheR team maintained their lead and secured first place overall.
Key lessons include the importance of rapid idea validation under limited resources. Efficient pipelines were built by: (1) hashing video IDs to split detection across multiple GPUs and caching intermediate results; (2) storing per‑model video features for reuse in MLP training; (3) serializing trained MLP models and prediction probabilities for ensemble fusion.
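The sharding and caching pattern behind points (1)–(3) can be sketched as below. The GPU count and JSON cache format are illustrative assumptions; the report does not specify either:

```python
import hashlib
import json
import os

def shard_for(video_id, num_gpus=8):
    """Deterministically assign a video to a GPU shard by hashing its ID,
    so detection runs in parallel and each shard can be resumed on its own."""
    digest = hashlib.md5(video_id.encode()).hexdigest()
    return int(digest, 16) % num_gpus

def cached(path, compute):
    """Cache an intermediate result (detections, features, predictions)
    to disk so a re-run skips work that already finished."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

The same `cached` wrapper serves all three points: per-shard detections, per-model video features reused across MLP runs, and per-MLP prediction probabilities reused during ensemble fusion.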
Looking ahead, the 2019 iQIYI video person recognition challenge features an upgraded dataset (iQIYI‑VID‑2019) with ~5,000 new short‑video IDs, 10,000 celebrities, 200 hours of footage, and 200,000 clips, presenting a more realistic media scenario. The team plans to participate again, and all techniques and code have been open‑sourced in the InsightFace repository.
iQIYI Technical Product Team