
iQIYI 2019 Multimodal Video Person Recognition Competition Report by Zheey Team

The Zheey team from Beijing University of Posts and Telecommunications tackled the iQIYI 2019 Multimodal Video Person Recognition Challenge with a three‑layer MLP trained on the official face features, improving a baseline test mAP of 0.8742 to 0.8949 through model fusion, face‑quality filtering, and fine‑tuning. The team ultimately ranked sixth and open‑sourced its code.

iQIYI Technical Product Team

The Zheey team consists of members from Beijing University of Posts and Telecommunications. Team leader Wang Wenzhe is a first‑year graduate student in the School of Computer Science, focusing on multimedia content understanding and data mining. The other members are undergraduates preparing for graduate study or interning in the lab.

The team entered the competition primarily for learning purposes, aiming to deepen their understanding of cutting‑edge algorithms and knowledge in multimedia content understanding.

Division of labor: the team leader handled model building, optimization, and result submission, while other members contributed ideas through brainstorming.

The 2019 iQIYI Multimodal Video Person Recognition Challenge is essentially a person retrieval task: given a target person ID, retrieve the most likely video clips from a large test set and rank them by probability. The dataset contains about 200,000 video clips covering 10,034 target persons, with many difficult face samples and numerous distractor clips, requiring effective use of multimodal information.

Process: after analyzing the task and reviewing high‑scoring teams from the previous year, the team began downloading the dataset in early May and designing an initial model. For a quick baseline, they used the official face features and trained a three‑layer MLP, achieving an mAP of ~0.83 on the validation set. After merging training and validation data, they retrained the same model.

After familiarizing themselves with Docker, the baseline was submitted and achieved a score of 0.8742 on the test set, ranking first among submissions at that time, which motivated further effort.

Subsequent optimizations included multi‑model fusion, face‑quality filtering, model fine‑tuning, and weighting predictions by face‑quality scores. Each strategy was validated on the official training/validation split before the final models were retrained on the combined data. The final test score reached 0.8949, ranking 6th overall. The complete code is publicly available at https://github.com/zhezheey/iQIYI-VID.

Model Input: official face features.

Model Architecture:

Three‑layer perceptron.

Hidden layer width: 4096, activation: ReLU.

Batch normalization and dropout applied.

Maximum batch size: 32768 (GPU: TITAN Xp).
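The forward pass of such a model can be sketched as below. Note the assumptions: the 512‑dimensional input, the 10,035‑class output (10,034 persons plus one distractor class), the use of two 4,096‑wide hidden layers for a "three‑layer" MLP, and NumPy in place of the team's actual training framework are all illustrative choices, not details from the team's code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, params, train=False, drop_p=0.5, rng=None):
    """Forward pass of a three-layer MLP: each hidden layer applies a
    linear map, batch-norm-style normalization, ReLU, and (in training)
    dropout; the final layer produces class probabilities via softmax."""
    h = x
    for W, b in params[:-1]:
        h = h @ W + b
        # batch-norm-style normalization over the batch dimension
        h = (h - h.mean(0)) / (h.std(0) + 1e-5)
        h = relu(h)
        if train:
            mask = (rng.random(h.shape) > drop_p) / (1.0 - drop_p)
            h = h * mask
    W, b = params[-1]
    logits = h @ W + b
    # numerically stable softmax over the classes
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
# assumed dims: 512-d face feature in, 4096-wide hidden layers,
# 10,035 classes out (10,034 persons + 1 distractor class)
dims = [512, 4096, 4096, 10035]
params = [(rng.standard_normal((a, b), dtype=np.float32) * 0.01,
           np.zeros(b, dtype=np.float32))
          for a, b in zip(dims[:-1], dims[1:])]
probs = mlp_forward(rng.standard_normal((32, 512), dtype=np.float32), params)
```

At inference time, each row of `probs` is a distribution over the person IDs plus the distractor class for one face feature.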

Training Phase:

Distractor handling: all distractor clips were assigned to an extra class, 10035.

Multi‑model strategies: (1) train models on different face‑quality score intervals (0‑200, 20‑200, 40‑200, 0‑60); (2) merge the training and validation sets, shuffle with different random seeds, and re‑split at a 19:1 ratio.
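Strategy (2) above, shuffling the merged data with different seeds and re‑splitting 19:1, might look like the following sketch; the function and variable names are illustrative, not taken from the team's released code:

```python
import random

def resplit(samples, seed, ratio=19):
    """Shuffle the merged train+val samples with the given seed and
    re-split them into new train/val sets at a ratio:1 ratio."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    cut = len(data) * ratio // (ratio + 1)
    return data[:cut], data[cut:]

merged = list(range(2000))  # placeholder for the merged train+val clip IDs
# each seed yields a different 19:1 split, one per model in the ensemble
splits = [resplit(merged, seed) for seed in (0, 1, 2, 3)]
```

Training one model per split gives an ensemble whose members have seen slightly different data, which is what makes averaging their predictions useful later.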

Prediction Phase:

Face‑quality filtering and weighting: for clips with at least eight face features, keep only the top 50% of faces by quality score, then weight each face's predicted probabilities by its quality score.

Multi‑model fusion: average the predictions of all models.
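The two prediction‑phase steps above can be sketched as follows; the function names and toy data are illustrative assumptions, not the team's actual implementation:

```python
import numpy as np

def clip_prediction(face_probs, quality):
    """Aggregate per-face class probabilities into one clip-level
    prediction: if a clip has >= 8 faces, keep only the top 50% by
    quality score, then average the kept faces weighted by quality."""
    face_probs = np.asarray(face_probs)       # (num_faces, num_classes)
    quality = np.asarray(quality, dtype=float)
    if len(quality) >= 8:
        keep = np.argsort(quality)[-(len(quality) // 2):]
        face_probs, quality = face_probs[keep], quality[keep]
    w = quality / quality.sum()               # quality-score weights
    return w @ face_probs

def fuse_models(model_probs):
    """Multi-model fusion: simple average of the models' predictions."""
    return np.mean(model_probs, axis=0)

# toy clip: 10 faces, 5 classes, rising quality scores
face_probs = np.random.default_rng(1).dirichlet(np.ones(5), size=10)
quality = np.linspace(10, 100, 10)
p = clip_prediction(face_probs, quality)
```

Ranking clips by the target person's entry in the fused prediction then yields the retrieval list required by the task.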

Final Results (test mAP):

Baseline: 0.8742

Model fusion: 0.8861

Quality filtering: 0.8916

Model fine‑tuning: 0.8937

Weight addition: 0.8949

In the final week, the team experimented with additional strategies such as multimodal fusion, image‑classification feature extraction, and data augmentation, some of which yielded no noticeable improvement. A deeper MLP was also attempted, but its submission failed due to time constraints; later verification confirmed it could have achieved a higher score.

Competition Summary: The team gained confidence and a basic understanding of multimedia content understanding. Key lessons include focusing on core strategies, establishing reasonable validation protocols, leveraging common tricks like data augmentation and model fusion, analyzing ineffective strategies before discarding them, and ensuring balanced task allocation among members.

Future Expectations: The team hopes to incorporate more multimodal information in future competitions, learn from top‑ranking teams, and see improvements to the dataset, multimodal features, and competition workflow that foster further algorithmic innovation.

Tags: multimodal, competition, MLP, video retrieval, face features, person recognition