Multimodal Video Retrieval Solution for iQIYI Challenge: Feature Fusion and Model Ensemble
The ‘One Name’ team from Nanjing University achieved a MAP of 0.8986 and third place in the iQIYI multimodal video retrieval challenge by fusing official face embeddings with scene features, using channel‑attention‑based video feature fusion, a multimodal SE‑ResNeXt module, and a carefully partitioned model ensemble.
The "One Name" team, composed of four members from Nanjing University R&L Lab and supervised by Professor Huo Jing, achieved a MAP score of 0.8986 and ranked third in the iQIYI multimodal challenge. Their code is open‑source on GitHub.
The iQIYI multimodal challenge requires retrieving video segments that correspond to specific person categories. Performance is evaluated using Mean Average Precision (MAP).
The dataset contains nearly 200,000 video clips featuring 10,034 individuals, with each clip having a single primary person. Officially provided features include face, head, body, and voice embeddings, but these were not aligned or fine‑tuned.
Because re‑extracting and aligning face features within the Docker evaluation environment would be too time‑consuming, the team retained the official face embeddings and additionally extracted scene features as auxiliary information.
The overall solution is divided into three modules: video feature fusion, multimodal feature fusion, and model ensemble.
In the video feature fusion module, the team adopted the Channel Attention mechanism from DANet to measure similarity across frames, enhancing similar features and suppressing outliers. The network architecture follows the DANet design.
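The channel-attention idea can be sketched in a few lines of NumPy: frame embeddings attend to one another via their pairwise similarity, so mutually consistent frames are reinforced while outlier frames contribute less. The frame count, embedding size, and the blending scalar `gamma` below are illustrative assumptions, not the team's actual settings (in the real model `gamma` is a learned parameter).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))        # 8 frames, each a 64-d face embedding (dims assumed)

# frame-to-frame similarity matrix, in the spirit of DANet's channel attention
S = X @ X.T                          # (8, 8) pairwise dot-product similarity
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)    # row-wise softmax: each frame's attention over all frames

gamma = 0.5                          # learnable blending scalar in the real model (value assumed)
X_refined = X + gamma * (A @ X)      # similar frames reinforce each other; outliers are diluted
```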
To obtain an effective representation from multiple frames, they employed a CNN to extract portrait features from each frame and then used an aggregation module to learn a cumulative feature vector. Experiments showed this approach outperforms handcrafted methods such as quality‑score weighted averaging.
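A minimal sketch of the contrast between the handcrafted baseline and a learned aggregation: the baseline weights frames by their quality scores, while the learned variant scores each frame with a (here randomly initialized, in practice trained) projection and pools with softmax attention. All dimensions and the scoring vector are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 64))        # 5 frame embeddings, 64-d (dims assumed)
quality = rng.uniform(0.2, 1.0, size=5)  # per-frame quality scores

# handcrafted baseline: quality-score weighted average
baseline = (quality[:, None] * frames).sum(0) / quality.sum()

# learned aggregation (sketch): a scoring vector yields softmax attention weights
w_score = rng.normal(size=64)            # would be learned end-to-end in the real module
logits = frames @ w_score
att = np.exp(logits - logits.max())
att /= att.sum()                         # attention weights over frames
aggregated = att @ frames                # cumulative 64-d video representation
```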
For the loss function, they combined Additive Angular Margin Loss with Focal Loss. Since the angular margin loss normalizes features and discards magnitude information (which correlates with quality scores), the team concatenated the original quality scores with the normalized features before loss computation.
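The combined loss can be sketched as follows: the target class's angle gets an additive margin before the scaled softmax (the Additive Angular Margin idea), and the resulting cross-entropy is modulated by a focal term that down-weights easy examples. The scale `s`, margin `m`, and focal `gamma` below are the commonly used defaults, not necessarily the team's values, and this single-sample version omits the quality-score concatenation described above.

```python
import numpy as np

def arcface_focal_loss(features, weights, label, s=30.0, m=0.5, gamma=2.0):
    """Additive angular margin + focal modulation for one sample (sketch)."""
    f = features / np.linalg.norm(features)                    # L2-normalize the feature
    W = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(W @ f, -1.0, 1.0)                            # cosine to each class center
    logits = s * cos
    theta = np.arccos(cos[label])
    logits[label] = s * np.cos(theta + m)                      # push the target angle by margin m
    e = np.exp(logits - logits.max())
    p = e / e.sum()                                            # softmax probabilities
    pt = p[label]
    return -((1.0 - pt) ** gamma) * np.log(pt)                 # focal-modulated cross-entropy
```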
The multimodal feature fusion module extracts scene features with an SE‑ResNeXt backbone (a ResNeXt, i.e., a ResNet with grouped convolutions for higher cardinality, augmented with Squeeze‑and‑Excitation channel attention). During training, one frame is sampled per video and the model is trained for 20 epochs with a cosine‑annealing learning‑rate schedule; the same single‑frame sampling is used at test time.
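Cosine annealing decays the learning rate along a half-cosine from its maximum to its minimum over the training run. A minimal sketch for the 20-epoch schedule mentioned above (the maximum and minimum rates are assumed values, not reported by the team):

```python
import math

def cosine_annealing_lr(t, total_epochs, lr_max=0.1, lr_min=1e-5):
    """Cosine-annealed learning rate at epoch t (0-indexed); values assumed."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / total_epochs))

# learning rate at each of the 20 training epochs (plus the final endpoint)
schedule = [cosine_annealing_lr(t, 20) for t in range(21)]
```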
Scene features are reduced to 128 dimensions and concatenated with the face embeddings. The combined vector is fed into a three‑layer MLP to produce the final prediction.
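The fusion head can be sketched as a linear reduction of the scene feature to 128 dimensions, concatenation with the face embedding, and a three-layer MLP over the result. The face/scene input dimensions and hidden sizes below are assumptions; only the 128-d reduction, the concatenation, and the three-layer depth come from the write-up.

```python
import numpy as np

rng = np.random.default_rng(0)
face = rng.normal(size=512)       # official face embedding (dim assumed)
scene = rng.normal(size=2048)     # SE-ResNeXt scene feature (dim assumed)

def relu(x):
    return np.maximum(x, 0.0)

# reduce the scene feature to 128 dimensions
W_red = rng.normal(scale=0.02, size=(2048, 128))
scene_128 = scene @ W_red

fused = np.concatenate([face, scene_128])          # (640,) combined vector

# three-layer MLP head (hidden sizes illustrative); 10,034 person categories
W1 = rng.normal(scale=0.02, size=(640, 512))
W2 = rng.normal(scale=0.02, size=(512, 256))
W3 = rng.normal(scale=0.02, size=(256, 10034))
logits = relu(relu(fused @ W1) @ W2) @ W3          # per-person prediction scores
```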
For the model ensemble, the team avoided naive bagging: many person IDs appear in only one or two videos, so bootstrap sampling would drop them from some training sets entirely. Instead, they partitioned the feature space, discarding some dimensions (the white region in their illustration) and keeping others (the green region) to train multiple sub‑models, which are then ensembled.
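The key point of this scheme is that every sub-model still sees every sample (so no ID is lost), and diversity comes from each sub-model keeping a different subset of feature dimensions. A toy sketch with nearest-centroid sub-models and majority voting; the dimensionality, the three overlapping partitions, and the classifier choice are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 12
# hypothetical partition: each sub-model keeps a different overlapping 2/3 of the dims
keep_masks = [np.arange(0, 8), np.arange(2, 10), np.arange(4, 12)]

# toy data: two well-separated classes, all samples visible to every sub-model
X0 = rng.normal(0.0, 0.3, size=(20, D))
X1 = rng.normal(2.0, 0.3, size=(20, D))

def nearest_centroid_predict(x, c0, c1):
    """Assign x to whichever class centroid is closer."""
    return 1 if np.linalg.norm(x - c1) < np.linalg.norm(x - c0) else 0

def ensemble_predict(x):
    """Train one nearest-centroid sub-model per feature partition, then majority-vote."""
    votes = []
    for mask in keep_masks:
        c0, c1 = X0[:, mask].mean(0), X1[:, mask].mean(0)
        votes.append(nearest_centroid_predict(x[mask], c0, c1))
    return int(round(float(np.mean(votes))))
```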
In inference, videos lacking face data rely solely on scene predictions. For the lowest 1% quality‑score videos, multimodal and scene predictions are weighted to obtain the final result.
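The inference-time fallback logic reduces to a small decision rule. The blending weight `w` below is a hypothetical value; the write-up states only that multimodal and scene predictions are weighted for the lowest-quality 1% of videos, not the weight itself.

```python
def final_prediction(face_available, quality_percentile, p_multimodal, p_scene, w=0.7):
    """Fuse predictions at inference time (sketch; weight w is assumed)."""
    # videos with no detected face fall back to scene-only predictions
    if not face_available:
        return p_scene
    # lowest 1% quality-score videos: blend multimodal and scene predictions
    if quality_percentile <= 0.01:
        return w * p_multimodal + (1.0 - w) * p_scene
    # otherwise trust the multimodal prediction
    return p_multimodal
```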
Experience summary: participants should first clarify the task and conduct thorough data analysis. Rapid, high‑quality implementation of ideas is crucial; excessive early hyper‑parameter tuning can waste time, and promising ideas may not be realizable later in the competition.
iQIYI Technical Product Team