
iQIYI Multimodal Person Recognition Competition: 91.14% Accuracy Achieved by BUPT Team

After a three‑month contest co‑hosted by iQIYI and ACM MM, 255 teams competed on the challenging iQIYI‑VID‑2019 multimodal dataset, and the BUPT Automation School team won with a 91.14% person‑recognition accuracy, advancing the field and enhancing iQIYI’s video recommendation and AI services.

iQIYI Technical Product Team

After three months of competition, the iQIYI‑sponsored multimodal person recognition contest, co‑organized with the ACM International Conference on Multimedia (ACM MM), concluded recently.

The event attracted 255 teams from top universities including Carnegie Mellon, UCL, Exeter, Tsinghua, and Peking University, as well as leading companies such as Baidu, ZTE, JD.com, Meitu, and NVIDIA. The top three positions were taken by teams from the Beijing University of Posts and Telecommunications (BUPT) Automation School, the BUPT Network Intelligent Center, and the Nanjing University Computer Science Department.

The champion team from the BUPT Automation School raised multimodal video person recognition accuracy to 91.14%, improving on the previous year's 88.65% by nearly 2.5 percentage points.

Globally, many institutions release video datasets to tackle recognition challenges. Examples include Oxford’s VoxCeleb2 (6,000+ speakers, 150,000 videos), CUHK‑SenseTime’s CSM dataset (1,218 identities, 127,000 videos), and Tel‑Aviv University’s YouTube‑Faces DB (3,425 clips, 1,595 identities).

For this competition, iQIYI provided a rigorously annotated, challenging multimodal dataset – iQIYI‑VID‑2019 – containing 10,000 celebrity identities, 200 hours of video, and 200,000 clips across four modalities: face, head, body, and voice. Participants could use the provided features without needing their own extraction resources, lowering hardware barriers and encouraging broader academic participation.

The winning BUPT team retrained an aligned face‑recognition model, applied data augmentation, and combined cues from all four modalities – face, head, body, and voice – to build a multimodal classification model, achieving the reported 91.14% accuracy on the difficult dataset.
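The article does not detail the team's fusion method, but the general idea of combining per‑modality evidence can be illustrated with a simple late‑fusion sketch. Everything below is a hypothetical illustration, not the BUPT system: `fuse_scores` assumes each modality has already produced a similarity score per candidate identity, and merges them with an optional weighting (e.g., trusting face more than voice).

```python
def fuse_scores(per_modality_scores, weights=None):
    """Late fusion: weighted average of per-identity scores across modalities.

    per_modality_scores: dict mapping modality name -> list of scores,
        one score per candidate identity (all lists the same length).
    weights: optional dict mapping modality name -> float; defaults to
        equal weighting. Both names are illustrative, not from the article.
    Returns the index of the identity with the highest fused score.
    """
    if weights is None:
        weights = {m: 1.0 for m in per_modality_scores}
    n = len(next(iter(per_modality_scores.values())))
    total = sum(weights[m] for m in per_modality_scores)
    fused = [0.0] * n
    for modality, scores in per_modality_scores.items():
        for i, s in enumerate(scores):
            fused[i] += weights[modality] * s / total
    return max(range(n), key=lambda i: fused[i])


# Toy example with two modalities and three candidate identities:
scores = {
    "face":  [0.9, 0.2, 0.1],
    "voice": [0.3, 0.6, 0.1],
}
print(fuse_scores(scores))  # identity 0 wins on the equal-weight average
```

In practice, systems of this kind often learn the fusion (e.g., a classifier over concatenated embeddings) rather than hand‑tuning weights, but the score‑level view above shows why adding modalities helps: a weak face signal in low‑quality footage can be rescued by voice or body cues.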

Higher multimodal recognition accuracy enables iQIYI to deliver better video consumption experiences, such as more precise recommendations for short‑form, UGC, or low‑quality footage, enhanced AI Radar interactions, improved HomeAI voice‑assistant experiences, and more accurate video editing for AIWorks and iQIYI’s content creation pipelines.

Liu Wenfeng, iQIYI’s CTO and head of the Infrastructure and Intelligent Content Distribution Business Group, emphasized that continuous breakthroughs in multimodal person recognition not only add value to iQIYI’s entertainment ecosystem but also advance research, technology transfer, and talent cultivation. iQIYI plans to keep collaborating with domestic and international academic institutions and industry leaders to push frontier technologies forward.

Tags: deep learning, Computer Vision, AI competition, dataset, iQIYI, Multimodal Recognition, accuracy
Written by

iQIYI Technical Product Team

The technical product team of iQIYI