
UC Information Flow Video Tag Recognition: System Architecture and Multi‑Modal Algorithms

This article presents a comprehensive overview of UC's information‑flow video tag recognition technology, detailing tag usage scenarios, the end‑to‑end system architecture, multi‑modal feature extraction, advanced deep‑learning models such as NeXtVLAD, behavior and person tagging methods, and future research directions.

DataFunTalk

The presentation by Snow Meng (Alibaba) introduces UC information‑flow video tag recognition, explaining the overall architecture and multi‑modal methods that enable machines to understand key information in massive video streams.

Tag usage scenarios include building user profiles, assisting content recommendation, and generating vertical channels, allowing the system to infer user interests and diversify results.

The tag recognition system architecture covers the tag taxonomy (entity vs. semantic tags), continuous tag updates mined from search logs, crawlers, competitor products, and trending events, and the management of tag relationships such as synonyms and associations.
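Managing synonym relationships typically means mapping raw tag variants onto a canonical form before they are stored or matched. A minimal sketch of that step (the synonym table and tag names here are illustrative, not UC's actual taxonomy):

```python
def canonicalize(tags, synonyms):
    """Map raw tags to their canonical forms via a synonym table,
    deduplicating while preserving order.
    tags: list of raw tag strings; synonyms: dict variant -> canonical."""
    seen, out = set(), []
    for tag in tags:
        canon = synonyms.get(tag, tag)
        if canon not in seen:
            seen.add(canon)
            out.append(canon)
    return out
```

In a production taxonomy the same table can also carry association edges (related tags), which post‑processing uses for tag expansion.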

The data pipeline is driven by a daemon that consumes Kafka streams of new or updated content and triggers a prediction service. The service downloads each video, extracts frames and text, computes image, video, and audio features, feeds them into multiple NeXtVLAD models, performs post‑processing (fusion, deduplication, and expansion), and writes the verified tags to HBase after human audit.
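The per‑message flow can be sketched as follows. This is a hypothetical skeleton: the message fields (`doc_id`, `video_url`) and the `predict_tags` / `write_to_store` callables stand in for UC's real prediction service and HBase writer.

```python
import json

def handle_message(msg, predict_tags, write_to_store):
    """Process one Kafka-style message for a new/updated video:
    run tag prediction, fuse duplicate tags, and persist the result."""
    doc = json.loads(msg)
    raw_tags = predict_tags(doc["video_url"])    # model output: (tag, score) pairs
    # Post-processing sketch: deduplicate, keeping the highest score per tag.
    fused = {}
    for tag, score in raw_tags:
        fused[tag] = max(score, fused.get(tag, 0.0))
    final = sorted(fused.items(), key=lambda kv: -kv[1])
    write_to_store(doc["doc_id"], final)         # e.g. an HBase row keyed by doc_id
    return final
```

In the described system this function would sit inside the daemon's consume loop, with human audit gating what ultimately reaches HBase.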

Tag recognition algorithms center on the NeXtVLAD model, an improvement over NetVLAD, which aggregates multi‑frame features via NetVLAD‑style cluster‑based pooling, applies Context Gating and Squeeze‑and‑Excitation gating, and uses a mixture‑of‑experts (MoE) classifier. Knowledge distillation with an on‑the‑fly naive ensemble trains three NeXtVLAD sub‑models as students under a teacher model.
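The core idea of NetVLAD‑style pooling is to softly assign each frame feature to a set of learned cluster centers and aggregate the residuals per cluster. A minimal numpy sketch (centers would be learned parameters in the real model, and NeXtVLAD adds grouping and gating on top of this):

```python
import numpy as np

def netvlad_pool(frames, centers):
    """NetVLAD-style pooling of frame-level features into one video-level
    descriptor. frames: (N, D) frame features; centers: (K, D) cluster
    centers. Returns a flattened, L2-normalized (K*D,) vector."""
    # Soft assignment: softmax over frame-to-center similarities.
    sims = frames @ centers.T                       # (N, K)
    sims -= sims.max(axis=1, keepdims=True)         # numerical stability
    assign = np.exp(sims)
    assign /= assign.sum(axis=1, keepdims=True)     # rows sum to 1
    # Aggregate residuals: V[k] = sum_i assign[i, k] * (frames[i] - centers[k])
    residuals = frames[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)    # (K, D)
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

The pooled descriptor is what the Context Gating, SE gating, and MoE classifier layers then operate on.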

For behavior tags, optical flow is extracted and combined with 3D ConvNets (e.g., I3D) to capture spatio‑temporal dynamics, enabling recognition of actions such as jumping or falling.
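To illustrate why a motion stream helps, here is a toy stand‑in for the flow branch: per‑pixel temporal differences as a crude motion signal. Real systems compute dense optical flow (e.g., TV‑L1 or Farneback) and stack it as extra input channels for a 3D ConvNet such as I3D; this sketch only shows the shape of the idea.

```python
import numpy as np

def motion_energy(frames):
    """Crude motion signal: mean absolute per-pixel change between
    consecutive frames. frames: (T, H, W) grayscale video array.
    Returns a (T-1,) vector of mean motion magnitude per transition."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))  # (T-1, H, W)
    return diffs.mean(axis=(1, 2))
```

A static clip yields zeros everywhere, while an action like jumping produces pronounced peaks, which is the temporal structure the 3D ConvNet learns to exploit.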

Person tags are generated by detecting faces with MTCNN, extracting facial embeddings with InsightFace (trained with ArcFace loss), and matching against a large face library (Pangu). Age and gender are predicted via a separate classifier, and multi‑frame fusion improves robustness.
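The matching step against a face library reduces to nearest‑neighbor search over embeddings. A minimal sketch assuming L2‑normalized embeddings (as ArcFace‑style models produce), with a hypothetical similarity threshold; the real Pangu library would use an indexed search at scale:

```python
import numpy as np

def match_face(embedding, library, names, threshold=0.5):
    """Match one L2-normalized face embedding against a library of
    L2-normalized embeddings by cosine similarity.
    library: (M, D) array; names: list of M identity names.
    Returns (name, similarity), with name=None below the threshold."""
    sims = library @ embedding          # dot product == cosine for unit vectors
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return names[best], float(sims[best])
    return None, float(sims[best])
```

Multi‑frame fusion then aggregates per‑frame matches (e.g., majority vote over detected frames) to suppress spurious single‑frame errors.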

The future work section outlines plans to incorporate additional features like scene descriptors, object detection, OCR, and speech, as well as to leverage richer tag relationships for better labeling accuracy.

References to relevant papers are provided, and readers are invited to join the DataFun community for further AI and big‑data knowledge sharing.

computer vision · deep learning · recommendation systems · multimodal learning · video tagging
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
