
Short Video Analysis for Local Life Scenarios: Techniques and Practices at Meituan

This article presents Meituan's AI‑driven short‑video analysis pipeline for local‑life scenarios, covering industry trends, multi‑label classification, intelligent cover selection, and video generation, and discusses model construction, label‑system expansion, continuous data iteration, and practical applications in restaurant and hotel domains.

DataFunSummit

With the rapid growth of video content driven by advances in hardware and software, video data now contains a wealth of information that can be leveraged for creation, moderation, editing, and distribution. Meituan adopts a "scenario‑driven" AI approach, applying computer‑vision techniques across its diverse local‑life services such as food, accommodation, travel, shopping, and entertainment.

Background and Motivation – Video has become a dominant medium, and Meituan seeks to extract richer insights from user‑generated short videos, especially in review contexts where video can convey more vivid information than text and images alone.

Video Multi‑Label Classification – To overcome the limitations of using only metadata or click behavior, Meituan builds a multi‑label classification system that tags video content explicitly. Challenges include constructing a robust label taxonomy, ensuring high precision while expanding coverage, and enabling incremental learning for evolving content.
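What distinguishes multi‑label tagging from ordinary classification is that tags are not mutually exclusive: each label fires independently through a sigmoid rather than competing under a softmax. A minimal sketch of the prediction step, with hypothetical label names and logits (not drawn from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_labels(logits, label_names, threshold=0.5):
    """Multi-label prediction: every label gets an independent sigmoid
    probability, and all labels above the threshold are emitted."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [name for name, p in zip(label_names, probs) if p >= threshold]

# hypothetical tags and backbone logits for one video
labels = ["food", "hotpot", "storefront", "night-market"]
tags = predict_labels([2.1, 0.3, -1.5, 1.0], labels)
# -> ["food", "hotpot", "night-market"]
```

A per‑label threshold (rather than a single global one) is a common refinement when precision requirements differ across tags.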

Initial Model Construction – Public datasets such as YouTube‑8M are used to pre‑train a teacher model, which then generates pseudo‑labels on unlabeled Meituan data. After confidence filtering and label propagation, a student model is fine‑tuned on the cleaned data, iterating this process to improve accuracy.
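The confidence‑filtering step of this teacher–student loop can be sketched as follows; the threshold and probabilities are illustrative assumptions, not Meituan's actual values:

```python
import numpy as np

def filter_pseudo_labels(probs, min_conf=0.9):
    """Keep only teacher predictions whose top-class probability clears a
    confidence threshold; ambiguous samples stay unlabeled for this round."""
    probs = np.asarray(probs, dtype=float)
    confident = probs.max(axis=1) >= min_conf
    return np.flatnonzero(confident), probs.argmax(axis=1)[confident]

teacher_probs = [
    [0.95, 0.03, 0.02],  # confident -> kept as pseudo-label 0
    [0.40, 0.35, 0.25],  # ambiguous -> dropped
    [0.05, 0.92, 0.03],  # confident -> kept as pseudo-label 1
]
idx, pseudo = filter_pseudo_labels(teacher_probs)
# idx -> [0, 2], pseudo -> [0, 1]
```

The student is then fine‑tuned only on the kept indices, and the process repeats with the improved model as the new teacher.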

Label‑System Expansion – Horizontal expansion adds new tags by clustering feature embeddings of unlabeled videos and manually assigning relevant labels; vertical refinement introduces fine‑grained tags (e.g., specific dishes) by leveraging existing fine‑grained image‑classification models.
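Horizontal expansion as described starts by clustering embeddings of unlabeled videos so annotators can inspect each cluster and propose candidate tags. A toy k‑means sketch, with 2‑D points standing in for real embeddings (a production pipeline would use features from a learned video encoder):

```python
import numpy as np

def cluster_embeddings(emb, k, iters=10):
    """Minimal k-means over video feature embeddings; the resulting clusters
    are reviewed manually to name candidate new labels.
    Deterministic init from the first k points, for simplicity."""
    emb = np.asarray(emb, dtype=float)
    centers = emb[:k].copy()
    for _ in range(iters):
        # distance of every point to every center, then nearest assignment
        d = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = emb[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return assign

# toy 2-D "embeddings": two clearly separated groups
emb = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 4.9]]
assign = cluster_embeddings(emb, k=2)
```

Clusters that map cleanly onto a semantic concept become new horizontal tags; mixed clusters are discarded or split with a larger k.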

Efficient Continuous Data Iteration – A loop of online feedback, active learning (selecting high‑uncertainty samples), and weak supervision continuously enriches the training set, allowing the model to adapt to new content and distribution shifts.
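One standard way to pick the high‑uncertainty samples this active‑learning step calls for is entropy ranking over the current model's class probabilities. A minimal sketch with illustrative probabilities:

```python
import numpy as np

def select_uncertain(probs, budget):
    """Active learning: rank unlabeled samples by prediction entropy and
    send the most uncertain ones to human annotators."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(-entropy)[:budget]  # indices, most uncertain first

probs = [
    [0.98, 0.01, 0.01],  # confident -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform -> highest entropy
    [0.70, 0.20, 0.10],
]
to_label = select_uncertain(probs, budget=1)
# -> [1]
```

The annotation budget is spent where the model is least sure, which is what lets the training set track new content and distribution shifts efficiently.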

Intelligent Video Cover – Two strategies are employed: (1) General cover selection based on importance scores derived from visual quality, motion stability, and information density; (2) Semantic cover selection that aligns cover frames with user intent using multimodal (visual‑text) weak supervision to generate "segment‑label" pairs.
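Strategy (1) can be sketched as a weighted score over per‑frame signals; the signal names and weights below are illustrative assumptions, not Meituan's actual scoring formula:

```python
def pick_cover(frames, w_quality=0.5, w_stability=0.3, w_info=0.2):
    """Score each candidate frame as a weighted sum of per-frame signals
    (each normalized to [0, 1]) and return the index of the best frame."""
    scores = [
        w_quality * f["quality"] + w_stability * f["stability"] + w_info * f["info"]
        for f in frames
    ]
    return max(range(len(frames)), key=scores.__getitem__)

frames = [
    {"quality": 0.9, "stability": 0.2, "info": 0.4},  # sharp but shaky
    {"quality": 0.8, "stability": 0.9, "info": 0.7},  # well balanced
    {"quality": 0.3, "stability": 0.8, "info": 0.9},  # blurry
]
best = pick_cover(frames)
# -> 1 (the balanced frame wins)
```

Strategy (2) would replace or augment these hand‑set weights with a semantic match score between the frame and the user's query intent.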

Video Generation – A hierarchical pipeline processes diverse raw assets (images, video clips, audio, text) to produce promotional short videos. In the restaurant scenario, AI selects high‑quality images, performs aesthetic ranking, and applies smart cropping and animation; in the hotel scenario, additional audio beat detection and script‑guided sequencing enhance the final output.
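The beat‑detection step mentioned for the hotel scenario amounts to finding cut points in an audio energy envelope so clip transitions land on beats. A simplified peak‑picking sketch (real systems would use onset detection and tempo estimation from an audio library):

```python
import numpy as np

def beat_cut_points(energy, min_gap=2):
    """Pick candidate cut points at local maxima of a per-frame audio energy
    envelope, enforcing a minimum gap so clips are not cut too frequently."""
    energy = np.asarray(energy, dtype=float)
    cuts, last = [], -min_gap
    for i in range(1, len(energy) - 1):
        is_peak = energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]
        if is_peak and i - last >= min_gap:
            cuts.append(i)
            last = i
    return cuts

# toy energy envelope with two clear peaks
env = [0.1, 0.9, 0.2, 0.3, 0.8, 0.2, 0.1]
cuts = beat_cut_points(env)
# -> [1, 4]
```

Script‑guided sequencing then maps ordered shots (e.g., exterior, lobby, room) onto the segments between these cut points.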

Conclusion and Outlook – As AI and communication technologies (e.g., 5G) evolve, short video will play an increasingly vital role in local‑life services. Future work will focus on unsupervised, self‑supervised, and multimodal content understanding to further unlock value from massive video data.

Tags: computer vision · AI · video generation · video analysis · Meituan · intelligent cover · multi-label classification
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
