Overview of Meituan's Selected CVPR 2024 Papers and Online Sharing Event
Meituan's tech team highlights seven CVPR 2024 papers spanning OCR pre‑training, long‑tail semi‑supervised learning, visual AIGC, audio‑visual segmentation, and object detection trained on synthetic data. This overview summarizes each paper's abstract and reported results, and announces an online author‑talk session on June 27.
01. ODM: A Text‑Image Further Alignment Pre‑training Approach for Scene Text Detection and Spotting
Authors: Chen Duan, Pei Fu, Shan Guo, Qianyi Jiang, Xiaoming Wei (Meituan)
PDF: https://arxiv.org/pdf/2403.00303
Existing OCR pre‑training methods based on Masked Image Modeling (MIM) or Masked Language Modeling (MLM) have limited ability to align text prompts with the corresponding image regions. The paper introduces OCR‑Text Destylization Modeling (ODM), which converts diverse font styles in images to a unified style guided by text prompts, thereby improving alignment for scene‑text detection and end‑to‑end recognition. ODM also proposes a novel label‑generation strategy combined with a text controller module, reducing annotation cost and allowing large amounts of unlabeled data to be used for pre‑training. Experiments on several public OCR datasets demonstrate significant performance gains over prior pre‑training approaches.
02. BEM: Balanced and Entropy‑based Mix for Long‑Tailed Semi‑Supervised Learning
Authors: Hongwei Zheng, Linyuan Zhou, Han Li (SJTU), Jinming Su, Xiaoming Wei, Xiaoming Xu (Meituan)
PDF: https://arxiv.org/pdf/2404.01179
Long‑tail semi‑supervised learning (LTSSL) suffers from class imbalance and high uncertainty for minority classes. Traditional batch‑mixing cannot rebalance class distributions effectively. BEM addresses this by constructing a class‑balanced mix library and applying an entropy‑based sampling, selection, and loss scheme that simultaneously rebalances data quantity and class uncertainty. Empirical results on multiple benchmarks show that BEM consistently improves performance, complements existing rebalancing methods, and works across different data distributions, datasets, and SSL learners.
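As a rough illustration of how data quantity and class uncertainty can be rebalanced together, the sketch below blends an inverse-frequency score with a mean prediction-entropy score into per-class sampling weights. It is not the formulation used in the paper: the helper names `prediction_entropy` and `class_sampling_weights`, the blending coefficient `alpha`, and the toy numbers are all assumptions made for this example.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one softmax prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def class_sampling_weights(class_counts, class_entropy, alpha=0.5):
    """Blend inverse-frequency and mean-entropy scores into sampling weights.

    alpha is a hypothetical knob: 1.0 uses only data quantity,
    0.0 uses only uncertainty. The returned weights sum to 1.
    """
    inv_freq = [1.0 / c for c in class_counts]
    freq_total = sum(inv_freq)
    ent_total = sum(class_entropy)
    weights = []
    for f, e in zip(inv_freq, class_entropy):
        quantity_term = f / freq_total
        uncertainty_term = e / ent_total if ent_total > 0 else 1.0 / len(class_counts)
        weights.append(alpha * quantity_term + (1.0 - alpha) * uncertainty_term)
    return weights

# Toy long-tailed setting: class 0 is the head, class 2 is the tail.
counts = [1000, 100, 10]
mean_entropy = [
    prediction_entropy([0.9, 0.05, 0.05]),  # confident head class
    prediction_entropy([0.6, 0.3, 0.1]),
    prediction_entropy([0.4, 0.3, 0.3]),    # uncertain tail class
]
weights = class_sampling_weights(counts, mean_entropy)
```

In this toy setting the tail class receives the largest weight because it is both rare and uncertain. A class that is rare but already confidently predicted would be boosted less than under frequency-only rebalancing.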
03. Animating General Image with Large Visual Motion Model
Authors: Dengsheng Chen, Xiaoming Wei (Meituan), Xiaolin Wei
PDF: https://arxiv.org/pdf/2406.00973
The paper proposes a Large Visual Motion Model (LVMM) to overcome the limited scope of optical‑flow‑based image animation. LVMM consists of a flow‑prediction network (P), a neural rendering network (R), compression and reconstruction networks (E and D), and a latent diffusion model (e). Training proceeds in three stages: (1) P predicts optical flow between two frames; (2) R renders realistic motion using separate high‑frequency and low‑frequency flow branches; (3) after fixing P, E and D compress visual and motion features into distinct latent spaces, enabling e to generate coherent dynamic content. Decoupling visual and motion cues improves generalization, and experiments show that LVMM produces visually appealing motion while preserving content fidelity.
04. CustomListener: Text‑guided Responsive Interaction for User‑friendly Listening Head Generation
Authors: Xi Liu*, Ying Guo*, Cheng Zhen, Tong Li, Yingying Ao, Pengfei Yan (Meituan)
Project page: https://customlistener.github.io/
Existing listening‑head generation methods limit Listener control to predefined emotion tags. CustomListener enables arbitrary free‑text specifications of Listener attributes (identity, personality, habits, social relations). The pipeline first uses ChatGPT to generate a static textual prior from user‑defined text and Speaker content. A Static to Dynamic Portrait (SDP) module converts this prior into dynamic portrait tokens that encode rhythm and amplitude cues responsive to real‑time Speaker audio, video, and actions. A Past Guided Generation (PGG) module enforces consistency between adjacent segments by generating motion priors based on token similarity. Finally, a diffusion model conditioned on these tokens produces realistic Listener reactions aligned with the Speaker.
05. Cooperation Does Matter: Exploring Multi‑Order Bilateral Relations for Audio‑Visual Segmentation
Authors: Qi Yang (UCAS, CASIA), Xing Nie (UCAS, CASIA), Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan (Meituan), Shiming Xiang (UCAS, CASIA)
PDF: https://arxiv.org/pdf/2312.06462.pdf
The Audio‑Visual Segmentation (AVS) task requires pixel‑level segmentation of sounding objects, demanding audio‑driven visual understanding. COMBO introduces a Transformer framework that models three bilateral entanglements: pixel, modality, and temporal. A twin‑encoder generates precise visual features beyond generic Segment Anything Model (SAM) capabilities. A bilateral fusion module aligns visual features to audio cues and vice‑versa, focusing both modalities on sounding targets. An adaptive frame‑consistency loss leverages past and future frame information for temporal coherence. Comprehensive experiments on AVSBench‑Object and AVSBench‑Semantic show that COMBO outperforms state‑of‑the‑art methods.
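The idea of fusing in both directions can be sketched with plain cross-attention: visual tokens attend to audio tokens, audio tokens attend to visual tokens, and each side keeps a residual connection. This is a simplified illustration rather than COMBO's actual module. It has no learned projections, and the function names `cross_attend` and `bilateral_fusion` and the toy token values are assumptions for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same vectors."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended

def bilateral_fusion(visual, audio):
    """Update each modality with a residual cross-attention over the other."""
    visual_out = [[v + a for v, a in zip(vec, att)]
                  for vec, att in zip(visual, cross_attend(visual, audio))]
    audio_out = [[v + a for v, a in zip(vec, att)]
                 for vec, att in zip(audio, cross_attend(audio, visual))]
    return visual_out, audio_out

# Three visual tokens and one audio token, each a 2-dimensional toy feature.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5]]
fused_visual, fused_audio = bilateral_fusion(visual_tokens, audio_tokens)
```

Both outputs keep their original token counts, so the fused features can be passed on to later stages without reshaping.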
06. Intelligent Grimm – Open‑ended Visual Storytelling via Latent Diffusion Models
Authors: Chang Liu* (SJTU, Shanghai AI Lab), Haoning Wu* (SJTU), Yujie Zhong (Meituan), Xiaoyun Zhang (SJTU), Yanfeng Wang (SJTU, Shanghai AI Lab), Weidi Xie (SJTU, Shanghai AI Lab)
PDF: https://arxiv.org/pdf/2306.00973
The paper tackles open‑ended visual storytelling, generating coherent image sequences from a narrative. StoryGen is an autoregressive image generator equipped with a visual‑language context module that conditions each frame on the textual prompt and previous image‑caption pairs. To mitigate data scarcity, the authors construct StorySalon, a large‑scale dataset of paired image‑text sequences harvested from online videos and e‑books, covering diverse characters, plots, and artistic styles. Quantitative results demonstrate StoryGen’s superiority over baselines and its ability to generalize to unseen characters without additional fine‑tuning.
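The autoregressive conditioning described above can be sketched as a simple loop in which each new frame sees the current caption plus every earlier image-caption pair. This is only a control-flow sketch under stated assumptions: `generate_story` and `dummy_generator` are hypothetical names, images are represented as strings, and the stand-in generator is not a diffusion model.

```python
from typing import Callable, List, Tuple

Image = str  # placeholder type: a real system would use image tensors
Context = List[Tuple[Image, str]]  # previously generated (image, caption) pairs

def generate_story(
    captions: List[str],
    generate_frame: Callable[[str, Context], Image],
) -> List[Image]:
    """Generate one frame per caption, conditioning on all earlier frames."""
    context: Context = []
    frames: List[Image] = []
    for caption in captions:
        # Pass a copy so the generator cannot mutate the running context.
        frame = generate_frame(caption, list(context))
        frames.append(frame)
        context.append((frame, caption))
    return frames

def dummy_generator(prompt: str, context: Context) -> Image:
    """Stand-in for an image generator: records what it was conditioned on."""
    return f"frame[{prompt} | {len(context)} previous]"

story = generate_story(
    ["A fox leaves home", "The fox meets an owl", "They cross a river"],
    dummy_generator,
)
```

Because the context grows with each step, later frames can in principle stay consistent with characters and styles introduced earlier in the sequence.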
07. InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
Authors: Chengjian Feng, Yujie Zhong, Zequn Jie (Meituan), Weidi Xie (SJTU), Lin Ma (Meituan)
PDF: https://arxiv.org/pdf/2402.05937
Recent text‑to‑image diffusion models generate photorealistic images but lack the instance‑level annotations required for training detectors. InstaGen integrates a grounding detection head into a pretrained generative model, aligning class‑name text embeddings with diffusion‑derived visual features. A self‑training scheme expands coverage to classes absent from existing detectors. Experiments show that training on InstaGen‑generated data improves open‑vocabulary detection by 4.5 AP and improves performance in low‑data regimes by 1.2 to 5.2 AP, surpassing current state‑of‑the‑art methods.
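One common way to align class-name text embeddings with visual features is to score each region feature against each class embedding by cosine similarity and keep the best match above a threshold. The sketch below shows that generic idea only and is not InstaGen's grounding head. The names `label_regions` and `cosine_similarity`, the `threshold` value, and the hand-written 3-dimensional vectors are assumptions for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label_regions(region_features, class_embeddings, threshold=0.5):
    """Assign each region the best-matching class name, or None if below threshold."""
    labels = []
    for feature in region_features:
        best_name, best_score = None, threshold
        for name, embedding in class_embeddings.items():
            score = cosine_similarity(feature, embedding)
            if score > best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels

# Toy embeddings: real systems would use a text encoder and learned visual features.
class_embeddings = {"cat": [1.0, 0.1, 0.0], "bicycle": [0.0, 1.0, 0.2]}
region_features = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [-0.5, -0.5, 0.7]]
pseudo_labels = label_regions(region_features, class_embeddings)
```

Because classes are identified by text embeddings rather than fixed output indices, adding a new category only requires adding a new entry to `class_embeddings`.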
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
