Meituan's 10 Papers at CVPR 2025 and ICLR 2025
This article summarizes ten selected ICLR 2025 and CVPR 2025 papers from Meituan. Topics include LLM alignment with ternary preferences, temporal‑decay DPO, a joint‑embedding predictive architecture for denoising, 4‑bit quantization, token‑focused VQA, universal visual segmentation, document understanding, fine‑grained spatio‑temporal modeling, visual quality evaluation, and ultra‑high‑resolution diffusion. It also announces in‑person and online sharing sessions hosted by Meituan.
Enhancing LLM Alignment with Ternary Preferences
Paper type: Poster
PDF: https://openreview.net/forum?id=utkGLDSNOk
Abstract: The authors propose a ternary‑preference alignment method for large language models (LLMs) to address the limitations of binary preference models such as Bradley‑Terry, which struggle with noisy labels and tied comparisons. They introduce the TOBT model, which explicitly models three outcomes (preferred, not preferred, and tie), and design an algorithm that uses ternary data to make alignment more robust.
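To make the ternary idea concrete, the sketch below uses the classical Rao-Kupper extension of Bradley-Terry, one standard way to give ties their own probability. This is an illustrative assumption and not necessarily the paper's exact TOBT formulation; the function name and the tie parameter `theta` are hypothetical.

```python
import math

def ternary_preference_probs(reward_a, reward_b, theta=1.5):
    """Return (P(a preferred), P(b preferred), P(tie)).

    Rao-Kupper model (assumed here for illustration): theta >= 1 controls
    how much probability mass ties receive. theta == 1 recovers the binary
    Bradley-Terry model, where the tie probability is 0.
    """
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    diff = reward_a - reward_b
    # exp(r_a) / (exp(r_a) + theta * exp(r_b)), written via the reward gap
    p_a = 1.0 / (1.0 + theta * math.exp(-diff))
    p_b = 1.0 / (1.0 + theta * math.exp(diff))
    p_tie = 1.0 - p_a - p_b
    return p_a, p_b, p_tie
```

With equal rewards the tie probability is (theta - 1) / (theta + 1), so a binary model (theta = 1) has no way to express "these two answers are equally good".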
Results: Experiments on Mistral‑7B and Llama 3‑8B show accuracy gains of 6.5% (in‑distribution) and 3.2% (out‑of‑distribution) on datasets such as UltraFeedback and RewardBench. The method also outperforms DPO on MT‑Bench, PIQA, ARC, and MMLU, and remains superior on traditional binary alignment tasks.
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective
Paper type: Poster
PDF: https://openreview.net/forum?id=OspqtLVUN5
Abstract: Analysis of existing DPO methods shows they treat each reward in a sequence uniformly, ignoring temporal dynamics. KL‑divergence analysis on three open‑source models reveals that early tokens are more affected by DPO, and the effect diminishes with position, confirming prior findings that early tokens are critical for alignment.
Method: The paper introduces Temporal‑decay DPO (D²PO), which adds a time‑decay factor γ to dynamically adjust each reward’s contribution during training. D²PO preserves DPO’s efficiency while enhancing early‑token influence.
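A minimal sketch of how such a decay factor could enter the DPO objective, assuming the per-token log-ratio log(π/π_ref) at position t is weighted by γ^t before summation. The exact weighting in the paper may differ; the function names and the default values of `beta` and `gamma` are illustrative assumptions.

```python
import math

def decayed_sum(token_log_ratios, gamma):
    """Sum per-token log(pi/pi_ref) values, weighting position t by gamma**t."""
    return sum((gamma ** t) * lr for t, lr in enumerate(token_log_ratios))

def temporal_decay_dpo_loss(chosen_log_ratios, rejected_log_ratios,
                            beta=0.1, gamma=0.98):
    """DPO-style loss where earlier tokens contribute more when gamma < 1.

    gamma == 1 recovers the usual uniform sum over tokens.
    """
    margin = beta * (decayed_sum(chosen_log_ratios, gamma)
                     - decayed_sum(rejected_log_ratios, gamma))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under this weighting, the same log-ratio improvement lowers the loss more when it occurs at an early position than at a late one.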
Results: D²PO achieves significant improvements on benchmarks such as AlpacaEval 2, Arena‑Hard, and MT‑Bench without harming general capability.
Denoising with a Joint‑Embedding Predictive Architecture
Paper type: Poster
PDF: https://arxiv.org/pdf/2410.03755
Abstract: To address the bottleneck of multimodal generation models on continuous data, the authors propose D‑JEPA, which fuses the strengths of Joint‑Embedding Predictive Architecture (JEPA) and diffusion models. JEPA excels at self‑supervised representation learning, while diffusion models can model arbitrary distributions but lack integration with advanced representations.
Key contributions:
Reinterpret JEPA as a generalized masked‑image modeling framework and extend it to a continuous‑space autoregressive generation paradigm.
Introduce a flow‑matching‑based diffusion loss that retains JEPA’s structured representations while precisely modeling token‑level distributions.
Build a unified training pipeline that combines JEPA’s efficient representation learning with diffusion’s fine‑grained distribution modeling.
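The flow-matching loss named in the contributions above can be sketched for a single continuous token. This is the generic linear-path formulation, not the paper's exact objective. In D‑JEPA the velocity predictor would presumably be conditioned on the JEPA representation, which is omitted here. All function names are hypothetical.

```python
def flow_matching_example(noise, data, t):
    """Build one training example on the linear path from noise to data.

    x_t = (1 - t) * noise + t * data, and the regression target is the
    constant velocity (data - noise) along that path.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between the predicted and target velocity."""
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)
```

During training, a network would receive `x_t`, `t`, and a conditioning vector, and its output would be scored with `flow_matching_loss`.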
Results: Systematic experiments demonstrate that D‑JEPA outperforms single‑model baselines and existing fusion architectures in computational efficiency, generation quality, and cross‑modal transferability.
QQQ: Quality Quattuor‑Bit Quantization for Large Language Models
Paper type: Workshop
PDF: https://arxiv.org/pdf/2406.09904
Abstract: QQQ proposes a 4‑bit weight / 8‑bit activation (W4A8) quantization scheme that maintains model accuracy while greatly accelerating inference. It uses adaptive smoothing and a Hessian‑based compensation mechanism to mitigate precision loss typical of W4A8 quantization, without requiring extensive retraining.
Implementation: Two quantization granularities are designed: per‑channel and per‑group, each with a custom W4A8 GEMM kernel achieving 3.67× and 3.29× the speed of FP16 GEMM respectively.
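To illustrate what per‑group granularity means, here is a minimal sketch of symmetric 4‑bit weight quantization with one scale per group. It deliberately omits QQQ's adaptive smoothing, Hessian‑based compensation, and custom GEMM kernels. The function names and the tiny `group_size` are assumptions chosen for readability.

```python
def quantize_per_group(weights, group_size=4, bits=4):
    """Symmetric integer quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the 4-bit range
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_per_group(codes, scales, group_size=4):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Per‑channel quantization corresponds to using one group per output channel. Smaller groups track local weight ranges more closely at the cost of storing more scales.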
Results: Compared with FP16, W8A8, and W4A16, QQQ delivers 2.24×, 2.10×, and 1.25× speedups respectively, while matching state‑of‑the‑art LLM quantization accuracy.
TokenFocus‑VQA: Enhancing Text‑to‑Image Alignment with Position‑Aware Focus and Multi‑Perspective Aggregations on LVLMs
Paper type: Poster
PDF: https://anonymous.4open.science/r/tf-D0C8/TokenFocus-VQA%20Enhancing%20Text-to-Image%20Alignment%20with%20Position-Aware%20Focus%20and%20Multi-Perspective%20Aggregations%20on%20LVLMs.pdf
Abstract: Building on the CVPR 2025 NTIRE Challenge, the authors treat text‑to‑image quality assessment as a visual‑question‑answering task with a position‑specific loss. The loss emphasizes probability distributions of semantically important words at their spatial locations, improving fine‑grained text‑image matching.
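One plausible reading of a position‑specific loss is a negative log‑likelihood in which selected token positions receive extra weight. The sketch below shows only that generic idea. The paper's actual loss, its choice of positions, and the names `focus_positions` and `focus_weight` are not given in this summary and are assumptions.

```python
import math

def position_weighted_nll(token_probs, focus_positions, focus_weight=2.0):
    """Weighted mean negative log-likelihood over answer tokens.

    token_probs[i] is the probability the model assigns to the correct
    token at position i. Positions in focus_positions count focus_weight
    times as much as the others.
    """
    total, weight_sum = 0.0, 0.0
    for i, p in enumerate(token_probs):
        w = focus_weight if i in focus_positions else 1.0
        total += -w * math.log(p)
        weight_sum += w
    return total / weight_sum
```

With an empty `focus_positions` set this reduces to the ordinary mean NLL. Adding a poorly predicted key token to the set raises the loss and so pushes training toward that token.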
Additional technique: An ensemble of multiple LVLMs aggregates diverse perspectives, further boosting performance.
Results: On the NTIRE 2025 T2I quality benchmark, TokenFocus‑VQA ranks second on both the public (84.45%) and private (84.26%) leaderboards, outperforming traditional evaluation methods in capturing subtle text‑image correspondences.
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
Paper type: CVPR Main Conference
PDF: https://arxiv.org/pdf/2411.17606
Abstract: The work leverages Visual Large Language Models (VLLMs) to address universal segmentation for images and videos. Existing unified segmentation methods struggle to adapt across scenes and to handle complex reasoning tasks.
Method: HyperSeg introduces a pixel‑level segmentation model built on VLLM, integrating hybrid entity recognition, fine‑grained visual perception modules, and a temporal adapter to handle both static and sequential data.
Results: Experiments confirm HyperSeg’s effectiveness on general segmentation tasks and more challenging reasoning‑perception tasks.
Marten: Visual Question Answering with Mask Generation for Multi‑modal Document Understanding
Paper type: CVPR Main Conference
PDF: https://arxiv.org/pdf/2503.14140
Abstract: The paper introduces a VQA‑with‑Mask (VQAMask) task to align visual and textual modalities in multimodal large language models for document image understanding. A mask generator, discarded at inference, ensures spatial alignment between visual text and image regions, reducing hallucinations.
Dataset: A 6‑million‑sample MTMask6M dataset supports the VQAMask task.
Model: The proposed Marten model, trained with VQAMask, improves speed and accuracy and lowers deployment cost for document image understanding.
LLaVA‑ST: A Multimodal Large Language Model for Fine‑Grained Spatial‑Temporal Understanding
Paper type: CVPR Main Conference
PDF: https://arxiv.org/pdf/2501.08282
Abstract: LLaVA‑ST tackles the explosion of spatio‑temporal coordinate combinations and loss of fine‑grained detail in video feature compression. It introduces a novel feature‑alignment mechanism, a spatio‑temporal compressor, and a multi‑stage training strategy.
Benchmark: The authors construct a 4.3M‑sample ST‑Align dataset covering STVG, ELC, and SVG tasks.
Results: LLaVA‑ST excels across 11 benchmarks involving fine‑grained temporal understanding, spatial localization, and spatio‑temporal cross‑modal tasks, showing promise for video understanding, embodied AI, and autonomous driving.
Q‑Eval‑100K: Evaluating Visual Quality and Alignment Level for Text‑to‑Vision Content
Paper type: CVPR Main Conference
PDF: https://arxiv.org/pdf/2503.02357
Abstract: The authors build the largest AIGC quality‑evaluation dataset, Q‑Eval‑100K, containing 100K human‑annotated scores (60K images, 40K videos) focusing on visual quality and alignment. They also propose the unified evaluation framework Q‑Eval‑Score.
Findings: Larger and higher‑quality annotation data substantially improve evaluation performance on both visual quality and alignment dimensions.
Diffusion‑4K: Ultra‑High‑Resolution Image Synthesis with Latent Diffusion Model
Paper type: CVPR Main Conference
PDF: https://cvpr.thecvf.com/virtual/2025/poster/32468
Abstract: Two contributions are presented: (1) A 4K‑resolution benchmark, Aesthetic‑4K, combining non‑DL metrics (GLCM, DCT) and GPT‑4o‑generated captions to assess local texture and structural artifacts; (2) A wavelet‑based diffusion generation paradigm that splits features into high‑ and low‑frequency components and enforces high‑frequency constraints, preserving fine texture and structure.
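The frequency split in the second contribution can be illustrated with a one‑level 2D Haar transform and a loss restricted to the detail bands. Haar is chosen here only because it is the simplest wavelet. The wavelet and the form of the high‑frequency constraint used in the paper are not specified in this summary, so both functions are hypothetical.

```python
def haar_split(image):
    """One-level 2D Haar transform of a grid with even height and width.

    Returns (low, high): low is the half-resolution smooth band, and high
    is a list of three detail bands carrying the high-frequency content.
    """
    low, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, len(image), 2):
        low_row, dc_row, dr_row, dd_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + d + e) / 2.0)
            dc_row.append((a - b + d - e) / 2.0)  # difference across columns
            dr_row.append((a + b - d - e) / 2.0)  # difference across rows
            dd_row.append((a - b - d + e) / 2.0)  # diagonal difference
        low.append(low_row)
        d_cols.append(dc_row)
        d_rows.append(dr_row)
        d_diag.append(dd_row)
    return low, [d_cols, d_rows, d_diag]

def high_frequency_loss(predicted, target):
    """Mean squared error computed only on the detail bands."""
    _, pred_high = haar_split(predicted)
    _, target_high = haar_split(target)
    total, count = 0.0, 0
    for pred_band, target_band in zip(pred_high, target_high):
        for pred_row, target_row in zip(pred_band, target_band):
            for p, t in zip(pred_row, target_row):
                total += (p - t) ** 2
                count += 1
    return total / count
```

A flat image has all‑zero detail bands, so a loss of this kind penalizes only mismatches in texture and edges, not in overall brightness.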
Implementation: An efficient block‑VAE enables Flux‑12B to generate 4K images on consumer GPUs such as the NVIDIA GeForce RTX 4090.
Deployment: Diffusion‑4K is applied in a drone perception pipeline, improving downstream detection and segmentation tasks.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
