
Meituan's 10 Papers at CVPR 2025 and ICLR 2025

This article summarizes ten selected ICLR 2025 and CVPR 2025 papers. The topics are LLM alignment, temporal‑decay DPO, joint‑embedding predictive architecture, 4‑bit quantization, token‑focused VQA, universal visual segmentation, document understanding, fine‑grained spatio‑temporal modeling, visual quality evaluation, and ultra‑high‑resolution diffusion. It also announces in‑person and online sharing sessions hosted by Meituan.

Meituan Technology Team

Apr 10, 2025


Enhancing LLM Alignment with Ternary Preferences

Paper type: Poster

PDF: https://openreview.net/forum?id=utkGLDSNOk

Abstract: The authors propose a ternary‑preference alignment method for large language models (LLMs). It addresses the limitations of binary preference models such as Bradley‑Terry, which struggle with noisy labels and ties. They introduce the TOBT model, which explicitly models three outcomes (preferred, not preferred, and tie), and design an algorithm that uses ternary data to make alignment more robust.
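The summary above does not give TOBT's formula. As a rough illustration of how a third "tie" outcome can be added to a Bradley‑Terry model, the sketch below uses the classic Rao‑Kupper extension. This is a stand‑in technique chosen for illustration, not necessarily the paper's formulation. The function names `ternary_probs` and `ternary_nll`, the label strings, and the tie parameter `theta` are all assumptions made for this example.

```python
import math

def ternary_probs(r_a, r_b, theta=1.5):
    """Rao-Kupper probabilities (a preferred, b preferred, tie).

    r_a, r_b: scalar rewards of the two responses (assumed inputs).
    theta >= 1 controls how much probability mass goes to ties;
    theta == 1 recovers the binary Bradley-Terry model (no ties).
    """
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    d = r_a - r_b
    p_a = 1.0 / (1.0 + theta * math.exp(-d))
    p_b = 1.0 / (1.0 + theta * math.exp(d))
    p_tie = max(0.0, 1.0 - p_a - p_b)  # clamp tiny negative rounding error
    return p_a, p_b, p_tie

def ternary_nll(r_a, r_b, label, theta=1.5):
    """Negative log-likelihood of a ternary label: 'a', 'b', or 'tie'."""
    p_a, p_b, p_tie = ternary_probs(r_a, r_b, theta)
    p = {"a": p_a, "b": p_b, "tie": p_tie}[label]
    return -math.log(max(p, 1e-12))  # floor avoids log(0) when theta == 1
```

With equal rewards and `theta=1.5`, the three outcomes get probabilities 0.4, 0.4, and 0.2, so a tie label carries a training signal instead of being discarded or forced into a binary choice.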

Results: Experiments on Mistral‑7B and Llama 3‑8B show accuracy gains of 6.5% (in‑distribution) and 3.2% (out‑of‑distribution) on datasets such as UltraFeedback and RewardBench. The method also outperforms DPO on MT‑Bench, PIQA, ARC, and MMLU, and remains superior on traditional binary alignment tasks.

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Paper type: Poster

PDF: https://openreview.net/forum?id=OspqtLVUN5

Abstract: An analysis of existing DPO methods shows that they weight every token‑level reward in a sequence equally, ignoring temporal dynamics. A KL‑divergence analysis on three open‑source models reveals that DPO affects early tokens the most and that the effect diminishes with position, confirming prior findings that early tokens are critical for alignment.

Method: The paper introduces Temporal‑decay DPO (D²PO), which adds a time‑decay factor γ to dynamically adjust each reward’s contribution during training. D²PO preserves DPO’s efficiency while enhancing early‑token influence.
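One way to picture the decay factor is the minimal sketch below, which assumes the decay is applied as γ^t to the per‑token log‑probability ratio at position t before the usual DPO sigmoid loss. The paper's exact formulation may differ. The function names, the argument layout (lists of per‑token log‑probabilities), and the default values of `beta` and `gamma` are assumptions made for this example.

```python
import math

def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def decayed_sum(policy_logps, ref_logps, gamma):
    """Sum of per-token log-ratios, with position t weighted by gamma**t."""
    return sum(
        (gamma ** t) * (p - r)
        for t, (p, r) in enumerate(zip(policy_logps, ref_logps))
    )

def decayed_dpo_loss(chosen_policy, chosen_ref,
                     rejected_policy, rejected_ref,
                     beta=0.1, gamma=0.98):
    """DPO-style loss -log(sigmoid(margin)) with temporally decayed rewards.

    Each argument is a list of per-token log-probabilities (assumed inputs).
    gamma == 1.0 gives the uniform weighting of standard DPO.
    """
    margin = beta * (
        decayed_sum(chosen_policy, chosen_ref, gamma)
        - decayed_sum(rejected_policy, rejected_ref, gamma)
    )
    return _softplus(-margin)  # equals -log(sigmoid(margin))
```

With `gamma < 1`, the same log‑ratio gain lowers the loss more when it occurs on an early token than on a late one, which is the behavior the method description points to.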

Results: D²PO achieves significant improvements on benchmarks such as AlpacaEval2, Arena‑Hard, and MT‑Bench without harming general capability.

Denoising with a Joint‑Embedding Predictive Architecture

Paper type: Poster

PDF: https://arxiv.org/pdf/2410.03755

Abstract: To address the bottleneck of multimodal generation models on continuous data, the authors propose D‑JEPA, which fuses the strengths of Joint‑Embedding Predictive Architecture (JEPA) and diffusion models. JEPA excels at self‑supervised representation learning, while diffusion models can model arbitrary distributions but lack integration with advanced representations.

Key contributions:

Reinterpret JEPA as a generalized masked‑image modeling framework and extend it to a continuous‑space autoregressive generation paradigm.

Introduce a flow‑matching‑based diffusion loss that retains JEPA’s structured representations while precisely modeling token‑level distributions.

Build a unified training pipeline that combines JEPA’s efficient representation learning with diffusion’s fine‑grained distribution modeling.
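To make the flow‑matching loss mentioned above more concrete, here is a minimal sketch of a generic conditional flow‑matching objective for a single token, assuming a straight‑line path between noise and data. It is not taken from the D‑JEPA code. The callable `predict_velocity`, the conditioning vector `cond` (standing in for a representation produced by the encoder), and all function names are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, cond, x0=None, t=None, rng=random):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t, cond) is a hypothetical network that returns
    a list of the same length as x1. For the straight path used here,
    the regression target is the constant velocity x1 - x0.
    """
    if x0 is None:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    if t is None:
        t = rng.random()                        # time in [0, 1)
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = predict_velocity(x_t, t, cond)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)
```

In training, `x0` and `t` would be resampled at every step and the loss averaged over tokens. They are exposed as arguments here only so the sketch can be checked deterministically.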

Results: Systematic experiments demonstrate that D‑JEPA outperforms single‑model baselines and existing fusion architectures in computational efficiency, generation quality, and cross‑modal transferability.

QQQ: Quality Quattuor‑Bit Quantization for Large Language Models

Paper type: Workshop

PDF: https://arxiv.org/pdf/2406.09904

Abstract: QQQ proposes a 4‑bit weight / 8‑bit activation (W4A8) quantization scheme that maintains model accuracy while greatly accelerating inference. It uses adaptive smoothing and a Hessian‑based compensation mechanism to mitigate precision loss typical of W4A8 quantization, without requiring extensive retraining.

Implementation: QQQ supports two quantization granularities, per‑channel and per‑group, each with a custom W4A8 GEMM kernel. The two kernels run at 3.67× and 3.29× the speed of FP16 GEMM, respectively.
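The difference between the two granularities can be shown with a minimal sketch, assuming plain symmetric round‑to‑nearest 4‑bit weight quantization. It leaves out QQQ's adaptive smoothing, Hessian‑based compensation, and GEMM kernels. The function names and the `group_size` parameter are assumptions made for this example.

```python
def quantize_row(row, group_size=None, bits=4):
    """Symmetric quantization of one output channel (a row of weights).

    group_size=None -> per-channel: one scale for the whole row.
    group_size=k    -> per-group: one scale per k consecutive weights.
    Returns (integer codes, list of scales).
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit signed codes
    size = group_size or len(row)
    codes, scales = [], []
    for start in range(0, len(row), size):
        group = row[start:start + size]
        # Fall back to 1.0 when the group is all zeros.
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(
            max(-qmax - 1, min(qmax, round(w / scale))) for w in group
        )
    return codes, scales

def dequantize_row(codes, scales, group_size=None):
    """Map integer codes back to approximate float weights."""
    size = group_size or len(codes)
    return [q * scales[i // size] for i, q in enumerate(codes)]
```

Per‑group scales cost extra storage and kernel work but limit the damage an outlier weight does to its neighbors, which is consistent with the per‑group kernel being somewhat slower than the per‑channel one.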

Results: Compared with FP16, W8A8, and W4A16, QQQ delivers 2.24×, 2.10×, and 1.25× speedups respectively, while matching state‑of‑the‑art LLM quantization accuracy.

TokenFocus‑VQA: Enhancing Text‑to‑Image Alignment with Position‑Aware Focus and Multi‑Perspective Aggregations on LVLMs

Paper type: Poster

PDF: https://anonymous.4open.science/r/tf-D0C8/TokenFocus-VQA%20Enhancing%20Text-to-Image%20Alignment%20with%20Position-Aware%20Focus%20and%20Multi-Perspective%20Aggregations%20on%20LVLMs.pdf

Abstract: Building on the CVPR 2025 NTIRE Challenge, the authors treat text‑to‑image quality assessment as a visual‑question‑answering task with a position‑specific loss. The loss emphasizes probability distributions of semantically important words at their spatial locations, improving fine‑grained text‑image matching.

Additional technique: An ensemble of multiple LVLMs aggregates diverse perspectives, further boosting performance.

Results: On the NTIRE 2025 T2I quality benchmark, TokenFocus‑VQA ranks second on both the public (84.45%) and private (84.26%) leaderboards, outperforming traditional evaluation methods in capturing subtle text‑image correspondences.