How iMatch Won CVPR2025 NTIRE Image-Text Alignment: Techniques & Benchmarks
The IH‑VQA team’s iMatch solution won the CVPR 2025 NTIRE Image‑Text Alignment championship by combining dual‑model fusion, pseudo‑label data augmentation, Q‑Align probability mapping, and visual data augmentation. The accompanying paper also presents a comprehensive iMatch benchmark evaluating 23 state‑of‑the‑art text‑to‑image models across multiple resolutions.
Task Background
Recent rapid advances in text‑to‑image (T2I) models such as Dreamina, DALL·E 3 and Midjourney have lowered creation barriers, but evaluating these models—especially for image‑text alignment, aesthetics, and structural integrity—remains challenging. The CVPR 2025 NTIRE Text‑to‑Image Generation Model Quality Assessment competition, co‑organized by TikTok and Nankai University, aims to advance AI‑generated content evaluation by establishing a fine‑grained image‑text alignment benchmark.
Competition Introduction
2.1 Dataset
The competition uses the EvalMuse benchmark, containing 40 K image‑text pairs generated by 20 mainstream T2I models from 4 K diverse prompts. Labels include prompt‑level and element‑level alignment scores.
2.2 Evaluation Rules
Ranking is based on a Main Score that combines Spearman’s rank correlation coefficient (SRCC) for monotonicity, Pearson’s linear correlation coefficient (PLCC) for accuracy, and element‑level accuracy (ACC). Higher SRCC, PLCC and ACC indicate better performance.
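The three metrics are standard and easy to compute. Below is a minimal NumPy sketch of each; note the tie handling in the rank computation and the 0.5 threshold for element‑level accuracy are illustrative assumptions, and the official weighting that combines them into the Main Score is not reproduced here.

```python
import numpy as np

def plcc(x, y):
    # Pearson linear correlation: measures prediction accuracy (linearity)
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def srcc(x, y):
    # Spearman rank correlation: Pearson correlation of the rank vectors
    # (measures monotonicity; this simple ranking ignores ties)
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return plcc(rank(x), rank(y))

def element_acc(pred, gt, thresh=0.5):
    # Element-level accuracy: binarize predictions and compare to labels
    # (the 0.5 threshold is an assumption, not the official protocol)
    pred = np.asarray(pred, float) >= thresh
    gt = np.asarray(gt, float) >= thresh
    return float(np.mean(pred == gt))
```

Since SRCC depends only on rank order while PLCC rewards calibrated magnitudes, a model can score well on one and poorly on the other, which is why the competition tracks both.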
2.3 Competition Stages
Two stages: Development (Jan 30–Mar 14) with 30 K training pairs and 10 K test pairs, attracting 371 teams and 1,883 submissions; and Test (Mar 14–Mar 22) with 5 K test pairs, yielding 507 submissions, of which 12 teams provided final code and fact sheets.
Team Solution
3.1 Dual‑Model Fusion
The team identified two independent scoring metrics—Total_score (SRCC + PLCC) and Element_score (ACC)—and trained separate models specialized for each. A selective score‑fusion strategy combines the best predictions from both models, further enhanced by using Element predictions as features for Total_score.
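One way to picture the selective fusion is a simple blend in which the Element model's per‑element probabilities act as an auxiliary signal for the Total_score prediction. The form below (and the `alpha` weight) is an assumed sketch, not the team's exact recipe:

```python
from typing import List

def fuse_predictions(total_pred: float, element_preds: List[float],
                     alpha: float = 0.1) -> float:
    """Selective-fusion sketch (assumed form): the alignment score from the
    Total_score-specialized model is nudged by the mean of the
    Element_score-specialized model's per-element probabilities."""
    element_signal = sum(element_preds) / len(element_preds)
    return (1 - alpha) * total_pred + alpha * element_signal
```

The intuition is that a prompt whose individual elements are mostly satisfied should not receive a low overall alignment score, so the element signal regularizes the total prediction.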
3.2 Pseudo‑Label Data Augmentation
During the development phase, the team generated pseudo‑labels for the development test set using the best model, selected a proportion of these, and mixed them into the original training set, thereby enriching the data and improving final performance.
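A common way to pick "a proportion" of pseudo‑labels is to keep only the samples the model is most confident about; the confidence criterion and keep ratio below are illustrative assumptions, since the article does not state the team's selection rule:

```python
def select_pseudo_labels(samples, keep_ratio=0.3):
    """Pseudo-label selection sketch (assumed criterion): keep the top
    fraction of pseudo-labelled samples ranked by model confidence.
    `samples` is a list of (item, pseudo_label, confidence) tuples."""
    ranked = sorted(samples, key=lambda s: s[2], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

The retained pairs are then mixed into the original training set and the model is retrained on the enlarged data.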
3.3 Q‑Align Probability Mapping
Inspired by Q‑Align for video quality, the team mapped rating levels to a 1‑15 scale and applied probability weighting to produce continuous scores, which yielded measurable performance gains.
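The core of the Q‑Align‑style mapping is to treat the model's logits over the 15 rating levels as a distribution and take the probability‑weighted expectation, which turns a discrete rating head into a continuous score. A minimal sketch (the 15‑level range follows the article; everything else is a generic softmax expectation):

```python
import math

def expected_score(logits, levels=range(1, 16)):
    """Q-Align-style probability mapping (sketch): softmax over the 15
    rating-level logits, then a probability-weighted sum of the level
    values yields a continuous score in [1, 15]."""
    m = max(logits)                                # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * lv for p, lv in zip(probs, levels))
```

Because the output is a smooth expectation rather than an argmax over levels, small changes in the logits move the score continuously, which is what makes the mapping useful for correlation metrics like SRCC and PLCC.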
3.4 Visual Data Augmentation
Standard visual augmentations (brightness adjustment, slight deformation, mild cropping) were applied while avoiding rotations that could corrupt positional text cues, resulting in additional performance improvements.
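A rotation‑free pipeline of this kind might look like the NumPy sketch below; the jitter and crop magnitudes are illustrative assumptions, and the key property is that nothing reorders spatial content, so prompts describing positions ("a cat to the left of a dog") remain valid after augmentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Rotation-free augmentation sketch (assumed parameters): random
    brightness jitter plus a mild random crop that preserves most of the
    frame and all left/right, top/bottom relationships."""
    # brightness: scale pixel values by up to +/-10%
    out = np.clip(img * rng.uniform(0.9, 1.1), 0, 255)
    # mild crop: trim at most ~5% from each border
    h, w = out.shape[:2]
    dy = int(rng.integers(0, max(1, h // 20)))
    dx = int(rng.integers(0, max(1, w // 20)))
    return out[dy:h - dy, dx:w - dx]
```

Rotations and flips are deliberately absent: either would invalidate positional text cues in the prompt and corrupt the element‑level labels.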
Competition Results
The IH‑VQA team secured first place, with their method published as “Instruction‑augmented Multimodal Alignment for Image‑Text and Element Matching” in the CVPR 2025 Workshop and presented orally. Their model outperformed the runner‑up by +2.4 % SRCC, +1.6 % PLCC, and +0.4 % ACC.
iMatch Benchmark
5.1 Benchmark Construction
The official test set contains 10,671 image‑text pairs with diverse queries. To address imbalance across models, the team built a fairer benchmark covering 23 models (18 open‑source and 5 closed‑source) and generated images at 1024×1024 resolution for 913 queries, plus 512×512 and 768×768 for the open‑source models only, resulting in 53,867 controlled pairs.
5.2 Result Analysis
Evaluation shows ByteDance’s seedream‑3.0 leading in overall score, while the emerging HIDREAM model excels in several fine‑grained categories. Across the three resolutions tested for open‑source models (512, 768, 1024), HIDREAM consistently ranks first, demonstrating robust multi‑resolution performance.
The team plans to continue advancing multimodal quality assessment, extending to audio‑video domains and real‑time monitoring.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.