How iMatch Won CVPR2025 NTIRE Image-Text Alignment: Techniques & Benchmarks
The IH‑VQA team’s iMatch solution won the CVPR 2025 NTIRE Image‑Text Alignment championship by combining dual‑model fusion, pseudo‑label data augmentation, Q‑Align probability mapping, and visual data augmentation. The accompanying paper also presents a comprehensive iMatch benchmark evaluating 23 state‑of‑the‑art text‑to‑image models across multiple resolutions.
Task Background
Recent rapid advances in text‑to‑image (T2I) models such as Dreamina, DALL·E 3 and Midjourney have lowered creation barriers, but evaluating these models—especially for image‑text alignment, aesthetics, and structural integrity—remains challenging. The CVPR 2025 NTIRE Text‑to‑Image Generation Model Quality Assessment competition, co‑organized by TikTok and Nankai University, aims to advance AI‑generated content evaluation by establishing a fine‑grained image‑text alignment benchmark.
Competition Introduction
2.1 Dataset
The competition uses the EvalMuse benchmark, containing 40 K image‑text pairs generated by 20 mainstream T2I models from 4 K diverse prompts. Labels include prompt‑level and element‑level alignment scores.
2.2 Evaluation Rules
Ranking is based on a Main Score that combines Spearman’s rank correlation coefficient (SRCC) for monotonicity, Pearson’s linear correlation coefficient (PLCC) for accuracy, and element‑level accuracy (ACC). Higher SRCC, PLCC and ACC indicate better performance.
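The three metrics are standard and easy to compute. Below is a minimal NumPy sketch of each; note the tie handling in the rank computation and the 0.5 threshold for element‑level accuracy are illustrative assumptions, and the official weighting that combines them into the Main Score is not reproduced here.

```python
import numpy as np

def plcc(x, y):
    # Pearson linear correlation: measures prediction accuracy (linearity)
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def srcc(x, y):
    # Spearman rank correlation: Pearson correlation of the rank vectors
    # (measures monotonicity; this simple ranking ignores ties)
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return plcc(rank(x), rank(y))

def element_acc(pred, gt, thresh=0.5):
    # Element-level accuracy: binarize predictions and compare to labels
    # (the 0.5 threshold is an assumption, not the official protocol)
    pred = np.asarray(pred, float) >= thresh
    gt = np.asarray(gt, float) >= thresh
    return float(np.mean(pred == gt))
```

Since SRCC depends only on rank order while PLCC rewards calibrated magnitudes, a model can score well on one and poorly on the other, which is why the competition tracks both.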
2.3 Competition Stages
Two stages: Development (Jan 30–Mar 14) with 30 K training pairs and 10 K test pairs, attracting 371 teams and 1,883 submissions; and Test (Mar 14–Mar 22) with 5 K test pairs, yielding 507 submissions, of which 12 teams provided final code and fact sheets.
Team Solution
3.1 Dual‑Model Fusion
The team identified two independent scoring metrics—Total_score (SRCC + PLCC) and Element_score (ACC)—and trained separate models specialized for each. A selective score‑fusion strategy combines the best predictions from both models, further enhanced by using Element predictions as features for Total_score.
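One way to picture the selective fusion is a simple blend in which the Element model's per‑element probabilities act as an auxiliary signal for the Total_score prediction. The form below (and the `alpha` weight) is an assumed sketch, not the team's exact recipe:

```python
from typing import List

def fuse_predictions(total_pred: float, element_preds: List[float],
                     alpha: float = 0.1) -> float:
    """Selective-fusion sketch (assumed form): the alignment score from the
    Total_score-specialized model is nudged by the mean of the
    Element_score-specialized model's per-element probabilities."""
    element_signal = sum(element_preds) / len(element_preds)
    return (1 - alpha) * total_pred + alpha * element_signal
```

The intuition is that a prompt whose individual elements are mostly satisfied should not receive a low overall alignment score, so the element signal regularizes the total prediction.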
3.2 Pseudo‑Label Data Augmentation
During the development phase, the team generated pseudo‑labels for the development test set using the best model, selected a proportion of these, and mixed them into the original training set, thereby enriching the data and improving final performance.
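A common way to pick "a proportion" of pseudo‑labels is to keep only the samples the model is most confident about; the confidence criterion and keep ratio below are illustrative assumptions, since the article does not state the team's selection rule:

```python
def select_pseudo_labels(samples, keep_ratio=0.3):
    """Pseudo-label selection sketch (assumed criterion): keep the top
    fraction of pseudo-labelled samples ranked by model confidence.
    `samples` is a list of (item, pseudo_label, confidence) tuples."""
    ranked = sorted(samples, key=lambda s: s[2], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

The retained pairs are then mixed into the original training set and the model is retrained on the enlarged data.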
3.3 Q‑Align Probability Mapping
Inspired by Q‑Align for video quality, the team mapped rating levels to a 1‑15 scale and applied probability weighting to produce continuous scores, which yielded measurable performance gains.
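The core of the Q‑Align‑style mapping is to treat the model's logits over the 15 rating levels as a distribution and take the probability‑weighted expectation, which turns a discrete rating head into a continuous score. A minimal sketch (the 15‑level range follows the article; everything else is a generic softmax expectation):

```python
import math

def expected_score(logits, levels=range(1, 16)):
    """Q-Align-style probability mapping (sketch): softmax over the 15
    rating-level logits, then a probability-weighted sum of the level
    values yields a continuous score in [1, 15]."""
    m = max(logits)                                # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * lv for p, lv in zip(probs, levels))
```

Because the output is a smooth expectation rather than an argmax over levels, small changes in the logits move the score continuously, which is what makes the mapping useful for correlation metrics like SRCC and PLCC.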
3.4 Visual Data Augmentation
Standard visual augmentations (brightness adjustment, slight deformation, mild cropping) were applied while avoiding rotations that could corrupt positional text cues, resulting in additional performance improvements.
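A rotation‑free pipeline of this kind might look like the NumPy sketch below; the jitter and crop magnitudes are illustrative assumptions, and the key property is that nothing reorders spatial content, so prompts describing positions ("a cat to the left of a dog") remain valid after augmentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Rotation-free augmentation sketch (assumed parameters): random
    brightness jitter plus a mild random crop that preserves most of the
    frame and all left/right, top/bottom relationships."""
    # brightness: scale pixel values by up to +/-10%
    out = np.clip(img * rng.uniform(0.9, 1.1), 0, 255)
    # mild crop: trim at most ~5% from each border
    h, w = out.shape[:2]
    dy = int(rng.integers(0, max(1, h // 20)))
    dx = int(rng.integers(0, max(1, w // 20)))
    return out[dy:h - dy, dx:w - dx]
```

Rotations and flips are deliberately absent: either would invalidate positional text cues in the prompt and corrupt the element‑level labels.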
Competition Results
The IH‑VQA team secured first place, with their method published as “Instruction‑augmented Multimodal Alignment for Image‑Text and Element Matching” in the CVPR 2025 Workshop and presented orally. Their model outperformed the runner‑up by +2.4 % SRCC, +1.6 % PLCC, and +0.4 % ACC.
iMatch Benchmark
5.1 Benchmark Construction
The official test set contains 10,671 image‑text pairs with diverse queries. To address imbalance across models, the team built a fairer benchmark covering 23 models (18 open‑source and 5 closed‑source) and generated images at 1024×1024 resolution for 913 queries, plus 512×512 and 768×768 for the open‑source models only, resulting in 53,867 controlled pairs.
5.2 Result Analysis
Evaluation shows ByteDance’s seedream‑3.0 leading in overall score, while the emerging HIDREAM model excels in several fine‑grained categories. Across the three resolutions tested for open‑source models (512, 768, 1024), HIDREAM consistently ranks first, demonstrating robust multi‑resolution performance.
The team plans to continue advancing multimodal quality assessment, extending to audio‑video domains and real‑time monitoring.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.