WBench: 20 Cutting‑Edge World Models Face a Comprehensive Interactive Benchmark

WBench, a new benchmark created by Meituan LongCat and Fudan University, evaluates 20 state‑of‑the‑art video and world‑model systems across 289 test cases and 1,058 interaction rounds. It measures video quality, setting adherence, interaction fidelity, consistency, and physical compliance, and finds that no model yet excels in all five dimensions.

Consistency · Interactive Benchmark · Multimodal Evaluation

0 likes · 10 min read

SuanNi

Mar 2, 2026 · Artificial Intelligence

Why Leading AI Models Flunk the New ‘Humanity’s Last Exam’ Benchmark

The newly released Humanity’s Last Exam (HLE) benchmark, featuring 2,500 rigorously crafted multimodal questions across more than 100 disciplines, exposes severe shortcomings in leading AI models: their accuracy stays below 50% and their confidence is poorly calibrated, highlighting the need for deeper AI evaluation.

Artificial Intelligence · Humanity's Last Exam · Multimodal Evaluation

0 likes · 13 min read

Tencent Technical Engineering

Jun 30, 2025 · Artificial Intelligence

How iMatch Won CVPR2025 NTIRE Image-Text Alignment: Techniques & Benchmarks

The IH‑VQA team’s iMatch solution won the CVPR2025 NTIRE Image‑Text Alignment championship by introducing dual‑model fusion, pseudo‑label data augmentation, Q‑Align probability mapping, and visual augmentations. The paper also presents a comprehensive iMatch benchmark evaluating 23 state‑of‑the‑art text‑to‑image models across multiple resolutions.

AI quality assessment · CVPR2025 · Multimodal Evaluation

0 likes · 15 min read

Sohu Tech Products

Jul 31, 2024 · Artificial Intelligence

MMEvalPro: A Trustworthy Benchmark for Evaluating Multimodal Large Models

MMEvalPro, a new benchmark created by researchers from Peking University, Chinese Academy of Medical Sciences, CUHK and Alibaba, augments existing multimodal datasets with perception and knowledge questions and introduces a Genuine Accuracy metric. The results show that top multimodal models still lag far behind humans and that performance on prior tests was partly driven by shortcuts.

MMEvalPro · Multimodal Evaluation · benchmark

0 likes · 11 min read
