Tag

video‑audio evaluation

1 view collected around this technical thread.

Xiaohongshu Tech REDtech
Feb 17, 2025 · Artificial Intelligence

WorldSense: A New Benchmark for Evaluating Multimodal Large Models in Real‑World Scenarios

WorldSense, a new benchmark of 1,662 real‑world video‑audio clips and 3,172 QA pairs across 26 cognitive tasks, reveals that current multimodal large models achieve only 25%–48% accuracy, highlighting the crucial role of combined visual‑audio input and the difficulty of audio‑ and emotion‑related reasoning.

benchmark dataset · large models · model analysis