ESI‑Bench: The ImageNet‑Style Benchmark for Embodied Spatial Intelligence

ESI‑Bench, introduced by Fei‑Fei Li's team, turns the observer into an active agent to evaluate embodied spatial intelligence across 10 task categories and 3,081 instances. Its results suggest that perception is not the bottleneck, that action strategies are critical, that imperfect 3D reconstructions can hurt performance, and that current models show action blindness and metacognitive deficits relative to humans.

Machine Learning Algorithms & Natural Language Processing

ESI‑Bench Overview

ESI‑Bench is a benchmark for embodied spatial intelligence that closes the perception‑action loop by requiring an AI agent to actively move, observe and interact in a simulated environment.

It contains 10 task categories, 29 sub‑categories and 3,081 task instances built on the OmniGibson simulation platform with scene assets from the BEHAVIOR‑1K library. Tasks are grounded in Spelke’s four core spatial‑cognition systems: object representation, layout & geometry, quantity representation, and goal‑directed action.

Design principle (Action Enforcement): for every question the agent must actively explore (e.g., walk around an object, pick it up, pour water) because the correct answer is never present in a single static image.

Example tasks:

Rigid‑containment – the agent must approach containers, open lids or remove obstacles and inspect interiors to decide whether all objects fit.

Liquid‑volume – the agent must pour water into cups or lift them to infer capacity differences.
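
The Action Enforcement principle and the example tasks above can be illustrated with a toy perception-action loop. This is a minimal sketch under stated assumptions, not the benchmark's code: `ToyOcclusionEnv`, `explore_and_answer`, and the viewpoint names are hypothetical, and nothing here uses the OmniGibson API. The point is only that the initial static view carries no answer, so the agent has to act before it can respond.

```python
from dataclasses import dataclass


@dataclass
class ToyOcclusionEnv:
    """Hypothetical stand-in for a simulated scene (not the OmniGibson API)."""
    hidden_count: int = 2        # objects sitting behind an occluder
    visible_from: str = "back"   # the only viewpoint that reveals them
    viewpoint: str = "front"     # where the agent starts

    def observe(self) -> dict:
        # The count is only observable from the informative viewpoint.
        visible = self.viewpoint == self.visible_from
        return {
            "viewpoint": self.viewpoint,
            "count_behind_occluder": self.hidden_count if visible else None,
        }

    def step(self, action: str) -> None:
        # In this toy, an action is simply the name of the viewpoint to move to.
        self.viewpoint = action


def explore_and_answer(env, candidate_views, max_steps=5):
    """Visit viewpoints in order until the question is answerable or the budget ends."""
    trajectory = []
    for view in candidate_views[:max_steps]:
        env.step(view)
        obs = env.observe()
        trajectory.append(obs["viewpoint"])
        if obs["count_behind_occluder"] is not None:
            return obs["count_behind_occluder"], trajectory
    return None, trajectory  # abstain rather than guess


env = ToyOcclusionEnv()
static_view = env.observe()  # the single static image: no answer available
answer, trajectory = explore_and_answer(env, ["front", "left", "back", "top"])
print(static_view["count_behind_occluder"], answer, trajectory)
```

In this sketch the order of `candidate_views` plays the role of the action strategy: a poor ordering combined with a small `max_steps` leaves the question unanswered even though the observation model itself is perfect.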

Experimental Setup

State‑of‑the‑art multimodal models (GPT‑5, Gemini 3.1, Gemini 1.5) were evaluated under three paradigms: passive perception (oracle view), active exploration, and oracle (ground‑truth) conditions. Human participants performed the same tasks to provide a baseline.

Core Findings

Perception is not the bottleneck; action is. When Gemini 3.1 receives the optimal viewpoint on a partial‑occlusion task, accuracy jumps from 14.6 % to 95.1 %. The same model under active exploration achieves only 53.9 % because it fails to select the informative viewpoint, illustrating “Action Blindness” – a cascade where a poor action yields a poor view, leading to further poor actions.
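
The three accuracies quoted above can be turned into a rough measure of how much of the available headroom active exploration actually recovers. This is a back-of-the-envelope sketch, assuming that 14.6 % is the score without the optimal viewpoint and that all three figures refer to the same partial-occlusion task; the function name `headroom_recovered` is illustrative only.

```python
def headroom_recovered(baseline, active, oracle):
    """Return (gap to oracle in points, fraction of baseline-to-oracle headroom recovered)."""
    gap_to_oracle = oracle - active
    fraction = (active - baseline) / (oracle - baseline)
    return round(gap_to_oracle, 1), round(fraction, 3)


# Figures quoted in the text for Gemini 3.1 on the partial-occlusion task.
gap, fraction = headroom_recovered(baseline=14.6, active=53.9, oracle=95.1)
print(gap, fraction)  # 41.2 0.488
```

Under these assumptions, active exploration closes a little under half of the distance between the baseline and the oracle view, and the remaining 41.2 points are attributable to viewpoint selection rather than to perception.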

Imperfect 3D reconstruction can be detrimental. Supplying a perfect 3‑D scene improves performance (Gemini’s score rises from 44.0 % to 60.4 % on a material‑transparency task). By contrast, feeding scenes reconstructed by the VGGT model drops geometry‑configuration accuracy from 27.5 % to 9.9 %, showing that noisy 3‑D inputs act as “toxic” data.
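
To compare the two effects on a common scale, the quoted scores can be expressed as absolute changes (percentage points) and relative changes. A small sketch using only the numbers above; the helper `change` is a hypothetical name.

```python
def change(before, after):
    """Return (absolute change in points, relative change in percent), rounded to 1 decimal."""
    absolute = after - before
    relative = absolute / before * 100.0
    return round(absolute, 1), round(relative, 1)


# Ground-truth 3D scene, material-transparency task.
print(change(44.0, 60.4))  # (16.4, 37.3)
# VGGT reconstruction, geometry-configuration task.
print(change(27.5, 9.9))   # (-17.6, -64.0)
```

The two rows come from different tasks, so the comparison is indicative only: a perfect scene adds roughly a third to the score on one task, while a noisy reconstruction removes almost two thirds of it on another.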

Metacognitive deficiency. Models often stop after a few steps with high confidence even when evidence is ambiguous, producing spatial hallucinations. Humans collect additional observations, seek disconfirming views and lower confidence in uncertain situations, leading to substantially higher accuracy under active exploration (e.g., humans 88.3 % vs. GPT‑5 64.2 % on a physical‑contact task).
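
One way to picture the "keep looking while the evidence is ambiguous" behaviour described for humans is an evidence-based stopping rule. The following is an illustrative sketch, not the method used in the paper: it assumes a binary question (for example, whether two objects are in contact), treats each observation as a noisy yes/no vote with a fixed, assumed `reliability`, and keeps exploring until the posterior passes an assumed `threshold`. All names and default values are hypothetical.

```python
def update_posterior(p_yes, vote_yes, reliability):
    """Bayes update of P(yes) from one noisy yes/no observation."""
    like_yes = reliability if vote_yes else 1.0 - reliability
    like_no = 1.0 - reliability if vote_yes else reliability
    numerator = like_yes * p_yes
    return numerator / (numerator + like_no * (1.0 - p_yes))


def explore_until_confident(votes, reliability=0.7, threshold=0.95, prior=0.5):
    """Consume observations until confident; abstain (None) if never confident."""
    p_yes = prior
    used = 0
    for vote in votes:
        p_yes = update_posterior(p_yes, vote, reliability)
        used += 1
        if max(p_yes, 1.0 - p_yes) >= threshold:
            break
    confident = max(p_yes, 1.0 - p_yes) >= threshold
    answer = (p_yes >= 0.5) if confident else None
    return answer, used, p_yes


# Consistent evidence: the rule stops early with an answer.
print(explore_until_confident([True] * 10)[:2])
# Conflicting evidence: the rule uses every observation and still abstains.
print(explore_until_confident([True, False, True, False])[:2])
```

A model that answers after one or two views with high stated confidence corresponds to running this loop with a very low `threshold`; the contrast with the conflicting-evidence case is what the text calls a metacognitive deficit.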

Additional Observations

Active‑exploration strategies such as moving behind objects, switching to top‑down views, picking up items, or pouring out water emerge spontaneously in models that receive no explicit instruction.

When provided with a ground‑truth “god‑view” 3‑D scene, Gemini’s accuracy on material‑transparency rises by 16.4 percentage points (44.0 % → 60.4 %).

VGGT‑based reconstructions degrade performance dramatically (27.5 % → 9.9 % on geometry configuration), indicating that low‑quality 3‑D reconstructions are worse than raw 2‑D images.

Human participants exhibit stronger “cognitive caution”: they gather more observations, actively search for viewpoints that could falsify their hypothesis, and reduce confidence when the scene is ambiguous.

References

arXiv preprint: https://arxiv.org/abs/2605.18746

Project website: https://esi-bench.github.io/

