ATLAS: One Word Unifies Agentic and Latent Visual Reasoning
ATLAS introduces a discrete functional token that serves both as an agentic operation and as a latent reasoning unit, letting large multimodal models perform visual reasoning without external tools or intermediate image generation. It reaches competitive results with SFT followed by RL training and a token-level gradient-anchoring technique.
Meta AI and the Chinese University of Hong Kong introduce ATLAS, a visual‑reasoning paradigm that replaces the split among Unified Models, Agentic Visual Reasoning, and Latent Visual Reasoning with a single discrete token – the Functional Token.
Limitations of Existing Paradigms
Unified Models generate explicit intermediate images, incurring high computational cost and requiring visual supervision. Agentic methods invoke external tools (e.g., drawing lines, cropping) which adds latency and needs extra process supervision. Latent approaches keep reasoning inside the model but suffer from poor scalability, weak interpretability, and often need special training mechanisms.
ATLAS Core Idea
ATLAS shows that a single word can serve both as an Agentic Operation (telling the model which visual action to perform) and as a Latent Visual Reasoning Unit (participating in internal reasoning without generating images). These Functional Tokens – e.g., <|Line|>, <|Shape|>, <|Arrow|>, <|Text|> – are ordinary vocabulary items generated by next‑token prediction, yet they trigger specific visual actions inside the model.
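Because Functional Tokens are ordinary vocabulary items, they can be read straight out of a generated trace like any other text. The snippet below is a minimal sketch of that idea, not code from the ATLAS repository: the token strings come from the article, while the helper name `find_functional_tokens` and the example trace are hypothetical.

```python
import re

# Token names are taken from the article; everything else here is illustrative.
FUNCTIONAL_TOKENS = ("<|Line|>", "<|Shape|>", "<|Arrow|>", "<|Text|>")
_TOKEN_RE = re.compile("|".join(re.escape(t) for t in FUNCTIONAL_TOKENS))


def find_functional_tokens(generated: str) -> list[tuple[str, int]]:
    """Return (token, character offset) pairs in order of appearance."""
    return [(m.group(0), m.start()) for m in _TOKEN_RE.finditer(generated)]


# Hypothetical reasoning trace, invented for this example.
trace = "Connect A to C <|Line|> then mark the triangle <|Shape|>. Answer: 30"
found = find_functional_tokens(trace)
for token, offset in found:
    print(token, offset)
```

In a real model the tokens would be single entries in the tokenizer vocabulary rather than substrings matched by a regex; the string form is used here only to keep the example self-contained.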
Training Procedure
ATLAS is trained in two stages:
SFT (Supervised Fine‑Tuning): The ATLAS‑178K dataset, covering more than 40 visual‑reasoning tasks, maps complex visual operations to Functional Tokens. The model learns not only final answers but also the intermediate token‑level reasoning trajectory, e.g., inserting <|Line|> when a line is needed, <|Shape|> for region marking, <|Arrow|> for directional relations, and <|Text|> for annotation.
RL (Reinforcement Learning): A reward balances answer correctness and appropriate use of Functional Tokens. Over‑generation of tokens is penalized to prevent token spam, encouraging emission only when a genuine visual operation is required.
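The RL reward described above can be sketched as a correctness term minus a penalty for excess Functional Tokens. The article does not give the actual formula, so the function name `shaped_reward`, the `budget` threshold, and the `penalty` coefficient below are all assumptions chosen for illustration.

```python
def shaped_reward(
    answer_correct: bool,
    n_functional: int,
    budget: int = 4,       # assumed: tokens allowed before any penalty applies
    penalty: float = 0.1,  # assumed: cost per Functional Token over the budget
) -> float:
    """Toy reward: 1 for a correct answer, minus a cost for token spam."""
    correctness = 1.0 if answer_correct else 0.0
    excess = max(0, n_functional - budget)
    return correctness - penalty * excess


# A correct answer within budget keeps the full reward;
# the same answer with six Functional Tokens is penalized for the two extra.
print(shaped_reward(True, 2))
print(shaped_reward(True, 6))
```

A linear penalty above a fixed budget is only one possible shaping; the point is that emitting more Functional Tokens than the task needs lowers the reward even when the answer is right.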
LA‑GRPO: Addressing Gradient Dilution
Functional Tokens are sparse: a rollout may contain only a few of them among many ordinary tokens, so a sequence-level reward shared across every token dilutes their gradient signal. ATLAS introduces Latent-Anchored GRPO (LA-GRPO), which adds a token-level anchor that amplifies gradients for critical Functional Tokens when they contribute to a correct answer, so that the model learns when each visual-action word matters.
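One way to picture the token-level anchor is as a per-token multiplier applied on top of the usual group-relative advantage. The sketch below is an assumption about the general shape of such a scheme, not the LA-GRPO formulation from the paper: the group normalization follows standard GRPO, while the `anchor` factor, the function names, and the rule "amplify Functional Tokens only in correct rollouts" are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_advantages(rewards, functional_masks, correct_flags, anchor=2.0):
    """Broadcast each rollout's advantage to its tokens, scaling Functional
    Tokens by `anchor` (an assumed hyperparameter) when the answer is correct."""
    advantages = group_advantages(rewards)
    result = []
    for adv, mask, correct in zip(advantages, functional_masks, correct_flags):
        weights = [anchor if (is_fn and correct) else 1.0 for is_fn in mask]
        result.append([adv * w for w in weights])
    return result


# Two rollouts of three tokens each; the middle token is a Functional Token.
rewards = [1.0, 0.0]
masks = [[False, True, False], [False, True, False]]
correct = [True, False]
per_token = token_level_advantages(rewards, masks, correct)
print(per_token)
```

Without the multiplier, every token in a rollout receives the same advantage, so a single Functional Token among hundreds of ordinary tokens contributes a tiny share of the update; the anchor raises that share without changing the sequence-level reward.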
Experimental Validation
Benchmarks on challenging geometric, spatial‑relation, multi‑view, counting, and fine‑grained visual tasks demonstrate that ATLAS achieves competitive performance while remaining lightweight. ATLAS does not require external tool execution, intermediate image generation, or changes to the standard autoregressive training pipeline.
Attention Analysis
When the model generates a Functional Token, its attention focuses on the relevant visual region: <|Shape|> attends to target objects, <|Line|> to geometric structures, and <|Text|> to areas needing annotation. This shows that the tokens actively guide internal visual processing rather than serving as mere placeholders.
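A simple way to quantify this kind of focus is the share of a token's attention that falls on the image patches belonging to the relevant region. The article does not say how the analysis was measured, so the metric, the function name `region_attention_mass`, and the toy numbers below are assumptions for illustration.

```python
def region_attention_mass(attention, region_mask):
    """Fraction of total attention that lands on patches inside the region.

    attention:   attention weights from one generated token to each image patch
    region_mask: True for patches inside the region of interest
    """
    total = sum(attention)
    if total == 0:
        return 0.0
    inside = sum(a for a, in_region in zip(attention, region_mask) if in_region)
    return inside / total


# Toy example: four image patches, the middle two cover the target object.
attention = [0.1, 0.6, 0.2, 0.1]
region = [False, True, True, False]
mass = region_attention_mass(attention, region)
print(round(mass, 3))
```

Comparing this value for a Functional Token against the value for neighboring ordinary tokens would indicate whether the Functional Token concentrates attention on the region more than its context does.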
Significance
ATLAS proposes a concise visual‑action language that preserves scalability, generalization, and interpretability while eliminating costly intermediate steps. The approach offers a new capability interface for multimodal models, bridging explicit tool‑based reasoning and opaque latent reasoning.
Paper: https://arxiv.org/pdf/2605.15198 | Project Page: https://atlas-oneword.github.io | Code: https://github.com/ZiyuGuo99/ATLAS
