ATLAS: One Word Unifies Agentic and Latent Visual Reasoning

ATLAS introduces a discrete functional token that serves simultaneously as an agentic operation and as a latent reasoning unit, so large multimodal models can perform visual reasoning without external tools or intermediate image generation. Trained with SFT followed by RL and a token‑level gradient‑anchor technique, it achieves competitive results.

Machine Heart

Meta AI and the Chinese University of Hong Kong introduce ATLAS, a visual‑reasoning paradigm that replaces the split among Unified Models, Agentic Visual Reasoning, and Latent Visual Reasoning with a single discrete token – the Functional Token.

Limitations of Existing Paradigms

Unified Models generate explicit intermediate images, incurring high computational cost and requiring visual supervision. Agentic methods invoke external tools (e.g., for drawing lines or cropping), which adds latency and requires extra process supervision. Latent approaches keep reasoning inside the model but suffer from poor scalability and weak interpretability, and often need special training mechanisms.

ATLAS Core Idea

ATLAS shows that a single word can serve both as an Agentic Operation (telling the model which visual action to perform) and as a Latent Visual Reasoning Unit (participating in internal reasoning without generating images). These Functional Tokens – e.g., <|Line|>, <|Shape|>, <|Arrow|>, <|Text|> – are ordinary vocabulary items generated by next‑token prediction, yet they trigger specific visual actions inside the model.
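To make the "ordinary vocabulary item" idea concrete, here is a minimal sketch of how a generated response containing Functional Tokens could be split into plain-text steps and visual-action markers. The four token strings come from the article; the helper names `split_trajectory` and `count_functional_tokens`, and the sample response, are hypothetical and are not taken from the ATLAS code base.

```python
import re

# Token strings as listed in the article; everything else is illustrative.
FUNCTIONAL_TOKENS = ("<|Line|>", "<|Shape|>", "<|Arrow|>", "<|Text|>")

# One regex that matches any of the functional tokens literally.
_TOKEN_PATTERN = re.compile("|".join(re.escape(t) for t in FUNCTIONAL_TOKENS))


def split_trajectory(text):
    """Split generated text into ordered ("text", s) and ("token", s) segments."""
    segments = []
    pos = 0
    for match in _TOKEN_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("token", match.group()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def count_functional_tokens(text):
    """Count how often each functional token appears in the text."""
    counts = {tok: 0 for tok in FUNCTIONAL_TOKENS}
    for match in _TOKEN_PATTERN.finditer(text):
        counts[match.group()] += 1
    return counts


# Hypothetical model output, written by hand for illustration.
response = "Connect A to C <|Line|> then mark the triangle <|Shape|> so the answer is 30."
for kind, value in split_trajectory(response):
    print(kind, "|", value)
```

Nothing here executes a drawing tool: the tokens are just strings in the output stream, which is the point of the paradigm as the article describes it.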

Training Procedure

ATLAS is trained in two stages:

SFT (Supervised Fine‑Tuning): The ATLAS‑178K dataset, covering more than 40 visual‑reasoning tasks, maps complex visual operations to Functional Tokens. The model learns not only final answers but also the intermediate token‑level reasoning trajectory, e.g., inserting <|Line|> when a line is needed, <|Shape|> for region marking, <|Arrow|> for directional relations, and <|Text|> for annotation.

RL (Reinforcement Learning): A reward balances answer correctness and appropriate use of Functional Tokens. Over‑generation of tokens is penalized to prevent token spam, encouraging emission only when a genuine visual operation is required.
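The RL reward described above can be sketched as a correctness term minus a capped penalty for excess Functional Tokens. The article does not give the actual reward formula, so the function name `shaped_reward` and the parameters `free_tokens`, `penalty_per_extra`, and `max_penalty` are assumptions chosen purely for illustration.

```python
FUNCTIONAL_TOKENS = ("<|Line|>", "<|Shape|>", "<|Arrow|>", "<|Text|>")


def shaped_reward(response, is_correct,
                  free_tokens=3, penalty_per_extra=0.1, max_penalty=0.5):
    """Toy reward: 1.0 for a correct answer, minus a capped penalty for token spam.

    All thresholds are hypothetical; the article only states that
    over-generation of Functional Tokens is penalized.
    """
    # Count every occurrence of every functional token in the response.
    n_func = sum(response.count(tok) for tok in FUNCTIONAL_TOKENS)
    # Only tokens beyond the free budget are penalized.
    excess = max(0, n_func - free_tokens)
    penalty = min(max_penalty, penalty_per_extra * excess)
    correctness = 1.0 if is_correct else 0.0
    return correctness - penalty


concise = "Draw the diagonal <|Line|> and read off the angle."
spammy = "<|Line|>" * 10 + " the answer is 30."
print(shaped_reward(concise, is_correct=True))
print(shaped_reward(spammy, is_correct=True))
```

Capping the penalty keeps a correct but verbose response above an incorrect one, which is one way to discourage spam without discouraging token use altogether.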

LA‑GRPO: Addressing Gradient Dilution

Functional Tokens are sparse, so standard sequence‑level rewards dilute their gradient signal. ATLAS introduces Latent‑Anchored GRPO (LA‑GRPO), which adds a token‑level anchor to amplify gradients for critical Functional Tokens when they contribute to a correct answer, ensuring the model learns the importance of each visual‑action word.
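The following sketch illustrates the dilution problem and one possible reading of a token-level anchor. The article does not specify the LA‑GRPO objective, so this is not the authors' formulation: `group_advantages` is the usual group-normalized advantage used in GRPO-style training, while `anchored_token_weights` and its `anchor_scale` parameter are hypothetical names for a simple per-token multiplier.

```python
from statistics import mean, pstdev

FUNCTIONAL_TOKENS = {"<|Line|>", "<|Shape|>", "<|Arrow|>", "<|Text|>"}


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_token_weights(tokens, advantage, is_correct, anchor_scale=2.0):
    """Per-token weight on the log-probability gradient for one response.

    Without anchoring every token shares the same sequence-level advantage,
    so a single functional token gets only 1/len(tokens) of the signal.
    Here (an assumption) functional tokens in a correct response are scaled up.
    """
    weights = []
    for tok in tokens:
        boost = anchor_scale if (is_correct and tok in FUNCTIONAL_TOKENS) else 1.0
        weights.append(advantage * boost)
    return weights


# Four sampled responses to the same prompt; rewards are made up for the example.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

tokens = ["Draw", "<|Line|>", "AC", ".", "Answer", ":", "30"]
weights = anchored_token_weights(tokens, advantages[0], is_correct=True)
for tok, w in zip(tokens, weights):
    print(f"{tok:10s} {w:.3f}")
```

In this toy setup the `<|Line|>` token receives twice the weight of its neighbors when the response is correct, and no extra weight when it is wrong.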

Experimental Validation

Benchmarks on challenging geometric, spatial‑relation, multi‑view, counting, and fine‑grained visual tasks demonstrate that ATLAS achieves competitive performance while remaining lightweight. ATLAS does not require external tool execution, intermediate image generation, or changes to the standard autoregressive training pipeline.

Attention Analysis

When the model generates a Functional Token, its attention focuses on the relevant visual region: <|Shape|> attends to target objects, <|Line|> to geometric structures, and <|Text|> to areas needing annotation. This shows that the tokens actively guide internal visual processing rather than serving as mere placeholders.
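One simple way to quantify this kind of observation is to measure how much of a token's attention over image patches falls inside a region of interest. The sketch below assumes a row-major patch grid and a box given in patch coordinates; the function `attention_mass_in_region` and the example numbers are hypothetical and do not reproduce the paper's analysis.

```python
def attention_mass_in_region(attn, grid_w, box):
    """Fraction of attention mass inside a rectangular patch region.

    attn   : flat list of non-negative attention weights, one per image patch,
             in row-major order.
    grid_w : number of patches per row.
    box    : (x0, y0, x1, y1) in patch coordinates, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    total = sum(attn)
    if total == 0:
        return 0.0
    inside = sum(
        w for i, w in enumerate(attn)
        if x0 <= i % grid_w < x1 and y0 <= i // grid_w < y1
    )
    return inside / total


# Made-up attention over a 2x2 patch grid at the step that emits <|Shape|>.
attn_at_shape_token = [1, 1, 1, 7]
target_box = (1, 1, 2, 2)  # bottom-right patch
print(attention_mass_in_region(attn_at_shape_token, grid_w=2, box=target_box))
```

A value near 1.0 would indicate that attention is concentrated on the target region, while a value near the region's share of the grid would indicate no particular focus.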

Significance

ATLAS proposes a concise visual‑action language that preserves scalability, generalization, and interpretability while eliminating costly intermediate steps. The approach offers a new capability interface for multimodal models, bridging explicit tool‑based reasoning and opaque latent reasoning.

Paper: https://arxiv.org/pdf/2605.15198 | Project Page: https://atlas-oneword.github.io | Code: https://github.com/ZiyuGuo99/ATLAS

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
