Tagged articles

13 articles

May 28, 2026 · Artificial Intelligence

How AutoMoT Leverages Large‑Model Understanding for End‑to‑End Driving Decisions and Trajectory Planning

AutoMoT introduces a unified Vision‑Language‑Action model that combines a 4B Qwen3‑VL understanding expert with a 1.6B action expert via layer‑wise shared attention and asynchronous inference, achieving state‑of‑the‑art results on Bench2Drive and nuScenes while preserving general VLM capabilities.

Asynchronous Inference · AutoMoT · Autonomous Driving

0 likes · 10 min read

Machine Heart

May 22, 2026 · Artificial Intelligence

Can World Action Models Replace VLA? Nvidia’s New Embodied AI Paradigm Reviewed

The article reviews the emerging World Action Model (WAM) paradigm, critiques the limitations of Vision‑Language‑Action models, outlines cascaded and joint WAM architectures, discusses required data sources, evaluation metrics, and future challenges, positioning WAM as a new foundational approach for embodied AI.

Data Fusion · Embodied AI · Future State Prediction

0 likes · 14 min read

Machine Heart

May 22, 2026 · Artificial Intelligence

HiF-VLA: Motion‑Centric ‘Think‑While‑Doing’ World Action Model Breaks Short‑Sighted Limits

HiF-VLA introduces a motion‑centric bidirectional spatiotemporal reasoning framework with a joint‑expert module that simultaneously predicts future visual motion and generates high‑precision action sequences, eliminating visual redundancy, cutting inference latency and memory usage, and achieving superior success rates on long‑horizon benchmarks such as CALVIN and LIBERO‑LONG.

HiF-VLA · Motion Representation · Vision-Language-Action

0 likes · 9 min read

Machine Heart

May 16, 2026 · Artificial Intelligence

Why Robots Need World Models: A Joint Survey from Leading Institutions

This article surveys recent advances in robot world models, explaining why predictive models are essential for embodied intelligence, how they integrate with Vision‑Language‑Action systems, the various architectural approaches, benchmark trends, and the remaining challenges for reliable deployment.

Simulation · Survey · Vision-Language-Action

0 likes · 14 min read

Meituan Technology Team

Apr 23, 2026 · Artificial Intelligence

LARYBench Introduces an ImageNet‑Style Benchmark for Embodied Action Representations Learned from Human Video

LARYBench (Latent Action Representation Yielding Benchmark) provides the first systematic, ImageNet‑scale evaluation for implicit action representations derived from large‑scale human video, decoupling representation quality from downstream control, and shows that general‑purpose vision models outperform specialized embodied models in both action generalization and control precision across diverse robot morphologies and environments.

Embodied AI · Vision-Language-Action · action representation

0 likes · 13 min read

Machine Heart

Apr 18, 2026 · Artificial Intelligence

Eliminating ‘Think‑Then‑Act’ Stalls: StreamingVLA Boosts VLA Speed by 2.4×

StreamingVLA introduces action‑flow matching and adaptive early observation to parallelize generation, execution, and perception in vision‑language‑action models, cutting per‑action latency from 49.9 ms to 31.6 ms, reducing stall time 6.5‑fold, and achieving up to 2.4× end‑to‑end speedup in LIBERO benchmarks and real‑world robot tests.

LIBERO · Latency · Parallel Execution

0 likes · 13 min read

Machine Heart

Apr 11, 2026 · Artificial Intelligence

Why VLA Pioneers Are Abandoning Vision‑Language‑Action Models

Generalist AI’s GEN-1 model achieves over 99% success and 2‑3× speed gains with only a tenth of the data. Its founders argue that vision‑language‑action (VLA) models are merely a crutch and urge a shift toward goal‑driven training entirely from scratch for physical AGI.

GEN-1 · Generalist AI · Goal-driven research

0 likes · 13 min read

Machine Heart

Mar 31, 2026 · Artificial Intelligence

Point‑VLA: Overcoming Embodied AI’s Language Bottleneck with Visual Grounding

The Point‑VLA method introduced by Qianxun AI’s Gaoyang team tackles the fundamental limits of language‑only instruction in vision‑language‑action models by adding visual grounding via bounding‑box cues, boosting real‑robot success rates from 32.4% to 92.5% across six challenging tasks.

Multimodal Learning · Point-VLA · Vision-Language-Action

0 likes · 13 min read

HyperAI Super Neural

Feb 19, 2026 · Artificial Intelligence

World Model & VLA Breakthroughs: Top Papers from NVIDIA, ByteDance, Tsinghua and Others

This roundup highlights six recent embodied AI papers that advance world models and vision‑language‑action (VLA) techniques, covering DreamDojo's massive first‑person video model, LingBot‑World simulator, Agent World Model generator, BagelVLA, ACoT‑VLA, and the closed‑loop World‑VLA‑Loop framework.

Embodied AI · Synthetic Environments · Vision-Language-Action

0 likes · 8 min read

HyperAI Super Neural

Dec 12, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Attention, Nvidia VLA, TTS, and Graph Neural Networks

This roundup presents five recent AI papers covering hierarchical sparse attention for ultra‑long context, Nvidia's Alpamayo‑R1 VLA model for autonomous driving, the non‑autoregressive F5‑TTS system, LatentMAS for latent‑space multi‑agent collaboration, and Deeper‑GXX that deepens arbitrary graph neural networks, highlighting each method's key innovations and reported performance gains.

Autonomous Driving · Multi-Agent Systems · Vision-Language-Action

0 likes · 6 min read

Data Party THU

Oct 29, 2025 · Artificial Intelligence

Can Test-Time Scaling Unlock More Reliable Vision‑Language‑Action Robots?

The paper introduces RoboMonkey, a framework that applies a generate‑and‑verify paradigm and test‑time scaling to Vision‑Language‑Action models, showing that increasing sampling and verification at inference dramatically reduces action error across multiple VLA architectures, and presents scalable verifier training, synthetic data augmentation, and efficient deployment strategies.

AI research · Action Verification · RoboMonkey

0 likes · 8 min read

Amap Tech

Oct 6, 2025 · Artificial Intelligence

Breaking VLA Training Limits: World-Env’s Virtual Sandbox for Safe, Data‑Efficient Robotics

World-Env introduces a virtual training sandbox that eliminates physical interaction, dramatically improves data efficiency with just five expert demos per task, and employs a vision‑language model as a semantic judge to dynamically terminate actions, enabling safe, high‑performing VLA post‑training across diverse robotic benchmarks.

Vision-Language-Action · World Model · data efficiency

0 likes · 9 min read

AI Cyberspace

Feb 23, 2025 · Artificial Intelligence

How Helix Empowers Humanoid Robots to See, Hear, Understand, and Act

Helix is a groundbreaking Vision‑Language‑Action model that integrates perception, language understanding, and motor control, enabling humanoid robots to perform full upper‑body continuous movements, collaborate across multiple robots, grasp any household object via natural language, and run on low‑power embedded GPUs for commercial use.

Embodied AI · Vision-Language-Action · generalist control

0 likes · 16 min read
