Can a Pre‑trained Embodied Model Work Out‑of‑the‑Box? New Chinese Open‑Source VLA Model Shows Yes

The newly open‑sourced Wall‑OSS‑0.5 VLA model demonstrates that a large‑scale pre‑trained embodied robot brain can achieve strong zero‑shot performance on 17 real‑world tasks, exhibit staircase emergence with longer pre‑training, and far surpass the industry baseline after fine‑tuning, while also revealing current precision limits.

Machine Heart

May 28, 2026

At the beginning of 2026, the Chinese embodied‑intelligence community saw a wave of open‑source releases: many teams published their vision‑language‑action (VLA) models, datasets, and training frameworks, and competition shifted to benchmark scores, task success rates, and cross‑task generalisation.

Most VLA evaluations are performed after task‑specific fine‑tuning, which raises a fundamental question: are we training a universal robot brain or merely a set of task‑specific scripts? The X Square Robot team addressed this question by releasing Wall‑OSS‑0.5, a VLA model trained on more than 20 robot morphologies, over one million trajectory demonstrations and roughly 90 million multimodal tokens, and then testing it directly on a real robot without any fine‑tuning.

Wall‑OSS‑0.5 was evaluated zero‑shot on 17 tasks covering semantic understanding, rigid‑object manipulation, flexible‑object manipulation, precise manipulation and long‑horizon multi‑step control. The 400 k‑step checkpoint scored above 80 out of 100 on four tasks: Block Sorting (100), Fruit Sorting (96), Ring Stacking (86) and Rope Tightening (82). Rope Tightening is a previously unseen flexible‑object task that requires coordinated bimanual control.

Training progress shows a “staircase emergence” pattern: as pre‑training increases from 50 k to 400 k steps, the average score on seen tasks rises from 26.1 to 50.0 and the average on unseen tasks rises from 24.2 to 53.6, which suggests genuine transfer rather than memorisation. However, tasks that demand high precision, such as towel folding (10), table setting (9) and charger insertion (9), remain far below the passing threshold.

When fine‑tuned on 15 real‑robot tasks, Wall‑OSS‑0.5 outperforms the industry baseline π0.5 by a large margin. With the same fine‑tuning data budget it reaches an average task progress of 60.5 versus 43.0 (a 17.5‑point lead), and it exceeds π0.5 by over 30 points on a subset of 10 core operations. In the RoboCasa kitchen simulation, Wall‑OSS‑0.5 attains a 39.6 % success rate on the insertion task, compared with 4.0 % for π0.5. On the LIBERO single‑arm benchmark it achieves 97.5 % average success after only 20 k fine‑tuning steps, which saves roughly one‑third of the compute that π0.5 requires.
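To put the quoted gaps on a common footing, the short sketch below recomputes the absolute lead and the ratio for the two head-to-head figures given above. The dictionary keys and the helper name `compare` are illustrative and do not come from the release.

```python
# Figures are the ones quoted in the article; all names here are illustrative.
results = {
    "real_robot_avg_progress": (60.5, 43.0),     # (Wall-OSS-0.5, pi0.5)
    "robocasa_insertion_success": (39.6, 4.0),   # percent success
}


def compare(ours: float, baseline: float) -> tuple[float, float]:
    """Return (absolute lead in points, ratio ours / baseline)."""
    return round(ours - baseline, 1), round(ours / baseline, 2)


for name, (ours, baseline) in results.items():
    lead, ratio = compare(ours, baseline)
    print(f"{name}: +{lead} points, {ratio}x the baseline")
```

Seen this way, the real-robot lead is about 1.4 times the baseline, while the RoboCasa insertion gap is close to a tenfold difference, so the two comparisons are not of the same magnitude even though both favour Wall-OSS-0.5.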

Robustness tests on the RoboTwin platform (50 bimanual tasks with random lighting and background disturbances) show Wall‑OSS‑0.5 maintains an 80.9 % success rate, demonstrating strong out‑of‑distribution generalisation.

The key to these results is a set of four design innovations. First, a “gradient bridge” discretises actions into special tokens, concatenates them with text tokens, and trains the whole sequence with cross‑entropy loss, forcing the backbone to learn a unified “see‑speak‑act” representation. Ablations reveal that removing the bridge causes a dramatic drop in real‑robot success.
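The idea of discretising actions, appending them to the text tokens and training the joint sequence with cross-entropy can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: uniform binning stands in for the model's actual tokeniser, the vocabulary size and bin count are made up, and a uniform-logit stand-in replaces the network.

```python
import math

TEXT_VOCAB = 1000   # hypothetical text vocabulary size
NUM_BINS = 256      # hypothetical number of action bins


def action_to_tokens(action, low=-1.0, high=1.0):
    """Uniformly bin each action dimension, then shift past the text vocabulary."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                      # clip to the valid range
        b = int((a - low) / (high - low) * NUM_BINS)    # bin index
        tokens.append(TEXT_VOCAB + min(b, NUM_BINS - 1))
    return tokens


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sequence_loss(logit_rows, targets):
    """Mean cross-entropy over every position, text and action alike."""
    total = sum(cross_entropy(row, t) for row, t in zip(logit_rows, targets))
    return total / len(targets)


text_tokens = [12, 407, 88]                               # made-up instruction tokens
action_tokens = action_to_tokens([-1.0, 0.0, 0.5, 1.0])   # one 4-D action
sequence = text_tokens + action_tokens                    # single joint sequence

# Stand-in for model output: uniform logits over the joint vocabulary.
vocab = TEXT_VOCAB + NUM_BINS
uniform_logits = [[0.0] * vocab for _ in sequence]
loss = sequence_loss(uniform_logits, sequence)
```

Because text and action tokens share one vocabulary and one loss, the gradient from the action positions flows through the same backbone parameters as the gradient from the text positions, which is the property the "bridge" is described as providing.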

Second, a visual‑aligned residual‑vector‑quantiser tokeniser encodes actions while simultaneously aligning each token with the corresponding visual frame and predicting the next visual change, giving each token both motor and visual semantics.
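For readers unfamiliar with residual vector quantisation, the sketch below shows only the quantisation step: each stage picks the nearest codebook entry and passes the remaining error to the next stage. The codebooks and the two-dimensional action are toy values chosen for illustration, and the visual-alignment and next-frame prediction objectives described above are not modelled here.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantise vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entries from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse stage followed by a finer correction stage.
codebooks = [
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],
    [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]],
]
action = [0.8, -1.2]
codes = rvq_encode(action, codebooks)   # one token index per stage
recon = rvq_decode(codes, codebooks)    # coarse entry plus fine correction
```

Each action is therefore represented by a short tuple of discrete indices, one per stage, which is what makes it usable as a token sequence.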

Third, action‑space supervision replaces the conventional velocity‑prediction loss with a loss on the reconstructed final trajectory, focusing optimisation on the low‑frequency structure that determines task success rather than on high‑frequency noise.
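The article does not spell out the loss, so the following is only one plausible reading, assuming a flow-matching action head with linear interpolation between noise and the target action chunk. Under that assumption, the predicted velocity (the "speed" being predicted) can be converted into an estimate of the final trajectory, and the loss is taken on that estimate instead. All function names and numbers are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def interpolate(noise, action, t):
    """Assumed noising path: x_t = (1 - t) * noise + t * action."""
    return [(1 - t) * n + t * a for n, a in zip(noise, action)]


def velocity_loss(v_pred, noise, action):
    """Conventional target under this path: v = action - noise."""
    v_target = [a - n for n, a in zip(noise, action)]
    return mse(v_pred, v_target)


def trajectory_loss(v_pred, x_t, action, t):
    """Loss on the reconstructed trajectory x1_hat = x_t + (1 - t) * v_pred."""
    x1_hat = [x + (1 - t) * v for x, v in zip(x_t, v_pred)]
    return mse(x1_hat, action)


noise = [0.3, -0.5, 0.1, 0.9]
action = [0.0, 0.2, 0.4, 0.6]       # a 4-step, 1-D action chunk (made up)
t = 0.75
x_t = interpolate(noise, action, t)
v_pred = [-0.2, 0.6, 0.4, -0.1]     # made-up network output

l_vel = velocity_loss(v_pred, noise, action)
l_traj = trajectory_loss(v_pred, x_t, action, t)
```

In this simplified setting the two losses differ only by a factor of (1 - t)^2, so supervising the reconstruction amounts to a time-dependent reweighting of the velocity loss. The formulation used in Wall-OSS-0.5 may differ, and the low-frequency emphasis described above may come from details this sketch does not capture.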

Fourth, the DMuon (distributed Muon) optimiser addresses the heterogeneous gradient scales between the large‑scale VLM backbone and the freshly‑initialised action head. By applying Newton‑Schulz orthogonalisation and a custom LPT‑based ownership scheduler, DMuon reduces the extra overhead of Muon from 2× to 0.02×, making the complex training pipeline feasible on large clusters.
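The two ingredients named above can be illustrated in isolation. The sketch below uses the textbook cubic Newton-Schulz iteration to push a matrix towards its nearest orthogonal factor (Muon implementations typically use a tuned higher-order polynomial), and a longest-processing-time-first (LPT) greedy assignment, assuming that "ownership" means deciding which worker orthogonalises which parameter matrix. The layer names and costs are hypothetical, and nothing here is taken from the DMuon code.

```python
import heapq


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz(g, steps=20):
    """Cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]   # scale so singular values <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def lpt_assign(costs, num_workers):
    """LPT greedy: hand each job, largest first, to the least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    owner = {}
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        owner[name] = worker
        heapq.heappush(heap, (load + cost, worker))
    return owner


# Orthogonalise a toy 2x2 gradient matrix.
grad = [[3.0, 1.0], [1.0, 2.0]]
ortho = newton_schulz(grad)
gram = matmul(ortho, transpose(ortho))   # should be close to the identity

# Balance hypothetical per-matrix orthogonalisation costs across two workers.
costs = {"attn.qkv": 9, "mlp.up": 7, "mlp.down": 6, "attn.out": 4, "action_head": 2}
owner = lpt_assign(costs, num_workers=2)
loads = [sum(c for n, c in costs.items() if owner[n] == w) for w in range(2)]
```

Assigning whole matrices to single owners means each orthogonalisation runs once rather than on every worker, and sorting by cost before assigning keeps the slowest worker's load close to the average.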

Together, these components enable the backbone to truly experience actions during pre‑training, rather than merely observing them, and to retain its visual‑language capabilities. The open‑source release provides model weights, training code, the full set of ablation experiments, and the DMuon optimiser, offering a reproducible baseline for future embodied‑AI research.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
