Xiaomi‑Robotics‑0: 20‑Hour Post‑Training Enables Seamless Earphone‑Box Assembly (Open‑Source)

The article describes how Xiaomi‑Robotics‑0 learns precise earphone‑to‑case insertion after post‑training on only 20 hours of task‑specific data. It outlines the sub‑millimetre precision challenges, explains how asynchronous execution with action prefixing gives rise to a "lazy effect", presents three techniques that counter it (adaptive loss re‑weighting, a Λ‑shape attention mask, and random masking of action prefixes), and notes that the full pipeline and code are released as open source for the robotics community.

Xiaomi Tech

Background

Two months ago Xiaomi released the Xiaomi‑Robotics‑0 model, which reached #6 on HuggingFace’s global VLA model download leaderboard.

Accelerated evolution in 20 hours

Starting from the pre‑trained base model, the team used only 20 hours of task‑specific data to teach it the difficult action of “storing an earphone in its case” and to perform continuous, smooth insertions of multiple earphones.

The task presents two core challenges:

Sub‑millimetre clearance between earphone and slot requires sub‑millimetre spatial perception.

The case surface is very smooth (roughness as low as Ra 0.03 µm), so the earphone shifts on contact; the model must quickly correct these deviations to avoid assembly failure.

Triple strategy to overcome the “lazy effect”

Deployment uses Asynchronous Execution: while the robot executes the current trajectory, the model infers the next step in parallel. Action Prefixing conditions each new prediction on the actions already under way, providing a “run‑up” that lets the new action grow naturally from the existing one and ensures smooth trajectory transitions.
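The overlap between execution and inference can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `dummy_policy`, `run_async`, `CHUNK`, `PREFIX`, and `STEP` are hypothetical names, the "robot" is a single number, and the stand‑in policy simply continues from the last prefix action. The sketch assumes the prefix is the set of actions the robot will execute while the next inference is still running.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8    # actions returned per inference call (assumed value)
PREFIX = 3   # actions executed while the next inference runs (assumed value)
STEP = 0.1   # per-action increment of the toy policy (assumed value)


def dummy_policy(observation, prefix):
    """Stand-in for the model: returns CHUNK actions that continue the prefix.

    With an empty prefix it continues from the observation instead.
    """
    start = prefix[-1] if prefix else observation
    return [start + STEP * (i + 1) for i in range(CHUNK)]


def run_async(policy, num_cycles):
    """Execute PREFIX actions per cycle while the next chunk is inferred."""
    state = {"pos": 0.0}   # toy 1-D robot state
    executed = []

    def observe():
        return state["pos"]

    def execute(action):
        state["pos"] = action
        executed.append(action)

    chunk = policy(observe(), [])  # first call: nothing to prefix yet
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(num_cycles):
            prefix = chunk[:PREFIX]  # already committed to execution
            # Inference starts from the observation taken *before* the
            # prefix is executed, and runs in a background thread.
            future = pool.submit(policy, observe(), prefix)
            for action in prefix:    # the robot keeps moving meanwhile
                execute(action)
            chunk = future.result()  # continuation that follows the prefix
    return executed


trajectory = run_async(dummy_policy, num_cycles=4)
```

Because each new chunk starts where its prefix ends, the executed trajectory has no jump at chunk boundaries, even though every inference used an observation captured before the prefix was carried out.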

Action Prefixing revealed a “lazy effect” where the model over‑relies on motion inertia and neglects real‑time visual feedback. Three techniques balance continuity and responsiveness:

Adaptive Loss Re‑weighting: dynamically adjusts loss weight according to deviation between predicted and ground‑truth trajectories, forcing the model to focus on large errors.

Λ‑Shape Attention Mask: a specialized attention mechanism that keeps the model attentive to current visual signals while still referencing the end of the previous action, avoiding pure path‑dependency.

Random Masking of Action Prefixes: during training, existing action prefixes are randomly dropped out, compelling the model to rely on camera and sensor inputs rather than inertia.
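The three techniques above can be illustrated with a small pure‑Python sketch. Everything here is an assumption made for illustration, since the article does not give formulas: the weights are taken to be proportional to each sample's error and normalised to mean 1; the Λ shape is read as "every token sees the observation tokens, plus a short causal window of neighbouring action tokens", with a token layout of `[observation | prefix | new actions]`; and prefix masking replaces the prefix with zeros and a visibility flag. All function names and parameters are hypothetical and are not taken from the released code.

```python
import random


def adaptive_loss_weights(errors, eps=1e-8):
    """Weights proportional to each sample's error, normalised to mean ~1."""
    mean_err = sum(errors) / len(errors)
    return [e / (mean_err + eps) for e in errors]


def weighted_loss(pred, target):
    """Mean squared error per trajectory, re-weighted toward large errors."""
    errors = [
        sum((p - t) ** 2 for p, t in zip(ps, ts)) / len(ps)
        for ps, ts in zip(pred, target)
    ]
    # In a real framework the weights would be treated as constants
    # (no gradient flows through them).
    weights = adaptive_loss_weights(errors)
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)


def lambda_mask(n_obs, n_prefix, n_action, window):
    """mask[q][k] is True when query token q may attend to key token k.

    Token layout (assumed): [observation | prefix | new actions].
    """
    n = n_obs + n_prefix + n_action
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_obs:
                # Left arm of the Λ: observation tokens are always visible.
                mask[q][k] = True
            elif q >= n_obs and 0 <= q - k < window:
                # Right arm: a short causal window over action tokens, so
                # only the first new actions see the end of the prefix.
                mask[q][k] = True
    return mask


def mask_prefix(prefix, drop_prob, rng):
    """With probability drop_prob, hide the prefix during training."""
    if rng.random() < drop_prob:
        return [0.0] * len(prefix), False   # prefix hidden
    return list(prefix), True               # prefix visible


rng = random.Random(0)
mask = lambda_mask(n_obs=2, n_prefix=3, n_action=4, window=2)
prefix, visible = mask_prefix([0.1, 0.2, 0.3], drop_prob=0.5, rng=rng)
```

Under this reading, the window size controls how far the influence of the prefix reaches into the new actions, and the drop probability controls how often the model must predict from observations alone.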

Open‑source resources

Xiaomi frames post‑training as the “last mile” for deploying VLA models in real‑world robotics, and has publicly released the complete data‑processing pipeline, training scripts, and inference code.

Resources:

Technical website: https://robotics.xiaomi.com

Technical report (arXiv): https://arxiv.org/abs/2602.12684

Model weights on HuggingFace: https://huggingface.co/XiaomiRobotics

Open‑source code: https://github.com/XiaomiRobotics/Xiaomi-Robotics-0

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
