Xiaomi‑Robotics‑0: 20‑Hour Post‑Training Enables Seamless Earphone‑Box Assembly (Open‑Source)
The article details how Xiaomi‑Robotics‑0 achieves precise earphone‑to‑case insertion after only 20 hours of post‑training. It outlines the sub‑millimetre precision challenges, describes asynchronous execution with action prefixing, presents three techniques (adaptive loss re‑weighting, a Λ‑shape attention mask, and random masking of action prefixes) that counter the resulting “lazy effect”, and notes that the full pipeline and code are released as open source for the robotics community.
Background
Two months ago Xiaomi released the Xiaomi‑Robotics‑0 model, which reached #6 on HuggingFace’s global VLA model download leaderboard.
Accelerated evolution in 20 hours
With the pre‑trained base as the starting point, only 20 hours of task‑specific data were needed to teach the model the high‑difficulty action “storing an earphone into its case” and to have it perform continuous, smooth insertions of multiple earphones.
The task presents two core challenges:
Sub‑millimetre clearance between earphone and slot requires sub‑millimetre spatial perception.
The case surface is extremely smooth (roughness as low as Ra 0.03 µm), so the earphone tends to slip on contact; the model must quickly correct deviations to avoid assembly failure.
Triple strategy to overcome the “lazy effect”
Deployment uses Asynchronous Execution: while the robot executes the current trajectory, the model infers the next actions in parallel. Action Prefixing conditions that inference on the actions already committed for execution, giving the new action a “run‑up” so that it grows naturally from the existing one and the trajectory transitions smoothly.
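The interplay of asynchronous execution and action prefixing can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the released implementation: `toy_policy`, `CHUNK`, `LATENCY`, and the scalar action space are all hypothetical, and the "parallel" inference is simulated sequentially rather than run in a real background thread.

```python
import numpy as np

CHUNK = 8    # actions per predicted chunk (assumed value)
LATENCY = 3  # control steps that pass while one inference runs (assumed value)

def toy_policy(observation, prefix):
    """Hypothetical stand-in for the VLA model.

    Returns CHUNK scalar actions. The first len(prefix) entries repeat the
    prefix; the remaining ones move halfway toward `observation` each step.
    """
    chunk = np.empty(CHUNK)
    chunk[:len(prefix)] = prefix
    last = prefix[-1] if len(prefix) else 0.0
    for i in range(len(prefix), CHUNK):
        last += 0.5 * (observation - last)
        chunk[i] = last
    return chunk

def run_async(target, cycles):
    """Simulate asynchronous execution with action prefixing."""
    chunk = toy_policy(target, np.array([]))
    start = 0
    executed = []
    for _ in range(cycles):
        # Actions the robot will execute while the next inference is running.
        prefix = chunk[start:start + LATENCY]
        # On a real robot this call would run in the background.
        new_chunk = toy_policy(target, prefix)
        # Meanwhile the robot keeps moving along the committed prefix.
        executed.extend(prefix)
        # Switch to the new chunk, skipping the part already executed.
        chunk, start = new_chunk, LATENCY
    return np.array(executed)

trajectory = run_async(target=1.0, cycles=2)
```

Because each new chunk starts from the actions already being executed, the executed trajectory has no jump at the hand-over between chunks. The sketch requires `CHUNK >= 2 * LATENCY` so that every chunk still holds enough actions to cover the next inference.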
Action Prefixing, however, exposed a “lazy effect”: the model over‑relies on motion inertia carried by the prefix and neglects real‑time visual feedback. Three techniques balance continuity and responsiveness:
Adaptive Loss Re‑weighting: dynamically adjusts the loss weight according to the deviation between predicted and ground‑truth trajectories, forcing the model to focus on large errors.
Λ‑Shape Attention Mask: a specialized attention mechanism that keeps the model attentive to current visual signals while still referencing the end of the previous action, avoiding pure path‑dependency.
Random Masking of Action Prefixes: during training, existing action prefixes are randomly dropped out, compelling the model to rely on camera and sensor inputs rather than inertia.
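The three techniques above can be sketched in a few lines of NumPy. The article does not give formulas, so everything here is an assumption made for illustration: the weighting rule `1 + alpha * err / mean(err)`, the mask layout (all context tokens plus a short window of preceding action tokens), the per‑sample dropping of prefixes, and all function and parameter names are hypothetical and may differ from the released code.

```python
import numpy as np

def reweighted_loss(pred, target, alpha=1.0):
    """Adaptive loss re-weighting (assumed form).

    pred, target: arrays of shape (batch, dim). Samples whose prediction
    deviates more from the ground truth receive a larger weight.
    """
    err = np.abs(pred - target).mean(axis=-1)          # per-sample L1 deviation
    weights = 1.0 + alpha * err / (err.mean() + 1e-8)  # larger error, larger weight
    return float((weights * err).mean()), weights

def lambda_mask(n_ctx, n_act, window):
    """Lambda-shaped attention mask (assumed layout).

    Columns are n_ctx context tokens (vision, language) followed by n_act
    action tokens. mask[i, j] is True if action token i may attend to token j.
    """
    mask = np.zeros((n_act, n_ctx + n_act), dtype=bool)
    mask[:, :n_ctx] = True                             # every action token sees all context
    for i in range(n_act):
        lo = max(0, i - window + 1)
        mask[i, n_ctx + lo:n_ctx + i + 1] = True       # itself plus recent action tokens
    return mask

def mask_prefix(prefix, p_drop, rng):
    """Random masking of action prefixes (assumed per-sample dropping).

    prefix: array of shape (batch, k, dim). Each sample's prefix is zeroed
    with probability p_drop; `keep` tells the model which prefixes survive.
    """
    keep = rng.random(prefix.shape[0]) >= p_drop
    return prefix * keep[:, None, None], keep

rng = np.random.default_rng(0)
loss, weights = reweighted_loss(np.array([[0.0, 0.0], [1.0, 1.0]]), np.zeros((2, 2)))
mask = lambda_mask(n_ctx=2, n_act=4, window=2)
dropped, keep = mask_prefix(np.ones((3, 2, 1)), p_drop=1.0, rng=rng)
```

In this layout the late action tokens cannot attend to the early ones, so only the first few predicted actions see the tail of the prefix, while all of them see the current observation. In a real training loop the weights would typically be treated as constants (no gradient flows through them).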
Open‑source resources
Post‑training is identified as the “last mile” for deploying VLA models in real‑world robotics. The complete data‑processing pipeline, training scripts, and inference code are released publicly.
Resources:
Technical website: https://robotics.xiaomi.com
Technical report (arXiv): https://arxiv.org/abs/2602.12684
Model weights on HuggingFace: https://huggingface.co/XiaomiRobotics
Open‑source code: https://github.com/XiaomiRobotics/Xiaomi-Robotics-0
