
How PSI Lab’s Three Award‑Winning Papers Define a Systematic Humanoid Robot Learning Framework

The PSI Lab at USC, led by Wang Yue, won three CVPR 2026 awards for Psi‑0, PhysWorld, and Humanoid Everyday. Each paper addresses a distinct piece of humanoid robot learning: a foundation model built on large‑scale human video pre‑training and embodiment‑aligned fine‑tuning, a physics‑aware world model, and an open‑world dataset with a benchmark platform. Together they form a coherent data‑model‑prediction pipeline.

Machine Heart

Jun 9, 2026


At the University of Southern California, Wang Yue’s Physical Superintelligence Lab (PSI Lab) has become one of the fastest‑growing groups in embodied intelligence, with honors including the NVIDIA Graduate Fellowship, Toyota Young Faculty Researcher, and Powell Faculty Fellowship. At the CVPR 2026 Embodied AI Workshop, the lab won three awards: Best Paper for Psi‑0, Best Paper Runner‑up for PhysWorld, and Best Paper for Humanoid Everyday. All three papers were subsequently accepted at major conferences (RSS 2026 and ICRA 2026).

Why the Three Works Matter

The three papers address the three components that humanoid robots most lack: (1) a large, diverse data foundation, (2) a foundation model that can be transferred to a robot body, and (3) a world model that predicts the physical consequences of actions. Together they outline a systematic learning pipeline: collect realistic, varied data; train a base model on that data; then give the model the ability to predict physical outcomes.

Psi‑0: A Foundation Model for Humanoid Loco‑Manipulation

Psi‑0 (Ψ₀: An Open Foundation Model Towards Universal Humanoid Loco‑Manipulation) targets "loco‑manipulation" tasks that combine movement and manipulation, such as pushing carts, delivering objects, or opening faucets. The authors argue that training a humanoid foundation model requires staged data:

Stage 1 – Pre‑training: ~829 hours of EgoDex first‑person human video provide broad visual‑interaction priors.

Stage 2 – Post‑training: ~31 hours of Humanoid Everyday robot trajectories align the priors with the robot’s embodiment, joint limits, and control constraints.

Stage 3 – Task Adaptation: A small amount of task‑specific data fine‑tunes the model for concrete objectives.

This staged approach separates the roles of human video (scale) and robot trajectories (embodiment alignment), avoiding a naïve mixture of heterogeneous data.
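The staged recipe can be pictured as an ordered schedule in which each stage starts from the previous stage's checkpoint. The following is a minimal Python sketch, not Psi‑0's actual training code: the `Stage` class, the `run_schedule` function, and the checkpoint labels are hypothetical. Only the stage names, data sources, and approximate hours come from the description above. The source gives no figure for task adaptation, so that field is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    data_source: str
    approx_hours: Optional[float]  # None where the article gives no figure


# Stage order and hours follow the article; everything else is illustrative.
SCHEDULE = [
    Stage("pre-training", "EgoDex first-person human video", 829.0),
    Stage("post-training", "Humanoid Everyday robot trajectories", 31.0),
    Stage("task adaptation", "task-specific data", None),
]


def run_schedule(schedule: List[Stage]) -> Tuple[str, List[Tuple[str, str]]]:
    """Visit stages strictly in order, handing each one the previous checkpoint."""
    checkpoint = "random-init"  # hypothetical label for the starting weights
    log = []
    for stage in schedule:
        # Placeholder for a real training call: only the hand-off is recorded.
        log.append((stage.name, checkpoint))
        checkpoint = f"after-{stage.name}"
    return checkpoint, log


final_checkpoint, history = run_schedule(SCHEDULE)

# Ratio of human-video hours to robot-trajectory hours, from the quoted figures.
human_to_robot_ratio = SCHEDULE[0].approx_hours / SCHEDULE[1].approx_hours
print(f"human video to robot data: about {human_to_robot_ratio:.1f} to 1")
```

Using the quoted figures, human video outweighs robot data by roughly 27 to 1. This imbalance is why the two sources are given separate stages instead of being mixed into one training set.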

PhysWorld: Bringing Physical Actionability to World Models

PhysWorld (Robot Learning from a Physical World Model) shifts world‑model research from pure video prediction to "physical actionability"—the ability to convert predicted futures into executable robot trajectories. The pipeline consists of three steps:

Given an image and task instruction, generate a task‑relevant video.

Reconstruct the underlying physical world from the video, yielding an object‑centric scene representation.

Apply object‑centric residual reinforcement learning to refine the visual prediction into a robot‑executable trajectory.

Object‑centric means focusing on the pose, motion, and contact relationships of target objects rather than the whole image. Residual RL adds a physics‑level correction on top of visual guidance, ensuring the resulting motion respects robot dynamics and environmental constraints.
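One way to picture the residual idea: a base action derived from the visual prediction is adjusted by a small learned correction, and the result is kept within actuator limits. The sketch below is illustrative Python only. The names `compose_action`, `residual_scale`, and `limit`, and the additive-then-clip form, are assumptions. The source does not specify PhysWorld's action space, its policy architecture, or how the residual is bounded.

```python
from typing import List, Sequence


def compose_action(
    base_action: Sequence[float],
    residual: Sequence[float],
    residual_scale: float = 0.1,  # assumed: keeps the correction small
    limit: float = 1.0,           # assumed: symmetric per-dimension bound
) -> List[float]:
    """Return clip(base + residual_scale * residual) element-wise."""
    if len(base_action) != len(residual):
        raise ValueError("base_action and residual must have the same length")
    combined = []
    for b, r in zip(base_action, residual):
        a = b + residual_scale * r
        # Clamp to [-limit, limit] so the command stays within the assumed bound.
        combined.append(max(-limit, min(limit, a)))
    return combined


# Hypothetical numbers: a base action from visual guidance and a learned correction.
base = [0.5, -0.2, 0.95]
correction = [1.0, -1.0, 1.0]
action = compose_action(base, correction)
print(action)
```

With a zero residual the function returns the base action unchanged, which is the usual appeal of a residual formulation: the learned part only has to model the gap between visual guidance and what the physics allows.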

Humanoid Everyday: Open‑World Data and Benchmark Platform

Humanoid Everyday (A Comprehensive Robotic Dataset for Open‑World Humanoid Manipulation) provides a large‑scale, multimodal dataset and a cloud‑based evaluation platform. It contains 260 tasks across 7 task categories, with 10,300 trajectories and over 3 million frames. The recordings span RGB, depth, LiDAR, and tactile sensing and are paired with natural‑language annotations. The platform enables reproducible benchmarking of different methods under a unified control environment, addressing the long‑standing evaluation gap in humanoid robot research.
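To give the headline figures a sense of scale, the short Python calculation below derives a few averages from them. It is back-of-the-envelope arithmetic on the numbers quoted above, not statistics reported by the authors. The variable names are illustrative, the frame count is treated as a lower bound because the source says "over 3 million", and the real distribution across tasks may be uneven.

```python
# Figures quoted in the article.
TASKS = 260
CATEGORIES = 7
TRAJECTORIES = 10_300
MIN_FRAMES = 3_000_000  # "over 3 million", so this is a lower bound

# Simple averages; they say nothing about how evenly the data is spread.
trajectories_per_task = TRAJECTORIES / TASKS
tasks_per_category = TASKS / CATEGORIES
min_frames_per_trajectory = MIN_FRAMES / TRAJECTORIES

print(f"{trajectories_per_task:.1f} trajectories per task on average")
print(f"{tasks_per_category:.1f} tasks per category on average")
print(f"at least {min_frames_per_trajectory:.1f} frames per trajectory on average")
```

On these figures, each task has roughly 40 trajectories on average, and each trajectory averages at least about 290 frames.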

Synthesis and Two Key Judgments

The three works occupy different positions in the data‑model‑world chain: Humanoid Everyday supplies the data foundation, Psi‑0 builds the robot‑native foundation model, and PhysWorld ensures that the model’s predictions are physically actionable. The authors draw two conclusions:

Humanoid robots need a robotics‑native foundation model that respects embodiment, rather than a direct port of language‑centric or vision‑centric paradigms.

The most important metric for world models in robotics is physical actionability, not visual fidelity.

Collectively, the papers suggest that progress in humanoid robotics will stem more from integrating data infrastructure, robot‑specific modeling, and physics‑aware prediction than from merely scaling end‑to‑end models.


