
How Meta’s V‑JEPA 2 Is Pushing AI Toward Human‑Like Physical Understanding

Meta’s newly released V‑JEPA 2 is a video‑trained world model that can understand, predict, and plan physical actions. It enables zero‑shot robot control, outperforms existing models on benchmarks such as IntPhys 2, MVPBench, and CausalVQA, and points to future hierarchical and multimodal JEPA architectures.


Meta has open‑sourced V‑JEPA 2, a video‑based world model that aims to give AI a human‑like understanding of the physical world.

"We believe world models will usher in a new era for robotics, allowing AI agents to help with household and physical tasks without massive robot training data," says Yann LeCun, Turing Award winner and Meta chief AI scientist.

A world model should possess three core abilities:

Understanding: recognizing objects, actions, and motions in video observations.

Prediction: forecasting how the world will evolve and how it will change when an agent acts.

Planning: using predictions to generate action sequences that achieve given goals.

V‑JEPA 2 is the first world model trained on video that enables zero‑shot planning and robot control in new environments.
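To make the planning idea concrete, here is a minimal sketch of goal‑conditioned planning with a learned world model: encode the current and goal frames into latent states, roll candidate action sequences forward with an action‑conditioned predictor, and pick the sequence whose predicted outcome lands closest to the goal. The module names, dimensions, and the simple sampling‑based search below are illustrative assumptions, not the released V‑JEPA 2 code.

```python
# Sketch: goal-conditioned planning with a learned world model (illustrative only).
import torch

LATENT_DIM, ACTION_DIM, HORIZON = 64, 4, 5

# Stand-ins for a pretrained video encoder and an action-conditioned predictor.
encoder = torch.nn.Linear(3 * 32 * 32, LATENT_DIM)                 # frame -> latent
predictor = torch.nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)   # (latent, action) -> next latent

def rollout(state, actions):
    """Predict the latent state after applying a sequence of actions."""
    for a in actions:
        state = predictor(torch.cat([state, a], dim=-1))
    return state

def plan(current_frame, goal_frame, num_samples=256, iterations=3):
    """Cross-entropy-method style search over action sequences in latent space."""
    with torch.no_grad():
        s0 = encoder(current_frame.flatten())
        goal = encoder(goal_frame.flatten())
        mean = torch.zeros(HORIZON, ACTION_DIM)
        std = torch.ones(HORIZON, ACTION_DIM)
        for _ in range(iterations):
            samples = mean + std * torch.randn(num_samples, HORIZON, ACTION_DIM)
            # Score each candidate by distance between its predicted final latent and the goal latent.
            costs = torch.stack([torch.norm(rollout(s0, seq) - goal, p=1) for seq in samples])
            elite = samples[costs.topk(16, largest=False).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
        return mean[0]  # execute the first action, then re-plan (standard MPC loop)

action = plan(torch.rand(3, 32, 32), torch.rand(3, 32, 32))
print(action)
```

In practice only the first action of the best sequence is executed and planning repeats at every step, the usual model‑predictive‑control loop.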

The model is trained with a self‑supervised framework on over one million hours of internet video and images, without any language supervision.
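The joint‑embedding predictive idea behind that training can be sketched in a few lines: mask part of a clip, encode the visible part, and train a predictor to match the representation of the masked part produced by a slowly updated target encoder, rather than reconstructing pixels. Everything below, including the module shapes, the single‑vector predictor, and the EMA rate, is a simplified assumption for illustration, not the actual V‑JEPA 2 training code.

```python
# Sketch: JEPA-style self-supervised objective in latent space (illustrative only).
import copy
import torch

DIM = 64
context_encoder = torch.nn.Linear(DIM, DIM)        # encodes visible tokens
target_encoder = copy.deepcopy(context_encoder)    # EMA copy, provides latent targets
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = torch.nn.Linear(DIM, DIM)              # predicts masked-token latents
optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def training_step(video_tokens, mask):
    """video_tokens: (num_tokens, DIM) patch embeddings; mask: boolean (num_tokens,)."""
    visible, hidden = video_tokens[~mask], video_tokens[mask]
    # Targets come from the EMA encoder and are not backpropagated through.
    with torch.no_grad():
        targets = target_encoder(hidden)
    preds = predictor(context_encoder(visible)).mean(dim=0, keepdim=True)
    loss = torch.nn.functional.l1_loss(preds.expand_as(targets), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Slow EMA update of the target encoder stabilizes the latent targets.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(0.999).add_(0.001 * p_c)
    return loss.item()

tokens = torch.randn(16, DIM)
mask = torch.zeros(16, dtype=torch.bool); mask[8:] = True
print(training_step(tokens, mask))
```

Note that no text labels appear anywhere in this loop; the supervision signal comes entirely from the video itself.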

Training proceeds in two stages: an action‑free pre‑training phase, followed by an action‑conditioned training phase.
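A hedged sketch of what the second stage could look like: keep the encoder from the action‑free stage frozen and train a new predictor that takes the current latent state together with an action, so the model can later be rolled forward under candidate robot commands. The architecture and data handling below are assumptions for illustration, not Meta’s released recipe.

```python
# Sketch: action-conditioned stage on top of a frozen stage-one encoder (illustrative only).
import torch

LATENT_DIM, ACTION_DIM = 64, 4
encoder = torch.nn.Linear(3 * 32 * 32, LATENT_DIM)
for p in encoder.parameters():
    p.requires_grad_(False)   # stage-one weights stay fixed

action_predictor = torch.nn.Sequential(
    torch.nn.Linear(LATENT_DIM + ACTION_DIM, 128),
    torch.nn.GELU(),
    torch.nn.Linear(128, LATENT_DIM),
)
optimizer = torch.optim.AdamW(action_predictor.parameters(), lr=1e-4)

def step(frame_t, action_t, frame_next):
    """One update on a (frame, action, next frame) interaction triple."""
    with torch.no_grad():
        z_t = encoder(frame_t.flatten())
        z_next = encoder(frame_next.flatten())
    pred = action_predictor(torch.cat([z_t, action_t]))
    loss = torch.nn.functional.l1_loss(pred, z_next)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(step(torch.rand(3, 32, 32), torch.randn(4), torch.rand(3, 32, 32)))
```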

After training, V‑JEPA 2 achieves state‑of‑the‑art results on several downstream tasks: 77.3% top‑1 accuracy on Something‑Something v2 for action recognition, 39.7% recall@5 on Epic‑Kitchens‑100 for human action anticipation, and strong performance on video question answering when aligned with a large language model.
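For context, downstream numbers like these are typically measured by freezing the pretrained encoder and training only a lightweight probe on top of it. The sketch below shows that frozen‑encoder‑plus‑probe pattern with placeholder dimensions and class counts; it is not the actual Something‑Something v2 evaluation setup.

```python
# Sketch: scoring a frozen video encoder with a small classification probe (illustrative only).
import torch

FEATURE_DIM, NUM_CLASSES = 64, 10
encoder = torch.nn.Linear(3 * 32 * 32, FEATURE_DIM)
for p in encoder.parameters():
    p.requires_grad_(False)   # only the probe is trained

probe = torch.nn.Linear(FEATURE_DIM, NUM_CLASSES)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)

def probe_step(clip, label):
    with torch.no_grad():
        feats = encoder(clip.flatten(start_dim=1)).mean(dim=0)  # pool features over frames
    logits = probe(feats)
    loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

clip = torch.rand(8, 3, 32, 32)          # 8 placeholder frames
print(probe_step(clip, torch.tensor(3)))
```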

Meta also released three new benchmarks to evaluate physical understanding from video:

IntPhys 2 – measures the ability to distinguish physically possible from impossible scenarios (a scoring sketch follows this list).

Minimal Video Pairs (MVPBench) – multiple‑choice questions that test physical reasoning while avoiding shortcut solutions.

CausalVQA – assesses understanding of causal relationships, counterfactuals, and planning in video.
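As an illustration of how possible‑versus‑impossible benchmarks of this kind are often scored with a world model, the sketch below compares the model’s prediction error ("surprise") on the two videos of a matched pair and counts the pair correct when the impossible video is the more surprising one. This is a common evaluation pattern, not necessarily the official IntPhys 2 protocol.

```python
# Sketch: "surprise"-based scoring of possible vs. impossible video pairs (illustrative only).
import torch

DIM = 64
encoder = torch.nn.Linear(3 * 32 * 32, DIM)
predictor = torch.nn.Linear(DIM, DIM)

def surprise(frames):
    """Mean error between predicted and observed next-frame latents."""
    with torch.no_grad():
        z = encoder(frames.flatten(start_dim=1))   # (T, DIM) latents, one per frame
        pred_next = predictor(z[:-1])
        return torch.nn.functional.l1_loss(pred_next, z[1:]).item()

def pair_correct(possible_clip, impossible_clip):
    # The physically impossible clip should be harder to predict.
    return surprise(impossible_clip) > surprise(possible_clip)

possible = torch.rand(8, 3, 32, 32)
impossible = torch.rand(8, 3, 32, 32)
print(pair_correct(possible, impossible))
```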

Future directions highlighted include developing hierarchical JEPA models that operate across multiple time and space scales, and multimodal JEPA models that incorporate vision, audio, and touch for richer prediction.

Project links: GitHub https://github.com/facebookresearch/vjepa2 ; Hugging Face https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6

Tags: benchmark, Robotics, self-supervised learning, world model, video AI, V-JEPA 2
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
