
How Meta’s V‑JEPA 2 Is Pushing AI Toward Human‑Like Physical Understanding

Meta’s newly released V‑JEPA 2 is a video‑trained world model that can understand, predict, and plan physical actions. It enables zero‑shot robot control, outperforms existing models on benchmarks such as IntPhys 2, MVPBench, and CausalVQA, and points to future hierarchical and multimodal JEPA architectures.


Meta has open‑sourced V‑JEPA 2, a video‑based world model that aims to give AI a human‑like understanding of the physical world.

"We believe world models will usher in a new era for robotics, allowing AI agents to help with household and physical tasks without massive robot training data," says Yann LeCun, Turing Award winner and Meta chief AI scientist.

A world model should possess three core abilities:

Understanding: recognizing objects, actions, and motions in video observations.

Prediction: forecasting how the world will evolve and how it will change when an agent acts.

Planning: using predictions to generate action sequences that achieve given goals.

V‑JEPA 2 is the first world model trained on video that enables zero‑shot planning and robot control in new environments.
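To make the planning idea concrete, here is a minimal sketch of goal‑conditioned planning with a learned world model: encode the current and goal frames into latent states, roll candidate action sequences forward with an action‑conditioned predictor, and pick the sequence whose predicted outcome lands closest to the goal. The module names, dimensions, and the simple sampling‑based search below are illustrative assumptions, not the released V‑JEPA 2 code.

```python
# Sketch: goal-conditioned planning with a learned world model (illustrative only).
import torch

LATENT_DIM, ACTION_DIM, HORIZON = 64, 4, 5

# Stand-ins for a pretrained video encoder and an action-conditioned predictor.
encoder = torch.nn.Linear(3 * 32 * 32, LATENT_DIM)                 # frame -> latent
predictor = torch.nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)   # (latent, action) -> next latent

def rollout(state, actions):
    """Predict the latent state after applying a sequence of actions."""
    for a in actions:
        state = predictor(torch.cat([state, a], dim=-1))
    return state

def plan(current_frame, goal_frame, num_samples=256, iterations=3):
    """Cross-entropy-method style search over action sequences in latent space."""
    with torch.no_grad():
        s0 = encoder(current_frame.flatten())
        goal = encoder(goal_frame.flatten())
        mean = torch.zeros(HORIZON, ACTION_DIM)
        std = torch.ones(HORIZON, ACTION_DIM)
        for _ in range(iterations):
            samples = mean + std * torch.randn(num_samples, HORIZON, ACTION_DIM)
            # Score each candidate by distance between its predicted final latent and the goal latent.
            costs = torch.stack([torch.norm(rollout(s0, seq) - goal, p=1) for seq in samples])
            elite = samples[costs.topk(16, largest=False).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
        return mean[0]  # execute the first action, then re-plan (standard MPC loop)

action = plan(torch.rand(3, 32, 32), torch.rand(3, 32, 32))
print(action)
```

In practice only the first action of the best sequence is executed and planning repeats at every step, the usual model‑predictive‑control loop.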

The model is trained with a self‑supervised framework on over one million hours of internet video and images, without any language supervision.
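The joint‑embedding predictive idea behind that training can be sketched in a few lines: mask part of a clip, encode the visible part, and train a predictor to match the representation of the masked part produced by a slowly updated target encoder, rather than reconstructing pixels. Everything below, including the module shapes, the single‑vector predictor, and the EMA rate, is a simplified assumption for illustration, not the actual V‑JEPA 2 training code.

```python
# Sketch: JEPA-style self-supervised objective in latent space (illustrative only).
import copy
import torch

DIM = 64
context_encoder = torch.nn.Linear(DIM, DIM)        # encodes visible tokens
target_encoder = copy.deepcopy(context_encoder)    # EMA copy, provides latent targets
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = torch.nn.Linear(DIM, DIM)              # predicts masked-token latents
optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def training_step(video_tokens, mask):
    """video_tokens: (num_tokens, DIM) patch embeddings; mask: boolean (num_tokens,)."""
    visible, hidden = video_tokens[~mask], video_tokens[mask]
    # Targets come from the EMA encoder and are not backpropagated through.
    with torch.no_grad():
        targets = target_encoder(hidden)
    preds = predictor(context_encoder(visible)).mean(dim=0, keepdim=True)
    loss = torch.nn.functional.l1_loss(preds.expand_as(targets), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Slow EMA update of the target encoder stabilizes the latent targets.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(0.999).add_(0.001 * p_c)
    return loss.item()

tokens = torch.randn(16, DIM)
mask = torch.zeros(16, dtype=torch.bool); mask[8:] = True
print(training_step(tokens, mask))
```

Note that no text labels appear anywhere in this loop; the supervision signal comes entirely from the video itself.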

Training proceeds in two stages: an action‑free pre‑training phase, followed by an action‑conditioned training phase.
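A hedged sketch of what the second stage could look like: keep the encoder from the action‑free stage frozen and train a new predictor that takes the current latent state together with an action, so the model can later be rolled forward under candidate robot commands. The architecture and data handling below are assumptions for illustration, not Meta’s released recipe.

```python
# Sketch: action-conditioned stage on top of a frozen stage-one encoder (illustrative only).
import torch

LATENT_DIM, ACTION_DIM = 64, 4
encoder = torch.nn.Linear(3 * 32 * 32, LATENT_DIM)
for p in encoder.parameters():
    p.requires_grad_(False)   # stage-one weights stay fixed

action_predictor = torch.nn.Sequential(
    torch.nn.Linear(LATENT_DIM + ACTION_DIM, 128),
    torch.nn.GELU(),
    torch.nn.Linear(128, LATENT_DIM),
)
optimizer = torch.optim.AdamW(action_predictor.parameters(), lr=1e-4)

def step(frame_t, action_t, frame_next):
    """One update on a (frame, action, next frame) interaction triple."""
    with torch.no_grad():
        z_t = encoder(frame_t.flatten())
        z_next = encoder(frame_next.flatten())
    pred = action_predictor(torch.cat([z_t, action_t]))
    loss = torch.nn.functional.l1_loss(pred, z_next)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(step(torch.rand(3, 32, 32), torch.randn(4), torch.rand(3, 32, 32)))
```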

After training, V‑JEPA 2 achieves state‑of‑the‑art results on several downstream tasks: 77.3% top‑1 accuracy on Something‑Something v2 for action recognition, 39.7% recall@5 on Epic‑Kitchens‑100 for human action anticipation, and strong performance on video question answering when aligned with a large language model.
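For context, downstream numbers like these are typically measured by freezing the pretrained encoder and training only a lightweight probe on top of it. The sketch below shows that frozen‑encoder‑plus‑probe pattern with placeholder dimensions and class counts; it is not the actual Something‑Something v2 evaluation setup.

```python
# Sketch: scoring a frozen video encoder with a small classification probe (illustrative only).
import torch

FEATURE_DIM, NUM_CLASSES = 64, 10
encoder = torch.nn.Linear(3 * 32 * 32, FEATURE_DIM)
for p in encoder.parameters():
    p.requires_grad_(False)   # only the probe is trained

probe = torch.nn.Linear(FEATURE_DIM, NUM_CLASSES)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)

def probe_step(clip, label):
    with torch.no_grad():
        feats = encoder(clip.flatten(start_dim=1)).mean(dim=0)  # pool features over frames
    logits = probe(feats)
    loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

clip = torch.rand(8, 3, 32, 32)          # 8 placeholder frames
print(probe_step(clip, torch.tensor(3)))
```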

Meta also released three new benchmarks to evaluate physical understanding from video:

IntPhys 2 – measures the ability to distinguish physically possible from impossible scenarios (a scoring sketch follows this list).

Minimal Video Pairs (MVPBench) – multiple‑choice questions that test physical reasoning while avoiding shortcut solutions.

CausalVQA – assesses understanding of causal relationships, counterfactuals, and planning in video.
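As an illustration of how possible‑versus‑impossible benchmarks of this kind are often scored with a world model, the sketch below compares the model’s prediction error ("surprise") on the two videos of a matched pair and counts the pair correct when the impossible video is the more surprising one. This is a common evaluation pattern, not necessarily the official IntPhys 2 protocol.

```python
# Sketch: "surprise"-based scoring of possible vs. impossible video pairs (illustrative only).
import torch

DIM = 64
encoder = torch.nn.Linear(3 * 32 * 32, DIM)
predictor = torch.nn.Linear(DIM, DIM)

def surprise(frames):
    """Mean error between predicted and observed next-frame latents."""
    with torch.no_grad():
        z = encoder(frames.flatten(start_dim=1))   # (T, DIM) latents, one per frame
        pred_next = predictor(z[:-1])
        return torch.nn.functional.l1_loss(pred_next, z[1:]).item()

def pair_correct(possible_clip, impossible_clip):
    # The physically impossible clip should be harder to predict.
    return surprise(impossible_clip) > surprise(possible_clip)

possible = torch.rand(8, 3, 32, 32)
impossible = torch.rand(8, 3, 32, 32)
print(pair_correct(possible, impossible))
```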

Future directions highlighted include developing hierarchical JEPA models that operate across multiple time and space scales, and multimodal JEPA models that incorporate vision, audio, and touch for richer prediction.

Project links: GitHub https://github.com/facebookresearch/vjepa2 ; Hugging Face https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6

Tags: benchmark, Robotics, self-supervised learning, world model, video AI, V-JEPA 2
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
