World Models: AI that predicts the physical world
TL;DR
- World Models predict physical states of the world, not the next word (as LLMs do)
- V-JEPA learns from video: masks parts and predicts abstract representations, not pixels
- Results: state of the art in action anticipation, deployed on real robots
- LeCun’s bet: combine LLMs (language) with World Models (physics) for AGI
The problem with LLMs
Current language models (GPT, Claude, Llama) do one thing really well: predict the next word. World Models take a different path, and they are one of the AI trends for 2026 I'm tracking most closely.
Input: "The sky is..."
Output: "blue" (high probability)
They work. They impress. But they have a fundamental problem: they don’t understand the physical world.
If you ask an LLM “what happens if I drop a ball?”, it knows it “falls” because it’s read millions of texts saying so. Not because it understands gravity.
Yann LeCun puts it this way:
“A house cat has more common sense than GPT-4.”
A cat knows that if it pushes a glass, it falls. Not because it read about physics. Because it’s seen things fall.
What is a World Model
A World Model is an AI system that builds an internal representation of how the physical world works.
Instead of predicting words, it predicts states of the world:
| LLM | World Model |
|---|---|
| "What word comes next?" | "What happens next in this video?" |
| Learns from text | Learns from video/images |
| Predicts tokens | Predicts physical states |
| Understands language | Understands causality |
The idea isn’t new. In 2018, Ha and Schmidhuber published “World Models,” where an AI learned to play video games by building an internal model of the game.
What’s new is applying it at scale with real-world video.
V-JEPA: Meta’s World Model
V-JEPA (Video Joint Embedding Predictive Architecture) is the World Model LeCun's team developed at Meta before he left.
How it works
1. Takes video as input
Not text. Real video of the physical world: people walking, objects falling, hands manipulating things.
2. Divides video into “patches”
Like a transformer divides text into tokens, V-JEPA divides frames into spatiotemporal patches called “tubelets.”
3. Masks parts of the video
Literally blocks out regions of the video. “You can’t see what happens here.”
4. Predicts the hidden parts
But it does NOT predict exact pixels. It predicts an abstract representation of what should be there.
Video: [person raises arm] [███████] [arm up]
                               ↑
                      What should be here?
V-JEPA predicts: "upward arm movement"
(not the exact pixels, but the concept)
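To make steps 2 and 3 concrete, here's a minimal sketch of how a clip can be cut into tubelets and partially masked. The shapes, patch size and masking ratio are illustrative choices, not V-JEPA's exact configuration, and the prediction in step 4 happens on encoder outputs, not on these raw patches.

import numpy as np

T, H, W, C = 16, 224, 224, 3            # frames, height, width, channels (illustrative)
t, p = 2, 16                            # tubelet size: 2 frames x 16x16 pixels

video = np.random.rand(T, H, W, C)      # stand-in for a real clip

# Step 2: split the clip into non-overlapping spatio-temporal patches ("tubelets")
tubelets = video.reshape(T // t, t, H // p, p, W // p, p, C)
tubelets = tubelets.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * p * p * C)
print(tubelets.shape)                   # (1568, 1536): 1568 tubelets of 1536 values each

# Step 3: hide a large fraction of them; the model only sees the rest (the context)
hidden = np.random.rand(len(tubelets)) < 0.75
context_patches = tubelets[~hidden]     # what the encoder gets to see
target_patches = tubelets[hidden]       # what must be predicted, in representation
                                        # space (step 4), not as pixels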
Why representations, not pixels
Here’s the key trick.
Predicting pixels is useless:
- The world has unpredictable details (leaves moving, reflections, noise)
- Forcing the model to predict those details wastes capacity
- Results in blurry outputs that aren’t useful
Predicting abstract representations:
- The model learns structure and causality
- Ignores irrelevant details
- Captures “what’s happening,” not “exactly how it looks”
It’s like the difference between:
- “The ball descended 2.3 meters in 0.7 seconds” (pixels)
- “The ball fell” (abstract representation)
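A toy way to see the difference: compute the loss once against the raw pixels of the hidden region and once against an encoder's embedding of it. Everything below, including the random-weight "encoder", is a stand-in for illustration, not how V-JEPA is actually built.

import numpy as np

rng = np.random.default_rng(0)

masked_pixels = rng.random(1536)           # ground-truth pixels of the hidden region
W_enc = rng.random((1536, 64))             # hypothetical frozen encoder weights

def encode(x):
    return np.tanh(x @ W_enc)              # 64-dim abstract representation

# Option A: predict pixels -> the loss punishes every leaf, reflection and noise pixel
pred_pixels = rng.random(1536)
loss_pixels = np.mean((pred_pixels - masked_pixels) ** 2)

# Option B (the JEPA choice): predict the representation -> only abstract content counts
pred_repr = rng.random(64)
loss_repr = np.mean((pred_repr - encode(masked_pixels)) ** 2)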
JEPA architecture
JEPA = Joint Embedding Predictive Architecture
                  Video input
                 /           \
       visible parts     hidden parts
             │                 │
             ▼                 ▼
        ┌─────────┐       ┌─────────┐
        │ Encoder │       │ Encoder │   (same encoder)
        └────┬────┘       └────┬────┘
             │                 │
             ▼                 ▼
          Context            Target
        embeddings        embeddings
             │                 ▲
             ▼                 │
       ┌───────────┐           │
       │ Predictor │───────────┘
       └───────────┘

     Goal: the prediction matches the target embeddings
Encoder: converts the video into abstract representations.
Predictor: given the context embeddings, predicts the representations of the hidden parts.
Goal: the prediction matches the actual representation of what was hidden.
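To make that loop concrete, here is a minimal sketch of one JEPA-style training step, assuming PyTorch. The tiny linear layers standing in for the encoder and predictor, and the exponential-moving-average update of the target branch (the usual way JEPA-style models keep the target encoder from being trained directly by gradients), are drastic simplifications, not Meta's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim_patch, dim_emb = 1536, 256

encoder = nn.Linear(dim_patch, dim_emb)          # context encoder (trained by gradients)
target_encoder = nn.Linear(dim_patch, dim_emb)   # target branch (updated only by EMA)
target_encoder.load_state_dict(encoder.state_dict())
for param in target_encoder.parameters():
    param.requires_grad_(False)
predictor = nn.Linear(dim_emb, dim_emb)

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

context_patches = torch.randn(32, dim_patch)     # visible tubelets (stand-ins)
masked_patches = torch.randn(32, dim_patch)      # hidden tubelets (stand-ins)

# Forward pass: embed the context, then predict the embeddings of the hidden parts
prediction = predictor(encoder(context_patches))
with torch.no_grad():                            # targets receive no gradient
    target = target_encoder(masked_patches)

loss = F.mse_loss(prediction, target)            # match representations, not pixels
opt.zero_grad()
loss.backward()
opt.step()

# The target encoder slowly follows the context encoder (exponential moving average)
momentum = 0.99
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)

The line that matters is the loss: the prediction is compared with embeddings, never with pixels.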
V-JEPA 2: results
Meta published V-JEPA 2 in 2025. The results:
Training:
- 1 million hours of internet video
- 1 million images
- No human labels (self-supervised)
Benchmarks:
- 77.3% on Something-Something v2 (understanding actions)
- 39.7% on Epic-Kitchens-100 (anticipating actions) - state of the art
- 84.0% on PerceptionTest (video QA) - state of the art at the 8B-parameter scale
Robotics:
- Trained with only 62 hours of robot video
- Deployed on real robotic arms
- Capable of pick-and-place tasks without task-specific training
The model never saw those robots or those objects. But it understands physics enough to plan actions.
Why it matters for robotics
Current robots are programmed with explicit rules:
if object_detected:          # hand-coded rule: fires only in the exact scripted case
    move_arm(x, y, z)        # pre-programmed coordinates
    close_gripper()
    lift()
This is fragile. Any variation breaks the system.
With World Models, a robot can:
- See the situation
- Imagine what happens with different actions
- Choose the action that leads to the desired state
No explicit rules needed. It understands cause and effect.
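A sketch of what that loop could look like, in the style of random-shooting model-predictive control: sample candidate action sequences, roll each one forward inside the learned model, and execute the first action of the sequence whose imagined end state lands closest to the goal. The toy encode and world_model functions below are random stand-ins for a trained system like V-JEPA 2, not Meta's planner.

import numpy as np

rng = np.random.default_rng(0)
dim_obs, dim_state, dim_action = 128, 64, 4

W_enc = rng.standard_normal((dim_obs, dim_state)) * 0.1     # toy "encoder" weights
W_act = rng.standard_normal((dim_action, dim_state)) * 0.1  # toy dynamics weights

def encode(observation):
    # Abstract state of a camera observation (stand-in for a trained video encoder)
    return np.tanh(observation @ W_enc)

def world_model(state, action):
    # Predicts the next abstract state given the current one and a candidate action
    return np.tanh(state + action @ W_act)

current = encode(rng.random(dim_obs))    # 1. see the situation
goal = encode(rng.random(dim_obs))       # desired end state (e.g. object placed)

best_plan, best_dist = None, np.inf
for _ in range(256):                     # 2. imagine many candidate plans
    plan = rng.uniform(-1.0, 1.0, (5, dim_action))
    state = current
    for action in plan:
        state = world_model(state, action)   # roll the plan forward in imagination
    dist = np.linalg.norm(state - goal)
    if dist < best_dist:                 # 3. keep the plan that ends closest to the goal
        best_plan, best_dist = plan, dist

next_action = best_plan[0]               # execute, observe, and re-plan at the next step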
Limitations (for now)
V-JEPA 2 works well for:
- Short videos (up to ~10 seconds)
- Simple actions (pick and place)
- Controlled environments
It still can’t:
- Plan long-term (minutes, hours)
- Reason about completely new situations
- Fluidly combine language and video
LeCun estimates it will take "a few more years" to reach more complete versions.
AMI Labs: the next step
LeCun left Meta to pursue this vision through his startup, AMI Labs.
Goals:
- Systems that understand the physical world
- Persistent memory (remembering long context)
- Complex action planning
- Causal reasoning
“The goal is to bring the next big revolution in AI: systems that understand the physical world, have persistent memory, can reason and plan complex sequences of actions.”
My take
World Models are a risky bet against current consensus.
The consensus says: “scale LLMs, add more data, add more compute, eventually intelligence will emerge.”
LeCun says: “no, you need a different architecture that understands the physical world.”
Who’s right? Probably both have part of the truth:
- LLMs are phenomenal at language and knowledge
- World Models could be just as strong at physics and planning
- The future probably combines both
What’s interesting is that now there’s a serious, well-funded alternative led by someone with a proven track record.
And that’s good for everyone. Competition of ideas is what advances science.