World Models: AI that predicts the physical world
TL;DR
- World Models predict physical states of the world, not the next word (as LLMs do)
- V-JEPA learns from video: masks parts and predicts abstract representations, not pixels
- Results: state of the art in action anticipation, deployed on real robots
- LeCun’s bet: combine LLMs (language) with World Models (physics) for AGI
The problem with LLMs
Current language models (GPT, Claude, Llama) do one thing really well: predict the next word. World Models take a different path, and they are one of the AI trends for 2026 I'm tracking most closely.
Input: "The sky is..."
Output: "blue" (high probability)
They work. They impress. But they have a fundamental problem: they don’t understand the physical world.
If you ask an LLM “what happens if I drop a ball?”, it knows it “falls” because it’s read millions of texts saying so. Not because it understands gravity.
Yann LeCun puts it this way:
“A house cat has more common sense than GPT-4.”
A cat knows that if it pushes a glass, it falls. Not because it read about physics. Because it’s seen things fall.
What is a World Model
A World Model is an AI system that builds an internal representation of how the physical world works.
Instead of predicting words, it predicts states of the world:
| LLM | World Model |
|---|---|
| "What word comes next?" | "What happens next in this video?" |
| Learns from text | Learns from video/images |
| Predicts tokens | Predicts physical states |
| Understands language | Understands causality |
The idea isn’t new. In 2018, Ha and Schmidhuber published “World Models,” where an AI learned to play video games by building an internal model of the game.
What’s new is applying it at scale with real-world video.
V-JEPA: Meta’s World Model
V-JEPA (Video Joint Embedding Predictive Architecture) is the World Model LeCun's team developed at Meta before he left.
How it works
1. Takes video as input
Not text. Real video of the physical world: people walking, objects falling, hands manipulating things.
2. Divides video into “patches”
Like a transformer divides text into tokens, V-JEPA divides frames into spatiotemporal patches called “tubelets.”
3. Masks parts of the video
Literally blocks out regions of the video. “You can’t see what happens here.”
4. Predicts the hidden parts
But it does NOT predict exact pixels. It predicts an abstract representation of what should be there.
Video: [person raises arm] [███████] [arm up]
                               ↑
                      What should be here?
V-JEPA predicts: "upward arm movement"
(not the exact pixels, but the concept)
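To make steps 2 and 3 concrete, here's a minimal sketch of how a clip can be cut into tubelets and partially masked. The shapes, patch size and masking ratio are illustrative choices, not V-JEPA's exact configuration, and the prediction in step 4 happens on encoder outputs, not on these raw patches.

import numpy as np

T, H, W, C = 16, 224, 224, 3            # frames, height, width, channels (illustrative)
t, p = 2, 16                            # tubelet size: 2 frames x 16x16 pixels

video = np.random.rand(T, H, W, C)      # stand-in for a real clip

# Step 2: split the clip into non-overlapping spatio-temporal patches ("tubelets")
tubelets = video.reshape(T // t, t, H // p, p, W // p, p, C)
tubelets = tubelets.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * p * p * C)
print(tubelets.shape)                   # (1568, 1536): 1568 tubelets of 1536 values each

# Step 3: hide a large fraction of them; the model only sees the rest (the context)
hidden = np.random.rand(len(tubelets)) < 0.75
context_patches = tubelets[~hidden]     # what the encoder gets to see
target_patches = tubelets[hidden]       # what must be predicted, in representation
                                        # space (step 4), not as pixels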
Why representations, not pixels
Here’s the key trick.
Predicting pixels is useless:
- The world has unpredictable details (leaves moving, reflections, noise)
- Forcing the model to predict those details wastes capacity
- Results in blurry outputs that aren’t useful
Predicting abstract representations:
- The model learns structure and causality
- Ignores irrelevant details
- Captures “what’s happening,” not “exactly how it looks”
It’s like the difference between:
- “The ball descended 2.3 meters in 0.7 seconds” (pixels)
- “The ball fell” (abstract representation)
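A toy way to see the difference: compute the loss once against the raw pixels of the hidden region and once against an encoder's embedding of it. Everything below, including the random-weight "encoder", is a stand-in for illustration, not how V-JEPA is actually built.

import numpy as np

rng = np.random.default_rng(0)

masked_pixels = rng.random(1536)           # ground-truth pixels of the hidden region
W_enc = rng.random((1536, 64))             # hypothetical frozen encoder weights

def encode(x):
    return np.tanh(x @ W_enc)              # 64-dim abstract representation

# Option A: predict pixels -> the loss punishes every leaf, reflection and noise pixel
pred_pixels = rng.random(1536)
loss_pixels = np.mean((pred_pixels - masked_pixels) ** 2)

# Option B (the JEPA choice): predict the representation -> only abstract content counts
pred_repr = rng.random(64)
loss_repr = np.mean((pred_repr - encode(masked_pixels)) ** 2)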
JEPA architecture
JEPA = Joint Embedding Predictive Architecture
                  Video input
                 /           \
       visible parts     hidden parts
             │                 │
             ▼                 ▼
        ┌─────────┐       ┌─────────┐
        │ Encoder │       │ Encoder │   (same encoder)
        └────┬────┘       └────┬────┘
             │                 │
             ▼                 ▼
          Context            Target
        embeddings        embeddings
             │                 ▲
             ▼                 │
       ┌───────────┐           │
       │ Predictor │───────────┘
       └───────────┘

     Goal: the prediction matches the target embeddings
Encoder: converts the video into abstract representations.
Predictor: given the context embeddings, predicts the representations of the hidden parts.
Goal: the prediction matches the actual representation of what was hidden.
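To make that loop concrete, here is a minimal sketch of one JEPA-style training step, assuming PyTorch. The tiny linear layers standing in for the encoder and predictor, and the exponential-moving-average update of the target branch (the usual way JEPA-style models keep the target encoder from being trained directly by gradients), are drastic simplifications, not Meta's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim_patch, dim_emb = 1536, 256

encoder = nn.Linear(dim_patch, dim_emb)          # context encoder (trained by gradients)
target_encoder = nn.Linear(dim_patch, dim_emb)   # target branch (updated only by EMA)
target_encoder.load_state_dict(encoder.state_dict())
for param in target_encoder.parameters():
    param.requires_grad_(False)
predictor = nn.Linear(dim_emb, dim_emb)

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

context_patches = torch.randn(32, dim_patch)     # visible tubelets (stand-ins)
masked_patches = torch.randn(32, dim_patch)      # hidden tubelets (stand-ins)

# Forward pass: embed the context, then predict the embeddings of the hidden parts
prediction = predictor(encoder(context_patches))
with torch.no_grad():                            # targets receive no gradient
    target = target_encoder(masked_patches)

loss = F.mse_loss(prediction, target)            # match representations, not pixels
opt.zero_grad()
loss.backward()
opt.step()

# The target encoder slowly follows the context encoder (exponential moving average)
momentum = 0.99
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)

The line that matters is the loss: the prediction is compared with embeddings, never with pixels.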
V-JEPA 2: results
Meta published V-JEPA 2 in 2025. The results:
Training:
- 1 million hours of internet video
- 1 million images
- No human labels (self-supervised)
Benchmarks:
- 77.3% on Something-Something v2 (understanding actions)
- 39.7% on Epic-Kitchens-100 (anticipating actions) - state of the art
- 84.0% on PerceptionTest (video QA) - state of the art at the 8B-parameter scale
Robotics:
- Trained with only 62 hours of robot video
- Deployed on real robotic arms
- Capable of pick-and-place tasks without task-specific training
The model never saw those robots or those objects. But it understands physics enough to plan actions.
Why it matters for robotics
Current robots are programmed with explicit rules:
if object_detected:          # hand-coded rule: fires only in the exact scripted case
    move_arm(x, y, z)        # pre-programmed coordinates
    close_gripper()
    lift()
This is fragile. Any variation breaks the system.
With World Models, a robot can:
- See the situation
- Imagine what happens with different actions
- Choose the action that leads to the desired state
No explicit rules needed. It understands cause and effect.
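A sketch of what that loop could look like, in the style of random-shooting model-predictive control: sample candidate action sequences, roll each one forward inside the learned model, and execute the first action of the sequence whose imagined end state lands closest to the goal. The toy encode and world_model functions below are random stand-ins for a trained system like V-JEPA 2, not Meta's planner.

import numpy as np

rng = np.random.default_rng(0)
dim_obs, dim_state, dim_action = 128, 64, 4

W_enc = rng.standard_normal((dim_obs, dim_state)) * 0.1     # toy "encoder" weights
W_act = rng.standard_normal((dim_action, dim_state)) * 0.1  # toy dynamics weights

def encode(observation):
    # Abstract state of a camera observation (stand-in for a trained video encoder)
    return np.tanh(observation @ W_enc)

def world_model(state, action):
    # Predicts the next abstract state given the current one and a candidate action
    return np.tanh(state + action @ W_act)

current = encode(rng.random(dim_obs))    # 1. see the situation
goal = encode(rng.random(dim_obs))       # desired end state (e.g. object placed)

best_plan, best_dist = None, np.inf
for _ in range(256):                     # 2. imagine many candidate plans
    plan = rng.uniform(-1.0, 1.0, (5, dim_action))
    state = current
    for action in plan:
        state = world_model(state, action)   # roll the plan forward in imagination
    dist = np.linalg.norm(state - goal)
    if dist < best_dist:                 # 3. keep the plan that ends closest to the goal
        best_plan, best_dist = plan, dist

next_action = best_plan[0]               # execute, observe, and re-plan at the next step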
Limitations (for now)
V-JEPA 2 works well for:
- Short videos (up to ~10 seconds)
- Simple actions (pick and place)
- Controlled environments
It still can’t:
- Plan long-term (minutes, hours)
- Reason about completely new situations
- Fluidly combine language and video
LeCun estimates it will take "a few more years" to reach more complete versions.
AMI Labs: the next step
LeCun left Meta to pursue this vision through his startup, AMI Labs.
Goals:
- Systems that understand the physical world
- Persistent memory (remembering long context)
- Complex action planning
- Causal reasoning
“The goal is to bring the next big revolution in AI: systems that understand the physical world, have persistent memory, can reason and plan complex sequences of actions.”
My take
World Models are a risky bet against current consensus.
The consensus says: “scale LLMs, add more data, add more compute, eventually intelligence will emerge.”
LeCun says: “no, you need a different architecture that understands the physical world.”
Who’s right? Probably both have part of the truth:
- LLMs are phenomenal at language and knowledge
- World Models could be just as strong at physics and planning
- The future probably combines both
What’s interesting is that now there’s a serious, well-funded alternative led by someone with a proven track record.
And that’s good for everyone. Competition of ideas is what advances science.