AI is running out of internet to eat
TL;DR
- AI models consume data faster than we generate new content
- Quality internet data for training is almost depleted
- The solution: synthetic data (AI training on AI-generated data)
- This changes the rules of the game for everyone
There’s a problem nobody’s talking about.
Large language models (GPT, Claude, Gemini, etc.) are trained on internet text. Books, articles, forums, Wikipedia, code, documents…
The problem: they’ve already consumed almost all of it.
And we generate new content slower than AI can process it.
The numbers
The World Economic Forum warned in late 2025: high-quality data for AI training is running out.
It’s not that there’s no data. There’s more data than ever. But:
- Most of it is garbage (spam, duplicates, low-quality AI-generated content)
- The good stuff is already used
- The new content we generate can't keep pace with the appetite of ever-larger models
It’s like having a whale that needs tons of krill daily, and the ocean is running out of krill.
The solution (and why it’s weird)
The solution they’re using: synthetic data.
Meaning: AI generating data to train other AI.
Sounds like a snake eating its tail. And partly it is. But Microsoft (SynthLLM project) has shown it works when done right.
LQMs (Large Quantitative Models) are also emerging. Unlike LLMs that learn from historical text, LQMs learn from equations and physical principles. They can simulate results without needing real data.
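To make the idea concrete, here's a minimal sketch of generating training data from a physical law instead of scraping text. This is a toy illustration, not an actual LQM or Microsoft's SynthLLM pipeline; the equation (projectile range on flat ground) and all names are my own assumptions.

```python
import math
import random

G = 9.81  # gravitational acceleration, m/s^2 (assumed constant)

def projectile_range(v, theta_rad):
    # Range of a projectile launched on flat ground: R = v^2 * sin(2*theta) / g.
    # A known equation acts as the "teacher" -- no scraped data required.
    return v * v * math.sin(2 * theta_rad) / G

random.seed(42)
dataset = []
for _ in range(1000):
    v = random.uniform(1, 50)                       # launch speed, m/s
    theta = random.uniform(0.1, math.pi / 2 - 0.1)  # launch angle, rad
    # Each (input, output) pair is computed, not collected:
    dataset.append(((v, theta), projectile_range(v, theta)))

print(len(dataset), "synthetic training pairs, e.g.", dataset[0])
```

The point: when the ground truth is an equation, you can mint unlimited, uncontaminated training pairs, which is exactly the appeal when human text runs dry.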
Why you should care
If you work with data or AI, this affects you:
1. Model quality may plateau
Without fresh, high-quality data, models improve more slowly, or stop improving. The exponential pace of "a better model every 6 months" may slow down.
2. Your internal data is worth more
Companies with proprietary quality data (not public on the internet) have an advantage. That data isn’t “contaminated” or used. It’s gold.
3. Original human content is scarce
Ironically, the more content AI generates, the less original human content exists. And original content is what they need to improve.
The absurd cycle
- AI generates content
- Humans publish AI-generated content
- AI trains on that content
- AI generates content based on AI content
- Progressive degradation
It’s like making a photocopy of a photocopy of a photocopy. Each generation loses quality.
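The photocopy effect can be shown in a few lines. This is a deliberately simplified sketch (my own toy setup, not a real training pipeline): each "generation" fits a Gaussian to samples drawn from the previous generation's model, and because the model under-samples rare cases (here, we trim the tails), the distribution collapses over time.

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "human" data distribution
history = [sigma]

for generation in range(30):
    # Draw a finite synthetic dataset from the current model
    samples = sorted(random.gauss(mu, sigma) for _ in range(200))
    # Models under-represent rare events: drop the tails before refitting
    trimmed = samples[20:-20]
    # Refit the next generation's model on its own (trimmed) output
    mu = statistics.fmean(trimmed)
    sigma = statistics.pstdev(trimmed)
    history.append(sigma)

print(f"spread per generation: {history[0]:.3f} -> {history[-1]:.6f}")
```

Each pass loses a little of the original diversity, and the losses compound: after 30 generations the spread has collapsed to nearly zero, which is the statistical version of the photocopy-of-a-photocopy.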
What this means for the future
Short term: nothing changes. Current models are already trained.
Medium term:
- We’ll see more specialized models (trained on high-quality niche data)
- Synthetic data will become the norm
- Companies with proprietary data will have competitive advantage
Long term: nobody knows. But the era of “train on the entire internet” is coming to an end.
The reflection
We’ve spent years worried that AI will take our jobs.
Turns out AI has a more basic problem: it’s running out of food.
I’m not saying this will stop progress. But it does change the rules. And as always, whoever understands the new rules first, wins.