# More tokens don't mean better results
## TL;DR
- Tested a research-level problem (random walk on 2D torus)
- More web searches, more tokens, more instructions → same error
- The model consumed so many resources the chat errored with “overflow”
- Lesson: when it doesn’t understand the problem, more resources = more rationalization
## The problem
I tried a research-level problem: compute the probability that a simple random walk on a 2D torus visits the origin before returning to its starting point x₀.
Correct answer: e^(-π/2) ≈ 0.208
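One check that no amount of prompting substitutes for is simulating the walk directly. Below is a minimal Monte Carlo sketch, under two assumptions of mine that the post leaves implicit: the starting point is x₀ = (1, 1), a diagonal neighbor of the origin (the configuration the model's shared-neighbor argument describes), and the torus is a finite 32×32 grid. The estimate depends on both assumptions and on torus size, so it can only sanity-check a candidate answer, not derive the closed form.

```python
import random

def estimate_hit_before_return(n=32, x0=(1, 1), trials=5000, seed=0):
    """Monte Carlo estimate of P(walk started at x0 visits the origin
    before returning to x0) for a simple random walk on an n x n torus.

    The start x0 and torus size n are illustrative assumptions, not
    part of the original problem statement.
    """
    rng = random.Random(seed)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    hits = 0
    for _ in range(trials):
        x, y = x0
        while True:
            dx, dy = steps[rng.randrange(4)]
            x, y = (x + dx) % n, (y + dy) % n
            if (x, y) == (0, 0):   # reached the origin first
                hits += 1
                break
            if (x, y) == x0:       # returned to the start first
                break
    return hits / trials

print(estimate_hit_before_return())
```

Because the torus is finite, each trial almost surely ends at one of the two points, so the loop terminates. A few lines of simulation like this would have immediately ruled candidate answers in or out, which is exactly what the layers of prompting never managed.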
## Strategy 1: Web search
| Setup | Result |
|---|---|
| No tools | Made up formulas (0, 1/e, 1/2) |
| With internet | Found correct theory, extracted a wrong value |
| With hint “that value is wrong” | Fixed value, misapplied the formula |
Each layer of tools helped partially but introduced new errors.
## Strategy 2: Exhaustive meta-prompt
I designed a prompt that instructed:
- Search multiple sources
- Verify every extracted value
- Compare results between papers
- Only respond when everything matches
Result: the model ran so many searches and context compactions that the chat errored out with “no more compacts allowed.” First time I’d seen this.
And the final answer after consuming massive resources: 1/2 (incorrect, the same simple heuristic). This connects to how AI thinks: System 1 vs System 2.
## Why it happened
The model used an elegant but incorrect argument:
“The origin and x₀ share 2 of 4 neighbors, so the probability is 1/2”
When it doesn’t understand the underlying problem, more resources just mean more space to rationalize the wrong answer.
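The premise of that argument is easy to verify mechanically; what doesn't follow is the jump from shared neighbors to a hitting probability. A small sketch, again assuming x₀ = (1, 1) on a torus:

```python
def neighbors(p, n=32):
    """4-neighborhood of point p on an n x n torus."""
    x, y = p
    return {((x + dx) % n, (y + dy) % n)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))}

origin, x0 = (0, 0), (1, 1)
shared = neighbors(origin) & neighbors(x0)
print(sorted(shared))  # [(0, 1), (1, 0)]: the "2 of 4 shared" premise holds
```

The counting premise is true; the error is that hitting probabilities depend on the global geometry of the walk, not on local neighbor overlap. That global structure is precisely the specialized knowledge the model lacked, and no prompt supplied it.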
## Lesson
| More X | Better result? |
|---|---|
| More thinking tokens | ❌ If it doesn’t know, it rationalizes |
| More web searches | ⚠️ Can extract wrong data |
| More compactions | ❌ Loses useful context |
| More instructions | ❌ Can ignore them |
Prompt engineering has a ceiling: for problems that require specialized technical knowledge the model doesn’t have, no prompt will close the gap.
This is the third experiment in the series. It started with the model that wouldn’t commit.
To know when to use elaborate prompts and when not to, read my taxonomy of LLM failures.
This post is part of my series on the limits of prompting. For a complete view, read my prompt engineering guide.
## Consulting
Got a similar problem with AI Integrations?
I can help. Tell me what you're dealing with and I'll give you an honest diagnosis — no commitment.
See consulting →