More tokens doesn't mean better results


TL;DR

  • Tested a research-level problem (random walk on 2D torus)
  • More web searches, more tokens, more instructions → same error
  • The model consumed so many resources the chat errored with “overflow”
  • Lesson: when it doesn’t understand the problem, more resources = more rationalization

The problem

I tried a research-level problem: compute the probability that a simple random walk on a 2D torus, started from a point x₀, visits the origin before returning to x₀.

Correct answer: e^(-π/2) ≈ 0.208
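
To make the problem concrete, here's a minimal Monte Carlo sketch. The torus size, the trial count, and the start cell x₀ = (1, 1) are my illustrative assumptions (the post doesn't pin them down), and the quoted value presumably holds in a specific setup or large-torus limit, so treat the estimate as orientation, not verification:

```python
import random

def origin_before_return(n=32, start=(1, 1)):
    """One simple random walk on an n x n torus: True if it reaches
    the origin (0, 0) before coming back to its starting cell."""
    x, y = start
    while True:  # on a finite torus the walk hits one of the two cells a.s.
        dx, dy = random.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        x, y = (x + dx) % n, (y + dy) % n
        if (x, y) == (0, 0):
            return True
        if (x, y) == start:
            return False

trials = 5_000
hits = sum(origin_before_return() for _ in range(trials))
print(f"estimated probability: {hits / trials:.3f}")
```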

Setup                             Result
No tools                          Made up formulas (0, 1/e, 1/2)
With internet                     Found correct theory, extracted a wrong value
With hint “that value is wrong”   Fixed value, misapplied the formula

Each layer of tools helped partially but introduced new errors.

Strategy 2: Exhaustive meta-prompt

I designed a prompt (sketched below) that instructed the model to:

  • Search multiple sources
  • Verify every extracted value
  • Compare results between papers
  • Only respond when everything matches
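
The post doesn't preserve the exact wording, but reconstructed from those four bullets the meta-prompt looked roughly like this:

```
Before giving a final answer:
1. Search at least three independent sources for the result.
2. Verify every numerical value you extract against its source.
3. Compare the results across papers and flag any disagreement.
4. Answer only when all sources and extracted values agree.
```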

Result: the model ran so many searches and compactions that the chat errored out with “no more compacts allowed.” It was the first time I’d seen that error.

And the final answer, after all that consumption: 1/2 (incorrect, and the same simple heuristic as before).

Why it happened

The model used an elegant but incorrect argument:

“The origin and x₀ share 2 of 4 neighbors, so the probability is 1/2”

When it doesn’t understand the underlying problem, more resources just mean more space to rationalize the wrong answer.
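
Sharing neighbors constrains only the first step; which point the walk reaches first is a global property of the whole chain. One way to see the 1/2 heuristic fail without any searching or prompting is to solve the hitting probabilities exactly on a small torus by first-step analysis. As above, the torus size and the start cell (1, 1) (the diagonal neighbor the shared-neighbors argument implies) are my illustrative assumptions:

```python
import numpy as np

def p_origin_before_return(n=16, start=(1, 1)):
    """Exact hitting probability on an n x n torus via first-step analysis.

    h[v] = P(reach origin before start | walk at v), with boundary
    conditions h(origin) = 1 and h(start) = 0; elsewhere h equals the
    average of its four neighbors. The quantity we want is the mean of
    h over the start cell's four neighbors."""
    idx = {(x, y): x * n + y for x in range(n) for y in range(n)}
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    A = np.zeros((n * n, n * n))
    b = np.zeros(n * n)
    for (x, y), i in idx.items():
        A[i, i] = 1.0
        if (x, y) == (0, 0):
            b[i] = 1.0                      # absorbing: h = 1 at the origin
        elif (x, y) == start:
            b[i] = 0.0                      # absorbing: h = 0 at the start
        else:
            for dx, dy in steps:            # harmonic: h = mean of neighbors
                A[i, idx[((x + dx) % n, (y + dy) % n)]] -= 0.25
    h = np.linalg.solve(A, b)
    neighbors = [idx[((start[0] + dx) % n, (start[1] + dy) % n)]
                 for dx, dy in steps]
    return float(np.mean(h[neighbors]))

# exact value for this finite setup; compare with the heuristic's 1/2
print(p_origin_before_return())
```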

Lesson

More X                 Better result?
More thinking tokens   ❌ If it doesn’t know, it rationalizes
More web searches      ⚠️ Can extract wrong data
More compactions       ❌ Loses useful context
More instructions      ❌ Can ignore them

Prompt engineering has a ceiling. For problems that require specialized technical knowledge the model doesn’t have, no prompt solves it.


This is the third experiment in the series. It started with the model that wouldn’t commit.

To know when to use elaborate prompts and when not to, read my taxonomy of LLM failures.

This post is part of my series on the limits of prompting. For a complete view, read my prompt engineering guide.
