More tokens doesn't mean better results


TL;DR

  • Tested a research-level problem (random walk on 2D torus)
  • More web searches, more tokens, more instructions → same error
  • The model consumed so many resources the chat errored with “overflow”
  • Lesson: when it doesn’t understand the problem, more resources = more rationalization

The problem

I tried a research-level problem: compute the probability that a simple random walk on a 2D torus, started from a point x₀, visits the origin before returning to x₀.

Correct answer: e^(-π/2) ≈ 0.208
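
To make the problem concrete, here's a minimal Monte Carlo sketch. The torus size, the trial count, and the start cell x₀ = (1, 1) are my illustrative assumptions (the post doesn't pin them down), and the quoted value presumably holds in a specific setup or large-torus limit, so treat the estimate as orientation, not verification:

```python
import random

def origin_before_return(n=32, start=(1, 1)):
    """One simple random walk on an n x n torus: True if it reaches
    the origin (0, 0) before coming back to its starting cell."""
    x, y = start
    while True:  # on a finite torus the walk hits one of the two cells a.s.
        dx, dy = random.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        x, y = (x + dx) % n, (y + dy) % n
        if (x, y) == (0, 0):
            return True
        if (x, y) == start:
            return False

trials = 5_000
hits = sum(origin_before_return() for _ in range(trials))
print(f"estimated probability: {hits / trials:.3f}")
```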

Setup                             Result
No tools                          Made up formulas (0, 1/e, 1/2)
With internet                     Found correct theory, extracted a wrong value
With hint “that value is wrong”   Fixed value, misapplied the formula

Each layer of tools helped partially but introduced new errors.

Strategy 2: Exhaustive meta-prompt

I designed a prompt (sketched below) that instructed the model to:

  • Search multiple sources
  • Verify every extracted value
  • Compare results between papers
  • Only respond when everything matches
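
The post doesn't preserve the exact wording, but reconstructed from those four bullets the meta-prompt looked roughly like this:

```
Before giving a final answer:
1. Search at least three independent sources for the result.
2. Verify every numerical value you extract against its source.
3. Compare the results across papers and flag any disagreement.
4. Answer only when all sources and extracted values agree.
```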

Result: the model ran so many searches and compactions that the chat errored out with “no more compacts allowed.” It was the first time I’d seen that error.

And the final answer, after all that consumption: 1/2 (incorrect, and the same simple heuristic as before).

Why it happened

The model used an elegant but incorrect argument:

“The origin and x₀ share 2 of 4 neighbors, so the probability is 1/2”

When it doesn’t understand the underlying problem, more resources just mean more space to rationalize the wrong answer.
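
Sharing neighbors constrains only the first step; which point the walk reaches first is a global property of the whole chain. One way to see the 1/2 heuristic fail without any searching or prompting is to solve the hitting probabilities exactly on a small torus by first-step analysis. As above, the torus size and the start cell (1, 1) (the diagonal neighbor the shared-neighbors argument implies) are my illustrative assumptions:

```python
import numpy as np

def p_origin_before_return(n=16, start=(1, 1)):
    """Exact hitting probability on an n x n torus via first-step analysis.

    h[v] = P(reach origin before start | walk at v), with boundary
    conditions h(origin) = 1 and h(start) = 0; elsewhere h equals the
    average of its four neighbors. The quantity we want is the mean of
    h over the start cell's four neighbors."""
    idx = {(x, y): x * n + y for x in range(n) for y in range(n)}
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    A = np.zeros((n * n, n * n))
    b = np.zeros(n * n)
    for (x, y), i in idx.items():
        A[i, i] = 1.0
        if (x, y) == (0, 0):
            b[i] = 1.0                      # absorbing: h = 1 at the origin
        elif (x, y) == start:
            b[i] = 0.0                      # absorbing: h = 0 at the start
        else:
            for dx, dy in steps:            # harmonic: h = mean of neighbors
                A[i, idx[((x + dx) % n, (y + dy) % n)]] -= 0.25
    h = np.linalg.solve(A, b)
    neighbors = [idx[((start[0] + dx) % n, (start[1] + dy) % n)]
                 for dx, dy in steps]
    return float(np.mean(h[neighbors]))

# exact value for this finite setup; compare with the heuristic's 1/2
print(p_origin_before_return())
```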

Lesson

More X                 Better result?
More thinking tokens   ❌ If it doesn’t know, it rationalizes
More web searches      ⚠️ Can extract wrong data
More compactions       ❌ Loses useful context
More instructions      ❌ Can ignore them

Prompt engineering has a ceiling. For problems that require specialized technical knowledge the model doesn’t have, no prompt solves it.


This is the third experiment in the series. It started with the model that wouldn’t commit.

To know when to use elaborate prompts and when not to, read my taxonomy of LLM failures.

This post is part of my series on the limits of prompting. For a complete view, read my prompt engineering guide.
