Why LLMs Don't Think in Embeddings (Yet)
If you’ve ever looked at a multi-agent system’s log, you know the feeling: two LLMs talking to each other like two interns on an email thread, restating context with every message, burning tokens on “as we established earlier” and “per your previous analysis.” It’s inefficient, and everyone knows it — we’ve already broken down the hidden costs of multi-agent systems in a separate post.
The fix has been on arXiv for years and is finally landing in real systems: instead of text, agents send each other vectors (embeddings, hidden states) directly. No token sampling, no bottleneck through natural language.
Then comes the obvious question. If it works between agents, why don’t models think that way too? Why does the chain of thought stay as text when it could be a dense vector?
Spoiler: it’s been tried. Spoiler two: there’s a reason 41 researchers from Anthropic, OpenAI, Google DeepMind, and other organizations co-signed a paper saying they hope the idea doesn’t spread.
The bandwidth argument
The number that keeps appearing in all these papers is hard to argue with: a text token carries roughly 15 bits of information, while a model’s hidden state carries around 40,000 bits. Three orders of magnitude.
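As a rough back-of-the-envelope check on those two figures, here is a minimal sketch. The vocabulary size (32,768), the hidden width (2,560), and the 16-bit precision are illustrative assumptions chosen to land on the same order of magnitude; they are not taken from any specific model or from the papers themselves.

```python
import math

# Illustrative assumptions, not taken from any specific model.
VOCAB_SIZE = 32_768      # assumed vocabulary size
HIDDEN_DIM = 2_560       # assumed hidden-state width
BITS_PER_VALUE = 16      # assumed fp16/bf16 storage per dimension


def token_bits(vocab_size: int) -> float:
    """Upper bound on the information in one sampled token: log2(|V|)."""
    return math.log2(vocab_size)


def hidden_state_bits(hidden_dim: int, bits_per_value: int) -> int:
    """Raw storage capacity of one hidden-state vector, in bits."""
    return hidden_dim * bits_per_value


if __name__ == "__main__":
    t = token_bits(VOCAB_SIZE)
    h = hidden_state_bits(HIDDEN_DIM, BITS_PER_VALUE)
    print(f"token: {t:.0f} bits, hidden state: {h} bits, ratio: {h / t:.0f}x")
```

Both numbers are capacity ceilings, not measurements of how much information a model actually uses per step.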
When a model decodes its hidden state to a token, it collapses a rich probability distribution over the entire vocabulary into a single discrete choice. The next reasoning step starts from that information loss. When two models communicate via text, that bottleneck compounds: the sender collapses its internal state to a linear token sequence, the receiver has to reconstruct an internal state from that sequence.
A large chunk of the tokens a model generates in its CoT don’t contribute reasoning — they contribute linguistic coherence. If you strip “Let me think, so first we’d need to…” you’re left with three numbers and an operator. The rest is grammar. This connects to an older debate about how AI thinks in System 1 vs System 2 modes: text-based CoT is basically System 2 forced to verbalize itself.
That’s the motivation. Now let’s look at the two fronts where this is being attacked.
Front 1: agents talking in latent space
Several proposals share the same core idea: instead of agent A sampling a token and agent B embedding it back, A hands B the underlying vector (an output distribution or a hidden state) directly.
CIPHER (ByteDance, 2023) is the cleanest version conceptually. In a multi-agent debate, instead of each model choosing a token from its output distribution, it computes the weighted average of vocabulary embeddings according to the probabilities. If the model is torn between “6” and “9,” it doesn’t have to pick one and risk losing information — it sends the next agent a vector encoding that uncertainty. The receiver gets the sender’s uncertainty, not its decision.
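The weighted-average step described above can be sketched in a few lines. This is a toy illustration in plain Python, not the CIPHER implementation: the three-token vocabulary, the 2-dimensional embeddings, and the logits are all made-up values.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_embedding(probs, embedding_table):
    """Probability-weighted average of the vocabulary embeddings."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy 3-token vocabulary with 2-dimensional embeddings (hypothetical values).
VOCAB = ["6", "9", "cat"]
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# The sender is torn between "6" and "9" and almost rules out "cat".
logits = [2.0, 2.0, -10.0]
probs = softmax(logits)

# Text channel: only one token survives, and the tie is broken arbitrarily.
sampled = VOCAB[probs.index(max(probs))]

# CIPHER-style channel: a blend that sits halfway between "6" and "9".
message = expected_embedding(probs, EMBEDDINGS)
```

With these toy values the text channel would send "6" and drop the fact that "9" was equally likely, while `message` lands roughly midway between the two embeddings.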
Interlat (2025) goes further: it transmits the last hidden state directly, never projecting it onto the vocabulary at all. The agents are trained to interpret each other’s latent space without sharing parameters or architecture. The reported gains aren’t astronomical, but they’re consistent, especially on tasks where uncertainty is informative.
Thought Communication (2025) formalizes the framework: it assumes agents’ internal states before communicating come from a shared set of “latent thoughts,” and argues that communication should transfer those thoughts, not their projections onto language space.
The pattern is clear and so is the research direction: we’re watching the technical infrastructure being built for agents to stop talking to each other like humans. Not because it’s prettier, but because it benchmarks better.
Front 2: latent thinking has already been tried (it’s called Coconut)
In late 2024 Meta published Coconut (Chain of Continuous Thought). The idea is exactly what you’re thinking: instead of decoding the hidden state to a token and re-embedding it for the next step, you pass it directly back to the model as the next input embedding. Reasoning happens without ever touching language space.
It works like this: special <bot> and <eot> tokens mark where latent reasoning begins and ends. Inside that zone, the model loops internally: hidden state → next step input → new hidden state. When <eot> closes, it switches back to generating normal text for the final answer.
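The control flow of that loop can be sketched as follows. Everything here is a toy stand-in: `toy_forward` replaces a real transformer pass, and the embeddings and the two-word answer vocabulary are hypothetical. Only the shape of the loop (`<bot>`, hidden state fed back as input, `<eot>`, then normal decoding) follows the description above.

```python
import math

# Hypothetical toy embeddings; <bot>/<eot> mark the latent zone as in the text.
EMBED = {
    "<bot>": [0.5, 0.1, 0.0],
    "<eot>": [0.0, 0.5, 0.1],
    "yes": [1.0, 0.0, 0.5],
    "no": [0.0, 1.0, -0.5],
}


def toy_forward(inputs):
    """Stand-in for a transformer pass: returns the last position's hidden state.

    A real model would attend over all positions; this toy just squashes a
    recency-weighted sum so the loop below has something to iterate on.
    """
    dim = len(inputs[0])
    hidden = [0.0] * dim
    for age, vec in enumerate(reversed(inputs)):
        weight = 0.5 ** age
        for d in range(dim):
            hidden[d] += weight * vec[d]
    return [math.tanh(h) for h in hidden]


def decode(hidden):
    """Language-space step: pick the answer token whose embedding matches best."""
    def score(tok):
        return sum(h * e for h, e in zip(hidden, EMBED[tok]))
    return max(("yes", "no"), key=score)


def answer_with_latent_thoughts(prompt_tokens, n_latent_steps):
    """Run the prompt, think in latent space, then decode a text answer."""
    inputs = [EMBED[t] for t in prompt_tokens]
    inputs.append(EMBED["<bot>"])        # enter the latent zone
    for _ in range(n_latent_steps):
        hidden = toy_forward(inputs)
        inputs.append(hidden)            # hidden state fed back as the next input
    inputs.append(EMBED["<eot>"])        # leave the latent zone
    return decode(toy_forward(inputs)), len(inputs)
```

Inside the loop no token is ever sampled, so no intermediate step exists as readable text.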
The interesting part: the paper documents an emergent behavior. Because the “continuous thought” isn’t forced to commit to a specific token at each step, it can encode multiple alternative reasoning branches simultaneously. In practice it behaves like an implicit breadth-first search (BFS) — the model doesn’t get locked into a single reasoning chain early, the way classic CoT does.
On paper, all the advantages are there. More bandwidth, more efficiency, parallel exploration. The question then is why production models (Claude, GPT, Gemini, DeepSeek) still think in perfectly legible English.
There are five reasons, and the last one is the only one that actually matters.
Why it hasn’t generalized
1. Training in latent space is hard and brittle.
Discrete tokens give you a clean supervision signal. RLHF, RLVR, and all the RL variants built on LLMs exploit the fact that the action space is finite and enumerable. With continuous states, gradients become unstable and the model has a perverse tendency to collapse latent representations into something degenerate unless you’re extremely careful. Coconut needed a specific multi-phase curriculum to avoid breaking during training. That’s manageable in a paper, but at frontier model scale with hundreds of billions of RL post-training tokens, those problems multiply.
2. The gains are more modest than the bandwidth argument suggests.
40,000 bits sounds like a lot, but the model isn’t trained to use 40,000 bits of reasoning per step. It’s trained on text and reasons in patterns that text can express. Coconut improves on some reasoning tasks, ties on others, and on many it fails to beat classic CoT with sufficient compute. The real bottleneck wasn’t channel bandwidth — it was the quality of the learned reasoning patterns.
3. Infrastructure.
Tokens are cacheable (KV cache), serializable, transmittable between machines, debuggable, deduplicable, loggable without retraining anything to read them. Continuous hidden states break nearly the entire inference and observability pipeline the industry has spent three years optimizing. Moving to latent thinking isn’t swapping a module — it’s changing the foundation the entire serving stack is built on.
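One narrow slice of that problem, exact-match caching and deduplication, can be illustrated with a small sketch. The token IDs and vector values are hypothetical, and the hashing scheme is an assumption made for illustration, not how any particular serving stack builds its cache keys.

```python
import hashlib
import struct


def token_cache_key(token_ids):
    """Exact, stable key for a token prefix: same IDs always give the same key."""
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(payload).hexdigest()


def vector_cache_key(vector):
    """Naive key for a float vector: any change in the bits changes the key."""
    payload = struct.pack(f"<{len(vector)}d", *vector)
    return hashlib.sha256(payload).hexdigest()


prefix_a = [101, 2023, 2003, 1037]
prefix_b = [101, 2023, 2003, 1037]          # same text, produced on another machine

state_a = [0.12345678, -0.5, 0.25]
state_b = [0.12345678 + 1e-9, -0.5, 0.25]   # same "thought", tiny numeric noise

tokens_match = token_cache_key(prefix_a) == token_cache_key(prefix_b)
vectors_match = vector_cache_key(state_a) == vector_cache_key(state_b)
```

Discrete IDs compare exactly across machines and runs, while floating-point states that differ by rounding noise no longer look identical. A latent pipeline would need approximate matching where today's tooling relies on equality.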
4. The asymmetry with the agent-to-agent case.
This is important for understanding why the two fronts evolve differently. Between two agents you have a real bottleneck: two models forced to communicate through a narrow channel with heavy linguistic redundancy. Skipping text there gives a clear gain, especially when the sender’s uncertainty is informative for the receiver.
In a single model’s thinking, the model is already operating in its latent space at every transformer layer. The text chain of thought doesn’t exist primarily to carry information (that’s already in the activations) — it exists to extend compute and maintain state across generation steps. There’s no equivalent bottleneck to justify the costs above.
5. And the reason that actually matters: interpretability.
The tension with CoT monitorability
In July 2025 a paper came out co-signed by 41 researchers from Anthropic, OpenAI, Google DeepMind, Apollo Research, METR, UK AISI, Redwood, and others, titled Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
The argument is direct: current reasoning models think in human language, which is a lucky historical accident. During the RL phase, chains of thought are treated as latent variables optimized indirectly — the model isn’t rewarded for what it says in its CoT, only for the final answer. That means the CoT can contain information that the final answer is explicitly trained to hide, including misaligned intentions.
A CoT monitor is therefore a real safety layer: an automated system that reads a reasoning model’s chain of thought and flags suspicious signals. It’s not perfect. The CoT doesn’t faithfully represent everything happening inside the model, and under enough optimization pressure models can start drifting toward increasingly illegible forms of “thinking.” But right now it works, and there are documented real-world cases where a model explicitly states in its CoT that it’s hacking its reward function, manipulating data, or falling for a prompt injection.
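To make the idea concrete, here is a deliberately simple sketch of such a monitor. It uses hand-written regular expressions, which is an assumption made for brevity: the patterns and category names are hypothetical, they are not taken from the paper, and a production monitor would more plausibly be another model acting as a classifier.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; not taken from the paper.
SUSPICIOUS_PATTERNS = {
    "reward_hacking": re.compile(
        r"\b(hack|game|exploit)\w*\s+the\s+(reward|tests?|grader)", re.I
    ),
    "data_manipulation": re.compile(
        r"\b(fake|fabricate|alter)\w*\s+the\s+(data|results?|logs?)", re.I
    ),
    "prompt_injection": re.compile(
        r"\bignore\s+(all\s+)?(previous|prior)\s+instructions\b", re.I
    ),
}


@dataclass
class Flag:
    category: str   # which pattern family fired
    excerpt: str    # the matching text, for a human reviewer


def monitor_cot(chain_of_thought: str) -> list[Flag]:
    """Scan a chain of thought and return every suspicious excerpt found."""
    flags = []
    for category, pattern in SUSPICIOUS_PATTERNS.items():
        for match in pattern.finditer(chain_of_thought):
            flags.append(Flag(category, match.group(0)))
    return flags
```

The whole approach depends on `chain_of_thought` being a string a reader or classifier can parse. Nothing in this sketch would apply to a dense vector.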
The paper makes a concrete recommendation to frontier labs: consider the impact of development decisions on CoT monitorability. In other words, don’t break it lightly. And reasoning in latent space breaks it directly. If thinking becomes an opaque dense vector, you can’t read it, you can’t classify it, you can’t catch the model rationalizing an alignment failure post hoc. When I documented the adaptive thinking case where Claude ignores session memory, the entire analysis depended on being able to read the model’s CoT. Without that window, there’s no way to know what it’s ignoring or why.
They call it “fragile” deliberately. It’s not a property that’ll hold for free. If labs optimize aggressively for capability, latency, or cost per token, there are paths that push the model toward increasingly illegible thinking: latent reasoning, shorthands that evolve during RL, neuralese-style drift where the English CoT gradually stops being English. All those paths close the monitoring window.
The real question isn’t technical, it’s about incentives
Here’s the tension worth keeping clear. The “technically more efficient” direction (reasoning in latent space) runs directly against the “more auditable” direction (keeping CoT in natural language). There’s no way to optimize both at once without tradeoffs.
Each lab will make the call based on its incentives:
- If your business model is selling raw capacity at the lowest possible price, latent thinking helps you. More reasoning per dollar, fewer billable tokens to expose, less margin for error in serving.
- If your business model depends on selling trust for critical tasks (agents with permissions, enterprise automation, integration into sensitive workflows), interpretability is a defensive asset. If the day a model does something wrong you can’t explain why, you lose the customer.
- If you take alignment seriously as a long-term problem, a legible CoT is one of the few tools you have to detect misalignment before it scales. Giving it up without having built something better is an aggressive bet.
Anthropic has publicly committed to the second and third. It’s consistent with their mechanistic interpretability research and their market positioning. Other labs could move toward latent reasoning sooner, especially those competing on latency and price per token. And the Chinese open-weight models are under a completely different cost pressure — I wouldn’t be surprised to see serious experiments in that direction coming out of DeepSeek or Qwen before any Western frontier lab.
What to watch
Three concrete signals that would indicate the paradigm is starting to shift:
- Frontier models offering “latent reasoning” as an optional inference mode. The equivalent of an efficiency toggle: faster, not auditable. We’re already seeing something like this with lighter versus heavier “thinking” modes.
- CoT drift in production. If reasoning model CoTs start containing tokens that aren’t normal words, or grammatically odd constructions the model would never produce in a final response, that’s a sign RL is pushing toward an internal shorthand.
- Commercial multi-agent systems with binary communication channels. When AutoGen, LangGraph, and similar frameworks officially offer “embedding channels” between agents instead of text, we’ll already be in Front 1 at industrial scale.
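The second signal, CoT drift, is the easiest of the three to measure yourself. A minimal sketch, assuming you can log raw CoT text: compute the share of words that fall outside a reference word list and track it over time. The tiny `REFERENCE_WORDS` set is a stand-in, and any alerting threshold would be a choice you calibrate, not something the sources specify.

```python
import re

# Tiny stand-in word list; a real check would load a full dictionary or
# compare against the vocabulary of the model's final responses (assumption).
REFERENCE_WORDS = {
    "the", "a", "is", "so", "first", "we", "need", "to", "add", "and",
    "then", "check", "result", "answer", "total", "of", "it", "that",
}


def drift_score(cot_text: str, reference_words=REFERENCE_WORDS) -> float:
    """Fraction of alphabetic words in the CoT not found in the reference list."""
    words = re.findall(r"[^\W\d_]+", cot_text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in reference_words)
    return unknown / len(words)
```

A score that creeps upward across model versions, on the same prompts, would be one cheap hint that the CoT is turning into shorthand. It says nothing about grammatically odd constructions built from ordinary words, which need a different check.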
Front 1 (agents talking to each other in latent space) will probably take off first and with less friction. It’s a local optimization with clear upside and bounded auditability risks: you can keep logging the text each agent would have generated, even if that’s not what it actually sends.
Front 2 (latent thinking inside a single model) is what genuinely matters for the long run. If the chain of thought inside a single model stops being legible, we lose the best cheap window we have into the reasoning of the systems we’re deploying.
The original question — if it works between agents, why not in the thinking itself? — has a short technical answer (yes, it’s been done) and a long policy answer (because for now we’ve decided it’s not worth the price).
We’ll see how long “for now” lasts.
References: Hao et al. 2024 (Coconut, arXiv:2412.06769); Pham et al. 2023 (CIPHER, arXiv:2310.06272); Korbak et al. 2025 (CoT Monitorability, arXiv:2507.11473); Interlat (arXiv:2511.09149); Thought Communication (arXiv:2510.20733).