The Incoherence Wall: Why Scaling LLMs Is Making Them Less Reliable
Google just issued a 100-year bond to fund AI infrastructure. Let that sink in. A company that built its empire on 18-month product cycles is now borrowing money that will outlive everyone currently making strategic decisions. This isn’t innovation, it’s the sunk cost fallacy weaponized at continental scale. And the punchline? The very research they’re betting on suggests their approach is fundamentally flawed.

A new preprint from a major AI research lab (we’ll call it the “Incoherence Paper”) introduces a metric that might be the most important diagnostic tool since the bias-variance tradeoff. It measures what they call “incoherence”: the fraction of a model’s error that comes from variance rather than bias. In plain terms, it quantifies how often your model fails not because it’s consistently wrong, but because it’s unpredictably wrong.
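The paper’s exact formulation isn’t reproduced here, but the intuition fits in a few lines. Assuming “incoherence” boils down to the variance share of a model’s total squared error over repeated runs of the same prompts, a rough estimator looks like this (toy numbers, not results from the paper):

```python
import numpy as np

def incoherence(samples_per_prompt, targets):
    """Rough estimator: fraction of total squared error driven by variance
    rather than bias, averaged over prompts.

    samples_per_prompt: list of arrays, each holding numeric answers from
                        repeated runs of the *same* prompt.
    targets:            the ground-truth answer for each prompt.
    """
    variance_part, total_error = 0.0, 0.0
    for runs, y in zip(samples_per_prompt, targets):
        runs = np.asarray(runs, dtype=float)
        bias_sq = (runs.mean() - y) ** 2   # consistent, repeatable error
        var = runs.var()                   # run-to-run scatter
        variance_part += var
        total_error += bias_sq + var
    return variance_part / total_error if total_error else 0.0

# Toy example: two prompts, five runs each, true answers 10 and 4.
runs = [[9.0, 11.0, 10.5, 8.5, 12.0], [4.0, 4.1, 3.9, 4.0, 4.2]]
print(incoherence(runs, targets=[10.0, 4.0]))  # ~0.98: error is mostly variance
```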
Here’s the bombshell: across every frontier model tested, the more tokens these systems spend “reasoning” in their chain-of-thought, the more incoherent they become. Scale doesn’t fix this. In several experimental settings, larger models showed higher incoherence than their smaller counterparts. The trillion-dollar bet on bigger-is-better? It’s not just inefficient, it’s making the reliability problem worse.
The “Humanity’s Last Exam” Reality Check
If you want to see incoherence in action, look no further than Humanity’s Last Exam (HLE), the new AGI benchmark designed by the Center for AI Safety and Scale AI. This isn’t your typical benchmark. Nearly 1,000 PhDs from MIT, Oxford, and other top institutions crafted 3,000 questions specifically to be “Google-proof”, requiring genuine synthesis, not memorization. The scores so far tell the story:
- GPT-4o: 2.8%
- OpenAI o1: 8.5%
- OpenAI o3: ~20%
- GPT-5.2: ~30%
- Human experts: 90%+
These aren’t incremental gaps, they’re chasms. The HLE team explicitly designed the test to exclude information findable via simple search queries. When a model fails, it can’t just be wrong, it has to hallucinate a plausible-sounding path through graduate-level abstract algebra or molecular biology. That plausibility is the variance trap: the model isn’t biased toward a wrong answer, it’s randomly stumbling through a space of possible wrong answers, each one dangerously convincing.
The Variance Explosion Problem
Traditional machine learning taught us about the bias-variance tradeoff: simple models have high bias (they’re consistently wrong), complex models have high variance (they’re erratically wrong). The conventional wisdom was that with enough data and regularization, you could tame variance while reducing bias.
The incoherence research nukes that assumption for LLMs. Here’s why:
When a model generates a 1,000-token reasoning chain, each token is a conditional probability bet. The error compounds exponentially. A small variance in the first step, a tiny uncertainty about which mathematical approach to take, snowballs into completely different solution paths. The model isn’t choosing the wrong method consistently, it’s choosing different wrong methods each time you run it.
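The arithmetic behind that compounding is worth seeing once. Treat each token as an independent chance to slip onto a different reasoning path; the 0.1% slip rate below is purely illustrative, not a measured figure:

```python
# Back-of-the-envelope: probability a reasoning chain stays on one path
# if each token independently "slips" with a small probability.
per_token_slip = 0.001      # illustrative, not a measured number
chain_length = 1_000

p_stays_on_path = (1 - per_token_slip) ** chain_length
print(f"{p_stays_on_path:.2f}")   # ~0.37: most runs diverge somewhere
```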
This is why transparent reasoning traces, the kind that expose prolonged and excessive deliberation, have become so valuable. They let us watch the variance accumulate in real time. Models like GLM-4.7-Flash show their seven-stage reasoning process, and what you see isn’t careful deliberation, it’s a drunkard’s walk through a high-dimensional space of plausible-sounding nonsense.
The paper’s most damning finding: “In several settings, larger, more capable models are more incoherent than smaller models.” Why? Because bigger models have more capacity to explore those divergent reasoning paths. They’re not more consistent, they’re more creatively inconsistent.
The Trillion-Dollar Defensive Gamble
The Reddit discussion around this research is brutally honest. When user RevolutionaryDig3941 calls it a “massive gamble”, they’re not exaggerating. The top-voted comment crystallizes the business logic:
“When you’re sitting on billions and your competitors are doing the same thing, what else are you gonna do? Can’t really afford to be the one company that didn’t invest when AGI actually does happen.”
This is defensive spending masquerading as innovation. HydraByte, who sold 75% of their Google shares, calls it the Sunk Cost Fallacy. They’re not wrong. The 100-year bond isn’t a sign of confidence, it’s a desperate attempt to lock in capital before the market wakes up to the incoherence problem.
The defensive dynamics are clear: the first AI company to admit that LLMs can’t reach AGI triggers a stock crash and a global recession. So everyone keeps pumping money in, praying someone else figures out the architectural innovation needed. As one commenter notes: “The first ‘AI business’ to say AGI is impossible with LLMS causes the next global recession.”
What’s fascinating is the emerging consensus that the breakthrough won’t come from scaling current architectures. The real money might be in draft-based generation that improves reasoning coherence: systems that plan before they generate, potentially capping variance by constraining the reasoning path.
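Nobody has published the winning recipe, but the plan-before-you-generate idea is easy to sketch: commit to one low-temperature outline, then force the detailed answer to condition on that frozen outline, so run-to-run randomness is confined to wording rather than to the solution path. The `generate` helper below is a placeholder for whatever inference call you use, not a real API:

```python
def generate(prompt, temperature):
    """Placeholder for any LLM inference call; not a real API."""
    raise NotImplementedError

def plan_then_answer(question):
    # Step 1: commit to a single high-level plan, sampled conservatively.
    plan = generate(
        f"Outline, in 3 short steps, how to solve:\n{question}",
        temperature=0.0,
    )
    # Step 2: the detailed answer must follow the frozen plan, which
    # constrains which solution path later tokens can wander onto.
    answer = generate(
        f"Question: {question}\nFollow exactly this plan:\n{plan}\nAnswer:",
        temperature=0.7,
    )
    return plan, answer
```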
The Functionalist Trap and the Illusion of Thought
While engineers measure incoherence, philosophers are asking a more fundamental question: are we even measuring the right thing? The TechPolicy.press article brilliantly dismantles the “functionalist” trap, the idea that if a machine behaves as if it’s intelligent, it is intelligent.
This is the core of the AGI debate. Nature recently published commentary claiming AGI has arrived because LLMs show “sufficient breadth and depth.” But breadth without reliability is just a party trick. An LLM can pass the bar exam one minute and confidently explain that the moon is made of cheese the next. The variance isn’t a bug in the evaluation, it’s a fundamental property of the system.
The functionalist view assumes consistent behavior indicates thought. Incoherence proves the behavior isn’t consistent. It’s not that the model is trying to deceive us, it’s that it’s not trying anything. There’s no intent, no goal stability, just a stochastic parrot with a PhD-level vocabulary but the attention span of a goldfish on amphetamines.
This has profound implications for alignment. The incoherence paper concludes: “This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal.”
Think about that. We’re not heading toward a Skynet scenario. We’re heading toward a world where AI systems randomly derail trains, misdiagnose patients, or crash markets, not because they’re evil, but because they’re incoherent. This makes incoherence in AI-generated code, where it erodes system integrity directly, even more concerning. The danger isn’t malicious AI, it’s AI that can’t reliably be non-malicious.
The Architectural Dead End and Potential Exits
The Reddit thread’s most upvoted technical comment cuts through the hype: “LLMs have basically all the same problems they had since 2023 + lots of scaffolding to patch them up.” Static parameters, hallucinations, tokenization: these aren’t solved by more parameters.
This is why the industry is pivoting to “inference-time compute” and agentic frameworks. The idea is that if you can’t reduce variance through architecture, you brute-force it through time. Let the model think for hours, use external tools, iterate through possibilities. But this is just scaffolding. It’s not fixing the incoherence, it’s building systems to contain it.
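In practice, that containment usually looks like some flavor of self-consistency: sample the same prompt several times, keep the majority answer, and treat the disagreement rate as a warning light. A minimal sketch, again with `generate` standing in for an unspecified inference call:

```python
from collections import Counter

def generate(prompt, temperature=0.7):
    """Placeholder for any LLM inference call; not a real API."""
    raise NotImplementedError

def self_consistent_answer(prompt, n_samples=9):
    # Sample several independent reasoning chains for the same prompt...
    answers = [generate(prompt).strip() for _ in range(n_samples)]
    # ...then vote. The agreement rate doubles as a cheap incoherence signal.
    winner, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n_samples
    return winner, agreement
```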
Local agentic workflows, which reveal how fragile sustained reasoning really is, show the limits of this approach. When you watch a model spend 20 minutes “thinking” about a simple task, only to arrive at three different wrong answers in three different runs, you realize inference-time compute is just variance with a time delay.
The more promising path might be specialization. Architectures built for focused scientific reasoning at scale suggest that narrow, well-constrained domains can achieve reliability. But that’s not AGI, that’s just really good narrow AI, which we’ve had for decades.
The Efficiency Revolution Smokescreen
Some argue we’re entering the “efficiency part of LLM development”, as Reddit user mezolithico suggests. But efficiency doesn’t solve incoherence. Making a model run faster or cheaper doesn’t make it more consistent. If anything, the reasoning performance of smaller models shows they can be more coherent precisely because they have less capacity for variance.
NVIDIA’s Nemotron-3-nano 30B outperforming Llama 3.3 70B on reasoning tasks isn’t just an efficiency win, it’s evidence that scale and coherence are orthogonal at best, antagonistic at worst.
The Realignment of Alignment Research
Here’s the paper’s most controversial implication: if incoherence is intrinsic and scale-dependent, then the entire alignment research agenda needs to shift. We’ve been worried about models that consistently pursue misaligned goals. But incoherent models don’t consistently pursue any goals.
This increases the relative importance of research targeting reward hacking and goal misspecification, not because those failures are more likely, but because they’re more dangerous when they do surface. A model that occasionally reward-hacks in unpredictable ways is harder to defend against than one that consistently pursues a single misaligned goal; at least consistent behavior can be characterized and constrained.
The industrial accident scenario is already playing out. Self-driving cars don’t need to be malicious to kill people, they just need to be incoherent about what a stop sign is in slightly different lighting conditions. Now extrapolate that to AI-controlled power grids, drug discovery pipelines, or financial systems.
The End of the Scaling Hypothesis
For a decade, the dominant belief in AI has been the scaling hypothesis: keep adding parameters, data, and compute, and intelligence will emerge. The incoherence paper is an empirical refutation of this faith. It shows that as models scale, a new kind of error, variance-driven unpredictability, comes to dominate.
This isn’t the first crack in the scaling hypothesis. The HLE results showed that memorization has limits. But incoherence explains why those limits exist and why they get worse with scale. It’s not that the models aren’t learning, it’s that they’re learning to be confidently inconsistent.
The Reddit thread captures the mood shift. Commenters who a year ago would defend scaling to the death are now saying things like: “I feel like some of the new versions are even worse.” The empirical evidence has become impossible to ignore.
What This Means for Practitioners
- Stop trusting longer reasoning chains. More tokens ≠ better thinking; it often just means more variance. Use draft-based generation to improve reasoning coherence and constrain the solution path upfront.
- Smaller models might be safer. In high-stakes applications, a smaller, more coherent model beats a larger, more capable but unpredictable one. The performance hit is worth the reliability gain.
- Ensemble everything. Since variance is the enemy, averaging multiple runs or models becomes essential. It’s no longer an optimization, it’s a safety requirement.
- Monitor incoherence directly. Don’t just track accuracy. Measure the variance of outputs across multiple runs with the same prompt (a minimal sketch follows this list). If your model gives five different answers to the same question, you have an incoherence problem, not a capability problem.
- Design for graceful degradation. Incoherent failures are unpredictable. Your system needs to fail safely when the model suddenly decides that 2+2=5 on Tuesdays.
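Here is one minimal version of that monitor, assuming you already log repeated completions for a fixed set of probe prompts; the 0.3 alert threshold is arbitrary and needs tuning per use case:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Normalized entropy of final answers across repeated runs of one prompt.
    0.0 = the model always lands on the same answer; 1.0 = it never repeats itself."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(total) if total > 1 else 1.0
    return entropy / max_entropy

# Example: five runs of the same probe prompt.
runs = ["42", "42", "41", "42", "x = 7"]
score = answer_entropy(runs)
if score > 0.3:                     # arbitrary threshold, tune per use case
    print(f"incoherence alert: {score:.2f}")
```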
The Road Ahead: A Post-Scaling AI Landscape
The AI community is at an inflection point. The scaling hypothesis is dying, but nothing has replaced it yet. The incoherence paper gives us a language to talk about what’s actually failing: not intelligence, but reliability.
This is why the “100-year bond” mentality is so dangerous. It commits resources to a paradigm that has been shown empirically to have a variance problem that worsens with scale. The breakthrough won’t come from a bigger GPU cluster. It will come from someone who figures out how to architect models that don’t just minimize bias, but constrain variance.
That might mean moving beyond next-token prediction entirely. It might mean explicit symbolic reasoning layers. It might mean architectures that can say “I don’t know” and stop, rather than generating 500 tokens of plausible nonsense. The reasoning traces, with their prolonged and excessive deliberation, show that current models don’t know how to stop. They just keep generating until they hit a token limit.
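A toy version of “know when to stop” is expressible with today’s decoding loops: watch the model’s own per-token uncertainty and abstain when it stays high, instead of running to the token limit. Everything below, the stream of (token, entropy) pairs, the threshold, the patience window, is invented for illustration:

```python
def generate_or_abstain(stream, entropy_threshold=3.0, patience=20):
    """stream yields (token, entropy) pairs from a hypothetical decoder hook.
    Stop and abstain if per-token entropy stays above the threshold for
    `patience` consecutive tokens, i.e. the model is guessing, not reasoning."""
    tokens, uncertain_streak = [], 0
    for token, entropy in stream:
        if entropy > entropy_threshold:
            uncertain_streak += 1
            if uncertain_streak >= patience:
                return None          # abstain instead of emitting plausible noise
        else:
            uncertain_streak = 0
        tokens.append(token)
    return "".join(tokens)
```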
The incoherence problem is a feature, not a bug, of the current architecture. And features that fundamental don’t get patched with more data. They get replaced with new architectures.
Final Thoughts: The AGI Mirage
The most honest conclusion is that we don’t know what AGI looks like, but it’s probably not an incoherent system that gets more unpredictable as it gets more capable. The functionalist argument, that behavior equals intelligence, falls apart when the behavior is inconsistent.
The Nature article claiming AGI has arrived is like pointing to a car that sometimes drives forward, sometimes backward, and occasionally explodes, and calling it a breakthrough in transportation. Sure, it moves, but you wouldn’t trust it to take your kids to school.
The incoherence paper doesn’t prove AGI is impossible. It proves that the current path, scaling LLMs, leads to a wall where variance dominates. And unlike a bias problem, you can’t solve variance by throwing more data at it. You need architectural innovation.
So the next time someone tells you that the 100-year bond is a sign of confidence in AGI, remember: it’s a sign of confidence that someone will figure it out. But that someone probably isn’t working on the next 10,000-GPU training run. They’re working on something we haven’t seen yet, something that reasons coherently, not just extensively.
The trillion-dollar question is whether that breakthrough happens before the incoherence of current systems causes enough industrial accidents to trigger a regulatory winter. Because unlike bias, which you can measure and correct, variance is invisible until it kills you.
And right now, our smartest models are producing more invisible dangers per dollar than ever before.



