Devstral Small Is Eating GLM 4.7 Flash’s Lunch, And the Benchmarks Never Saw It Coming

Why token efficiency trumps raw speed in local agentic coding, and how Devstral Small proves our performance metrics are fundamentally broken.

by Andre Banandre

The local LLM community has been measuring the wrong thing. For months, we’ve been obsessing over tokens per second like it’s the holy grail of model performance, only to discover that a slower model is quietly beating its “faster” rival where it actually matters: getting the job done.

Devstral Small isn’t winning on speed. It’s winning on intelligence per token, and that changes everything.

The Tokens Per Second Trap

A developer discovered that GLM 4.7 Flash, despite generating tokens nearly three times faster than Devstral Small, was taking longer to complete agentic coding tasks. The culprit? GLM’s “thinking” process was burning through tokens like a chainsaw through balsa wood, generating 3x more total tokens while Devstral Small quietly executed with surgical precision.

This isn’t a minor quirk. It’s a fundamental indictment of how we evaluate local models for real work. We’ve optimized our benchmarks for speedometers when we should be measuring lap times.

The developer’s workflow revealed something critical: when using these models as “code typists” with precise instructions, both handled the job. But Devstral Small’s superior token efficiency meant it finished first, despite its lower raw speed. More importantly, it demonstrated deeper built-in knowledge, correctly leveraging obscure PyTorch APIs without needing to search, while GLM 4.7 Flash burned tokens hunting for answers it should have known.

Why Agentic Coding Is a Different Beast

Agentic workflows compound token inefficiency. Every planning step, every tool call, every self-correction multiplies the token cost. A model that needs 500 tokens to plan what another can plan in 100 isn’t just five times more verbose: it’s burning 5x the memory, 5x the context window, and 5x the energy for the same outcome.
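
To see why, run the arithmetic. Here is a minimal sketch, with hypothetical throughput and token counts (not measured figures), of how wall-clock time falls out of total tokens divided by generation speed:

```python
# Back-of-envelope model: time-to-finish ~= total tokens generated / tokens per second.
# All numbers here are hypothetical, chosen only to illustrate the relationship.

def wall_clock_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time for a task, ignoring tool-call latency."""
    return total_tokens / tokens_per_second

# A "fast but verbose" model: roughly triple the throughput, but it also emits
# several times as many planning/thinking tokens on the same task.
fast_verbose = wall_clock_seconds(total_tokens=12_000, tokens_per_second=90)

# A "slow but efficient" model: a third of the throughput, a fraction of the tokens.
slow_efficient = wall_clock_seconds(total_tokens=3_500, tokens_per_second=35)

print(f"fast/verbose:   {fast_verbose:.0f}s")    # ~133s
print(f"slow/efficient: {slow_efficient:.0f}s")  # ~100s
```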

The numbers tell a stark story. GLM 4.7 Flash’s speed advantage evaporates when you measure time-to-success instead of tokens-per-second. In fact, reports indicate GLM can get stuck in thought loops on difficult problems, generating endless reasoning tokens without making progress. One developer noted they had to “dig deeper to find the start of a solution” just to break GLM out of its recursive thinking, a workaround that shouldn’t be necessary.

This highlights a dirty secret of local LLMs: thinking tokens aren’t created equal. Some models think efficiently, others think verbosely. Current benchmarks can’t tell the difference.

The Quantization Quagmire

The problem runs deeper than model architecture. Community testing reveals that GLM 4.7 Flash’s behavior varies dramatically across quantization levels. Some users report the model starts “looping in its thoughts” on hard problems, with smaller quants exacerbating the issue. Q8 appears stable, but smaller quants introduce instability that can derail entire coding sessions.

Devstral Small, by contrast, maintains more consistent reasoning across quantization scales. This isn’t just about stability, it’s about predictable token efficiency. When you’re running local agents, you need to know how many tokens a task will consume. Unpredictable models break resource planning and force constant monitoring.

The community has already started hacking around these limitations. A fine-tune of GLM 4.7 Flash using Claude Opus 4.5 high reasoning data attempts to compress those verbose thought patterns into more efficient reasoning traces. It’s a clever patch, but it proves the underlying point: raw token generation speed means nothing if your model can’t think efficiently.

Mistral Vibe generating a complete Space Invaders game: a case where token efficiency matters more than raw speed for complex multi-file tasks.

Benchmarks Are Broken, Here’s How to Fix Them

The real controversy isn’t that Devstral Small beats GLM 4.7 Flash. It’s that our evaluation frameworks completely missed this reality.

Standard benchmarks measure:
– Tokens per second (irrelevant for task completion)
– Single-turn code generation (ignores agentic workflows)
– Idealized conditions (no tool use, no error recovery)

What we should measure instead (a sketch of a harness that records these follows the list):
– Time-to-success: wall-clock time to a working solution
– Token efficiency: total tokens consumed per task
– Tool call optimization: how well models leverage external tools
– Context window pressure: how quickly models fill their context
– Retry rate: how often models need to regenerate or correct
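
Here is a minimal sketch of what such a harness could record per task. The agent and task interfaces (`run_step`, `is_solved`, and so on) are placeholders, not any existing framework’s API:

```python
import time
from dataclasses import dataclass

@dataclass
class TaskResult:
    solved: bool
    wall_clock_s: float   # time-to-success (or time-to-give-up)
    total_tokens: int     # every prompt and completion token the task consumed
    tool_calls: int       # how often external tools were invoked
    retries: int          # regenerations and self-corrections

def evaluate_task(agent, task, max_steps: int = 50) -> TaskResult:
    """Run one agentic task and record efficiency metrics, not just pass/fail."""
    start = time.monotonic()
    tokens = tool_calls = retries = 0
    for _ in range(max_steps):
        step = agent.run_step(task)                    # placeholder agent API
        tokens += step.prompt_tokens + step.completion_tokens
        tool_calls += step.tool_calls
        retries += int(step.was_retry)
        if task.is_solved():                           # e.g. the test suite passes
            break
    return TaskResult(task.is_solved(), time.monotonic() - start,
                      tokens, tool_calls, retries)
```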

The SERA framework from Ai2 points toward this future. By training smaller models (8B-32B) on synthetic agentic trajectories, they achieved 54.2% on SWE-bench Verified, competitive with much larger models. Their key insight? Smaller models can match specialized performance when trained efficiently, at a fraction of the compute cost.

This validates what Devstral Small users are discovering: efficiency beats scale when you’re operating under local constraints.

The Hardware Reality Check

Let’s talk about what “local” actually means. Running Devstral 2 (the big sibling) requires serious hardware; the hardware requirements for running Devstral models locally are no joke. But Devstral Small? That’s a model you can actually run on consumer hardware without melting your GPU.

This accessibility matters. A model that runs on an RTX 4090 with 24GB VRAM is fundamentally more useful than one that demands an H100 cluster, even if the H100 model is “better” on paper. The local LLM revolution was never about matching data center performance, it was about decentralizing AI development.
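
As a rough sanity check, the memory math for a card like that works out. The figures below assume Devstral Small’s commonly cited 24B parameter count and a roughly 5-bits-per-weight Q4-class quant; they are approximations, not measurements:

```python
# Back-of-envelope VRAM estimate for a ~24B-parameter model on a 24 GB GPU.
params_billion = 24       # assumed parameter count (Devstral Small class)
bits_per_weight = 5       # roughly what a Q4_K_M-style quant averages
weights_gb = params_billion * bits_per_weight / 8   # ~15 GB for the weights
kv_and_overhead_gb = 5    # generous allowance for KV cache and runtime buffers

print(f"~{weights_gb + kv_and_overhead_gb:.0f} GB total, inside a 24 GB card")  # ~20 GB
```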

GLM 4.7 Flash sits in an awkward middle ground. It’s fast but inefficient, capable but unpredictable. For developers building agentic workflows, that unpredictability is a deal-breaker. You can’t automate what you can’t trust.

The Ecosystem Angle

Mistral understands this shift. Their Mistral Vibe coding agent, powered by Devstral 2, emphasizes architecture-level understanding over raw token generation. The tool scans your entire codebase, maintains context across files, and executes complex refactoring tasks. This is agentic coding where token efficiency directly translates to capability.

The pricing reveals the strategy: Devstral 2 Small costs $0.10/M input tokens versus $0.40/M for the full model. Mistral is betting that efficient small models are the future, not just for local deployment but for cost-effective API usage too.
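
The same efficiency argument shows up on the invoice. A rough sketch of per-task input cost, using the quoted per-million prices but hypothetical token counts:

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Input-token cost for one task at a given per-million price."""
    return tokens / 1_000_000 * price_per_million_usd

# Hypothetical agentic task: an efficient small model reads and plans in ~40k
# input tokens; a pricier, more verbose run burns ~120k on the same job.
small_efficient = cost_usd(40_000, 0.10)    # $0.004 at the Devstral 2 Small rate
large_verbose   = cost_usd(120_000, 0.40)   # $0.048 at the full-model rate

print(f"{large_verbose / small_efficient:.0f}x cost difference per task")  # 12x
```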

Meanwhile, GLM-4.7-Flash’s performance in local agentic workflows shows promise but remains hampered by those inefficiency issues. The community is actively fine-tuning and quantizing to make it viable, which speaks volumes about the underlying model quality, but also highlights how much work is needed to make it practical.

The Controversial Take

Here’s what nobody wants to admit: We’ve been optimizing for the wrong metrics because they’re easier to measure.

Tokens per second is simple to benchmark. You can run llama.cpp and get a number. Time-to-success requires building actual agentic workflows, measuring task completion, controlling for prompt variations, and dealing with the messy reality of software development.

But that’s exactly what developers actually care about. The r/LocalLLaMA post that sparked this conversation didn’t include a single tokens-per-second measurement. It focused on task completion time and code quality, the metrics that matter when you’re shipping software.

This is why Devstral Small is a hidden gem. It’s not trying to win the benchmark game. It’s trying to get your code written efficiently, even if that means generating fewer tokens at a slower rate.

The Bigger Picture: Efficiency Over Scale

The Nemotron-3-nano 30B story parallels this perfectly. NVIDIA’s smaller model outperformed Llama 3.3 70B by focusing on efficiency rather than raw scale. The message is clear: parameter count is a poor proxy for capability.

Devstral Small extends this principle to the token level. More tokens ≠ better thinking. Faster generation ≠ faster completion. We’re watching a paradigm shift from “bigger is better” to “efficient is better”, and the benchmarks are struggling to keep up.

This has profound implications for the realities of multi-agent AI collaboration vs. marketing claims. If individual agents are token-inefficient, multi-agent systems become exponentially wasteful. The “hive mind” only works when each node operates with precision.

The Fine Print

None of this is to say GLM 4.7 Flash is a bad model. It’s remarkably capable for its size, and the community is actively solving its inefficiencies. The fine-tune using Claude Opus reasoning data shows promise for compressing those verbose thought patterns. Quantization improvements and better parameter tuning can mitigate the looping issues.

But the fundamental insight remains: token efficiency is a first-class metric. A model that thinks in 100 tokens what another needs 500 to think is inherently more valuable for agentic work, regardless of generation speed.

The challenge of running extremely large ‘open’ models locally further underscores this point. When models require data center infrastructure to run, they’re not truly local. Devstral Small operates in the sweet spot: capable enough for real work, efficient enough for consumer hardware.

What Needs to Change

  1. Benchmarks must evolve: We need standardized agentic task suites that measure time-to-success and token efficiency, not just perplexity and tokens/sec.
  2. Model cards should include efficiency metrics: Total tokens consumed per task, context window pressure, tool call optimization scores (see the sketch after this list).
  3. Quantization needs standardization: The variance in GLM 4.7 Flash’s behavior across quants is unacceptable for production use. Models should maintain consistent reasoning patterns across quantization levels.
  4. Community focus should shift: Less benchmarking, more real-world task evaluation. The r/LocalLLaMA post that started this was valuable precisely because it ignored the usual metrics and focused on practical outcomes.
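
For point 2, something like the following block attached to a model card would be enough to compare models on the terms that matter. Field names and numbers are illustrative, not an existing schema or measured results:

```python
# Illustrative efficiency block for a model card; every value here is made up.
efficiency_report = {
    "task_suite": "agentic-coding-eval (hypothetical)",
    "tasks_attempted": 100,
    "success_rate": 0.62,
    "time_to_success_s": {"median": 95, "p90": 240},
    "tokens_per_task": {"median": 4_200, "p90": 11_000},
    "tool_calls_per_task": {"median": 6},
    "retry_rate": 0.12,              # fraction of steps that were regenerations
    "context_pressure": 0.35,        # peak context used / context window size
    "quant_consistency": {"Q8_0": "stable", "Q4_K_M": "stable"},
}
```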

The Bottom Line

Devstral Small isn’t winning because it’s faster. It’s winning because it’s smarter with tokens. In a world where context windows are finite and local hardware has limits, that efficiency translates directly to capability.

The controversy isn’t that GLM 4.7 Flash loses, it’s that our entire evaluation framework failed to capture this reality until a random post pointed it out. We’ve been optimizing for the wrong things, building infrastructure around the wrong metrics, and celebrating the wrong victories.

Maybe it’s time to stop asking “how fast can it generate tokens?” and start asking “how quickly can it ship working code?” The answer might surprise you, and it might just be Devstral Small.


For more on the challenges facing Mistral’s ecosystem, see our coverage of Mistral’s Devstral 2 community backlash over testing and integration issues.
