GLM-4.7-Flash: The Local LLM That Actually Does What It Promises (Mostly)

GLM-4.7-Flash is delivering reliable agentic performance on consumer hardware, but the path to getting it running reveals the messy reality of local AI deployment.

by Andre Banandre

The local LLM community has a new obsession, and for once, it might be justified. GLM-4.7-Flash isn’t just another quantized model promising the moon and delivering a buggy mess. Early reports show it actually handles agentic workflows (cloning repos, running commands, editing files) without the typical MoE meltdown. But getting there required a community sprint that exposes exactly how fragile the local AI ecosystem remains.

The Promise: Agentic Reliability on Modest Hardware

What makes GLM-4.7-Flash genuinely interesting isn’t the benchmark numbers (though those are solid). It’s that developers are running it for hours in agentic frameworks like OpenCode without a single tool-calling error. One developer reported generating “hundreds of thousands of tokens in one session with context compacting” while the model cloned repositories, executed commands, and committed changes flawlessly.

The architecture backs this up: a 30B-A3B MoE (that’s 30 billion total parameters, 3 billion active) with a 200,000-token context window. On paper, it’s the same size class as Qwen3-30B-A3B, but the performance characteristics tell a different story. The model hits 95 tokens/second on short prompts with dual RTX 3090s, settling to a steady 80 t/s. Even with 15K context, it maintains a usable 20-30 t/s, speeds that make actual interactive development possible.

Performance comparison showing GLM-4.7-Flash hitting 95 t/s on short prompts

The benchmarks confirm the real-world sentiment. In the Unsloth evaluation, GLM-4.7-Flash scores 59.2% on SWE-bench Verified and 79.5% on τ²-Bench, both measuring multi-turn agentic performance. That’s not just good; it’s competitive with models twice its size and miles ahead of Qwen3-30B’s 22% on SWE-bench.

The Reality: A Deployment Saga

Here’s where the story gets messy. The model dropped on Hugging Face, and immediately the community hit walls. The official vLLM implementation? Broken in the nightly Docker images. Building from scratch? A day-long download on hotel WiFi. As one developer put it after wasting an afternoon: “Really should have just trusted the llama.cpp gang to get there first.”

And they did. Within hours, a community member submitted PR #18936 to llama.cpp, adding support for the Glm4MoeLite architecture. The implementation treats it as “just a renamed version of DeepSeekV3 with some code moved around”, which turned out to be both accurate and problematic.
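
If you want to try it locally today, that means building llama.cpp from a checkout that already contains the PR. A minimal build sketch, assuming a CUDA machine (swap the backend flag for ROCm or Metal as needed):

```bash
# Build llama.cpp from source with the merged Glm4MoeLite support.
# Assumes a checkout that includes PR #18936; adjust -DGGML_CUDA=ON
# to your backend (e.g. -DGGML_HIP=ON for ROCm).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```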

The Flash Attention Trap

The first major gotcha: Flash Attention is broken. Not just slow, actually broken. Multiple developers reported 3x performance drops with -fa 1 compared to -fa 0. The issue appears architecture-specific; as one contributor noted, “the model is only running with -fa off (or it disables it if set to auto). Explicitly setting it to -fa on seems to fall back to CPU.”

This isn’t a CUDA-only problem. AMD users on ROCm see the same slowdown. The root cause seems to be missing MLA (Multi-head Latent Attention) kernel support for GLM-4.7-Flash’s specific tensor dimensions. One llama.cpp maintainer acknowledged the fix is “above my pay-grade at this point” and said it will need a separate PR.
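
Until that PR lands, the workaround is simply to keep Flash Attention off when launching the server. A minimal sketch, assuming the llama-server binary built above; the GGUF filename is a placeholder, and older builds spell the flag -fa 0 rather than -fa off:

```bash
# Serve GLM-4.7-Flash with Flash Attention explicitly disabled.
# The model path is a placeholder; point it at your own GGUF.
./build/bin/llama-server \
    -m ./models/GLM-4.7-Flash-BF16.gguf \
    -c 32768 \
    -ngl 99 \
    -fa off
```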

The Quantization Quagmire

Then came the quantization issues. Early GGUF conversions crashed with segmentation faults or produced garbled output. The problem? Tokenizer mismatches. GLM-4.7-Flash’s vocabulary has 154,856 tokens versus GLM-4.5’s 151,365, and the special token IDs don’t align. The <|user|> token is supposed to act as an EOS signal, but early GGUFs only set <|endoftext|>, causing infinite generation loops where the model talks to itself:

Wait - looking at `User` (the first line of my internal thought block)...
Ah, I see. In this specific simulation/turn-based interface:
1. User says "hello"
2. Model responds with weirdness
3. **Current Turn**: The user is sending the exact same command...

The fix required hardcoding <|user|> and <|observation|> as end-of-generation tokens in llama.cpp’s vocabulary handling. But this created a new problem: quantized versions started looping or producing gibberish. The Unsloth team eventually recommended BF16 as the only stable format, with specific sampling parameters (--temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1) to reduce repetition.
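
In practice that means passing the samplers explicitly rather than trusting defaults. A sketch of how they map onto a llama-cli run, with the same placeholder model path as before:

```bash
# Unsloth-recommended sampling settings to curb repetition loops,
# passed straight through to llama-cli. Model path is a placeholder.
./build/bin/llama-cli \
    -m ./models/GLM-4.7-Flash-BF16.gguf \
    -fa off \
    --temp 0.2 \
    --top-k 50 \
    --top-p 0.95 \
    --min-p 0.01 \
    --dry-multiplier 1.1
```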

The Numbers That Matter

Let’s talk concrete performance. On a dual RTX 3090 setup, the model delivers:

| Metric | Performance |
| --- | --- |
| Short prompts | ~95 t/s initial, 80 t/s sustained |
| Long context (15K) | 20-30 t/s |
| Max context tested | 26,624 tokens |
| VRAM usage (AWQ 4-bit) | ~24 GB with MTP disabled |

Disabling Multi-Token Prediction (MTP) saved 5GB of VRAM and fixed logic errors that dropped accuracy to 1% on complex tasks. The vLLM deployment required transformers installed from Git source and careful GPU memory management:

podman run -it --rm \
    --device nvidia.com/gpu=all \
    --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface:z \
    glm-4.7-custom \
    serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
    --tensor-parallel-size 2 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --max-model-len 26624 \
    --gpu-memory-utilization 0.94
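
The glm-4.7-custom image itself isn’t published anywhere; all the deployment notes above tell us is that transformers had to come from Git source. One plausible way to build it, assuming the stock vllm/vllm-openai base (the base image and its entrypoint are assumptions, not something the notes specify):

```bash
# Hypothetical build of the glm-4.7-custom image referenced above:
# layer a Git checkout of transformers on top of the stock vLLM image.
podman build -t glm-4.7-custom -f - . <<'EOF'
FROM vllm/vllm-openai:latest
RUN pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git"
EOF
```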

The Competition: Where It Actually Stands

The community has been brutally honest about comparisons. Against Nemotron-30B, GLM-4.7-Flash “crushes it” for most use cases, but Nemotron still wins for “data analysis, scientific problem solving, technical concept adherence, and disparate concept synthesis.” For pure coding and tool use, GLM-4.7-Flash is the clear winner.

The Qwen3 comparison is more nuanced. Qwen3-30B-A3B scores 66% on LCB v6 vs GLM-4.7’s 64%, but GLM wins on SWE-bench (59.2% vs 22%) and τ²-Bench (79.5% vs 49%). The difference? GLM’s architecture seems optimized for sustained agentic workflows, while Qwen3 focuses on single-turn reasoning.

The Bottom Line

GLM-4.7-Flash represents something rare: a model that delivers on its core promise. It actually works for local agentic development, which is revolutionary in a space where “agentic” usually means “works for five minutes then hallucinates a tool call that crashes your system.”

But the deployment friction is a reality check. You need:
– Latest llama.cpp from source (post-PR #18936)
– BF16 format for stability (71GB+)
– Flash Attention disabled
– Specific sampling parameters to avoid loops
– Patience for quantization issues

For developers willing to navigate these quirks, the reward is a local agent that can handle complex, multi-file software engineering tasks at speeds that won’t make you want to switch back to cloud APIs. For the ecosystem, it’s proof that community-driven support can beat official implementations, if you’re willing to endure the growing pains.

The model’s success also raises questions about MoE efficiency. With only 3 billion active parameters achieving this performance, it makes dense models look wasteful. But as one developer noted, “these super sparse small MoE’s are just mildly useful idiots” compared to the depth of dense models for specific tasks. GLM-4.7-Flash might be the exception that proves the rule.

For now, the verdict is clear: if you need a reliable local agent and can spare the VRAM, GLM-4.7-Flash is worth the setup hassle. Just don’t expect it to be plug-and-play. The community has done the heavy lifting, but the finish line is still a moving target.
