
GLM-4.7-Flash’s CUDA Fix: When Flash Attention Was the Problem, Not the Solution

A critical CUDA fix for GLM-4.7-Flash in llama.cpp reveals how a performance optimization was actively sabotaging local inference speeds, and why the community had to rebuild the wheel to make it work.

by Andre Banandre

For the past week, GLM-4.7-Flash users have been living a paradox: enabling Flash Attention, the same optimization that makes models like Qwen3 scream at 3154 tokens per second, was turning their local inference into a CPU-bound crawl. The culprit wasn’t a misconfigured server or insufficient VRAM. It was a tiny missing code path in llama.cpp’s CUDA backend that treated GLM-4.7’s attention pattern as an alien artifact.

The fix, merged in PR #18953, exposes a dirty secret of local LLM deployment: sometimes the performance regression isn’t in the model architecture, it’s in the infrastructure’s assumptions about what constitutes “normal” tensor dimensions.

The GQA Ratio That Broke CUDA

GLM-4.7-Flash uses a custom variant of DeepSeek’s Multi-head Latent Attention (MLA) with a Grouped Query Attention ratio of 20. For context, most models llama.cpp supports use GQA ratios of 1, 2, 4, 8, or 16. When Flash Attention was enabled with -fa 1, the CUDA kernel would encounter tensor dimensions it couldn’t optimize, specifically ncols2 = 20, and fall back to a slow path that hammered the CPU with memory transfers.

As one developer noted in Issue #18944, the unsupported dimensions caused Flash Attention to run at half speed on Pascal GPUs and triggered a 3x slowdown on modern hardware compared to running without FA, even though raw llama-bench prefill numbers could still look healthy:

# Before the fix: prefill benchmarks masked how badly -fa 1 behaved
llama-bench -m GLM-4.7-FLASH-Q4_0.gguf -ngl 99 -fa 0 -p 8192  # 2011 t/s
llama-bench -m GLM-4.7-FLASH-Q4_0.gguf -ngl 99 -fa 1 -p 8192  # 2328 t/s (but with CPU thrashing)

The real damage wasn’t visible in synthetic benchmarks. Users reported single CPU cores pinned at 100% while their RTX 4090s sat idle, and generation speeds plummeted from 160 tokens/second to 60. As one Reddit user put it: “It’s not just abnormally high CPU usage, the prefill and decode speeds are incredibly low, with long texts experiencing a slowdown never seen before.”
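
If you want to check whether your own setup is hitting this kind of fallback, watching the GPU while a generation runs is usually enough: a pinned CPU core next to a near-idle GPU is the telltale sign. A generic way to do that with standard NVIDIA tooling (not something from the bug reports, just a sanity check):

# Poll GPU utilization and memory once per second while a generation is running.
# A near-idle GPU alongside a pinned CPU core points at the CPU-fallback path.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1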

The Two-Line Fix That Unlocked 3B Active Parameters

The solution was deceptively simple: add support for gqa_ratio = 4 in the Flash Attention kernel’s dispatch logic. GLM-4.7’s ratio is technically 20, but the ncols2 parameter only needs to be a power of two that cleanly divides the GQA ratio, and 4 is the largest value that qualifies. The PR added that missing case, allowing the kernel to use an optimized MMA (Matrix Multiply-Accumulate) path instead of falling back to generic CPU code.
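
If you want to convince yourself of that divisor arithmetic, a throwaway shell check does it (purely illustrative, not code from the PR):

# Largest power of two that divides the GQA ratio: 20 -> 4
R=20; N=1; while [ $((R % (N * 2))) -eq 0 ]; do N=$((N * 2)); done; echo "ncols2 = $N"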

Performance impact was immediate and brutal, in the best way:

# After the fix: parity restored
llama-bench -m GLM-4.7-FLASH-Q4_K_XL.gguf -ngl 99 -fa 0 -p 2048  # 2588 t/s
llama-bench -m GLM-4.7-FLASH-Q4_K_XL.gguf -ngl 99 -fa 1 -p 2048  # 2665 t/s

The gap between Flash Attention on and off collapsed to under 3%. More importantly, CPU usage dropped from 100% on a single core to normal background levels, and the quantized KV cache, initially a victim of the same bug, began working correctly with -ctk q8_0 -ctv q8_0.

Quantized Cache: The Secondary Casualty

The initial fix only solved half the problem. Users quickly discovered that enabling quantized KV cache with Flash Attention still forced computations onto the CPU. This was tracked in a separate issue where q8_0 cache types would trigger the same slow path as the original GQA bug.

A subsequent commit (b70d251) resolved this, bringing quantized cache performance in line with f16:

# Quantized cache finally works at full speed
llama-bench -m GLM-4.7-FLASH-Q4_K_XL.gguf -ngl 99 -fa 1 -p 2048 -ctk q8_0 -ctv q8_0
# Result: 2675 t/s pp2048, 102 t/s tg128

This is critical for low-resource deployment. The Q4_K_XL quantization already shrinks GLM-4.7’s 30B parameters to 16.31 GiB, but the KV cache for a 32K context can consume another 4-8 GB. Using Q8_0 cache cuts that memory footprint in half, making the model viable on 24GB GPUs, the current sweet spot for local AI enthusiasts.
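
The halving claim is easy to sanity-check. The per-token figure below is a made-up placeholder (the real value depends on GLM-4.7’s layer count and MLA latent width); what matters is the ratio, since q8_0 stores 34 bytes for every 32 elements that f16 stores in 64:

# Back-of-envelope KV-cache sizing; BYTES_PER_TOKEN_F16 is illustrative, not from the model card
CTX=32768
BYTES_PER_TOKEN_F16=262144                              # assume ~256 KiB per token at f16
BYTES_PER_TOKEN_Q8=$((BYTES_PER_TOKEN_F16 * 34 / 64))   # q8_0: 34 bytes per 32-element block
echo "f16 cache:  $((CTX * BYTES_PER_TOKEN_F16 / 1048576)) MiB"
echo "q8_0 cache: $((CTX * BYTES_PER_TOKEN_Q8 / 1048576)) MiB"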

Real-World Deployment: From Broken to “Be Your Own Cloud”

The CUDA fix transforms GLM-4.7-Flash from a promising but frustrating model into a genuine alternative to cloud APIs. One developer’s comprehensive guide shows how to run Claude Code against a local GLM-4.7 instance using llama.cpp’s Anthropic-compatible API endpoint:

# Server command with the fixed CUDA paths
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
  --alias glm-4.7-flash \
  --jinja --ctx-size 32768 \
  --temp 0.7 --top-p 1.0 \
  --sleep-idle-seconds 300 \
  --host 0.0.0.0 --port 8080

The --sleep-idle-seconds flag is particularly important here. When running GLM-4.7 as a persistent service exposed via Cloudflare tunnels, you want GPU memory freed during idle periods. The fix ensures that when the model reloads after sleeping, it does so with full CUDA acceleration, not the crippled CPU fallback users suffered before.
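
For the Cloudflare side, the quickest route is a TryCloudflare quick tunnel pointed at the llama-server port, something like the sketch below (assuming cloudflared is installed and you’re fine with an unauthenticated throwaway tunnel rather than a named one):

# Expose the local llama-server endpoint through a temporary Cloudflare quick tunnel
cloudflared tunnel --url http://localhost:8080

For anything beyond experimentation, a named tunnel with access policies in front of it is the saner choice.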

Be your own cloud

This setup delivers 50-80 tokens/second on an RTX 3090 while using only 23GB/24GB VRAM at 45K context, as reported by early adopters. For local agentic workflows, this means you can run GLM-4.7-Flash as a drop-in replacement for Claude Haiku without the latency and cost of API calls.
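
Pointing Claude Code at that endpoint is mostly a matter of environment variables. A minimal sketch, assuming Claude Code’s standard ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN / ANTHROPIC_MODEL overrides and the --alias from the server command above:

# Route Claude Code to the local llama-server instead of Anthropic's API
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=placeholder        # any value works if llama-server runs without --api-key
export ANTHROPIC_MODEL=glm-4.7-flash           # matches the --alias above
claude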

The Controversy: Did We Need Flash Attention at All?

Here’s where the community splits. The fix stops Flash Attention from actively hurting GLM-4.7, but some benchmarks suggest it doesn’t meaningfully help either. One user posted detailed llama-bench results showing that, for this specific architecture, Flash Attention provides only marginal speedups on prefill (2328 vs 2011 t/s) and essentially none on token generation.
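
Anyone wanting to reproduce that comparison on their own hardware can reuse the llama-bench invocations from earlier, adding a generation run (-n 128) so both prefill and decode land in the same table:

# Prefill (pp2048) and generation (tg128) with Flash Attention off, then on
llama-bench -m GLM-4.7-FLASH-Q4_K_XL.gguf -ngl 99 -fa 0 -p 2048 -n 128
llama-bench -m GLM-4.7-FLASH-Q4_K_XL.gguf -ngl 99 -fa 1 -p 2048 -n 128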

The debate centers on MLA’s inherent efficiency. Unlike standard multi-head attention, MLA already compresses the KV cache by 73%. The memory savings from Flash Attention’s tiling might be redundant when the cache is already a fraction of the size. As one maintainer noted: “For batch size 1, the tensor core utilization in the MMA kernel will only be 50%, so you should also check whether the tile kernel is faster.”

This raises a thorny question: did the community spend a week optimizing a kernel that GLM-4.7 barely benefits from? The answer depends on your batch size. For single-user interactive sessions, the difference is negligible. For multi-slot server deployments with continuous batching, the fixed Flash Attention path still reduces memory pressure and enables higher concurrency.
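
For the multi-user case, that means running llama-server with several slots. A rough sketch, assuming default continuous batching and the same model and alias as above:

# Four slots sharing one model; llama-server divides --ctx-size across slots (8K each here)
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
  --alias glm-4.7-flash \
  --jinja --ctx-size 32768 \
  --parallel 4 \
  --host 0.0.0.0 --port 8080

If each slot needs the full 32K window, raise --ctx-size accordingly and budget the extra KV-cache memory, which is exactly where the fixed FA path and q8_0 cache earn their keep.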

Edge Cases and Lingering Issues

The fix isn’t universal. Pascal GPU users (GTX 10-series) report that Flash Attention still runs at half the speed of the non-FA kernels on their hardware. The issue appears specific to compute capability 6.x and earlier, where the MMA kernel’s assumptions about tensor core availability don’t hold.
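
If you’re not sure which side of that line your card falls on, recent NVIDIA drivers can report the compute capability directly (the compute_cap query field needs a reasonably new driver):

# Print each GPU's name and compute capability; Pascal cards report 6.x
nvidia-smi --query-gpu=name,compute_cap --format=csv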

There’s also the Vulkan question. With AMD gaining ground in local AI via vLLM’s official Ryzen AI support, the lack of a Vulkan Flash Attention path for GLM-4.7 becomes more glaring. The CUDA fix doesn’t translate to ROCm or Vulkan backends, leaving AMD users with the CPU fallback they’ve always had.

And for those chasing extreme quantization, the 2-bit miracle that makes GLM-4.7 fit on 8GB GPUs remains incompatible with Flash Attention. The math for 2-bit dequantization can’t be fused with the attention kernel, forcing a slower split-k decomposition.

The Bottom Line: Local AI’s Infrastructure Debt

GLM-4.7-Flash’s CUDA fix is a microcosm of local AI’s growing pains. We’re running production-grade models on infrastructure built for a different era of transformer architecture. Every new model, especially ones using MLA, MoE, or exotic attention patterns, requires patches to the patchwork.

The silver lining? The fix came from the community in under 72 hours. Between the GitHub issue, the Reddit troubleshooting threads, and the Unsloth quantization pipeline, this was a distributed effort that didn’t wait for corporate prioritization. That’s the power of open weight models meeting open source infrastructure.

But it also means every promising new model is a potential landmine. The architectural quirks that make GLM-4.7 special, MLA with a GQA ratio of 20 chief among them, are exactly what break existing tooling. As local AI moves from hobbyist curiosity to production deployment, we’ll need more systematic approaches to architecture validation, lest every new release turn into a game of whack-a-mole with tensor dimensions.

 
