GLM-4.7-Flash’s CUDA Fix: When Flash Attention Was the Problem, Not the Solution
A critical CUDA fix for GLM-4.7-Flash in llama.cpp reveals how a flash-attention optimization was actively sabotaging local inference speeds, and why the community had to step in and fix it themselves.