BANANDRE
NO ONE CARES ABOUT CODE


Tagged with #cuda

4 articles found

GLM-4.7-Flash’s CUDA Fix: When Flash Attention Was the Problem, Not the Solution
Featured

A critical CUDA fix for GLM-4.7-Flash in llama.cpp reveals how a performance optimization was actively sabotaging local inference speeds, and why the community had to rebuild the wheel to make it work.

#cuda #Flash Attention #GLM-4.7...
Vulkan Is Quietly Outpacing CUDA for Specific LLMs on Consumer GPUs

Benchmarks reveal Vulkan achieving up to a 2.2× speedup over CUDA for select quantized models on the RTX 3080, challenging assumptions about the optimal local inference backend.

#cuda #gpu-acceleration #llama.cpp...
DGX Spark: The Overpriced ‘DevBox’ That’s Quietly Reshaping AI Research

How NVIDIA’s $4,000 mini-supercomputer is sparking controversy by giving small academic labs a fighting chance against Big Tech’s GPU empires, while potentially locking them into CUDA forever.

#Academic-Research #cuda #DGX-Spark...
llama.cpp’s Qwen3 Integration Pits Local AI Against the Cloud Giants

After months of development, Qwen3-Next is finally coming to llama.cpp with optimized CUDA operations, enabling fast local inference on consumer NVIDIA hardware.

#cuda #llamacpp #local-ai...
© 2026 BANANDRE
Built with 🍌