The $20K Paperweight: How NVIDIA’s CUTLASS Bugs Turned Blackwell Workstations into Expensive Heaters

You just dropped twenty grand on NVIDIA’s flagship RTX PRO 6000 Blackwell Workstation Edition. You’ve got 96GB of GDDR7, PCIe Gen5, and the promise of blazing-fast local inference for massive MoE models. You fire up Qwen3.5-397B, expecting to chew through tokens at 130+ per second, and instead you hit a wall at 50.5 tok/s.
Not because of thermal throttling. Not because you forgot to enable CUDA graphs. Because NVIDIA’s own CUTLASS library, the foundational math kernels they ship with every toolkit, literally crashes when trying to use the FP4 tensor cores they put in the silicon.
This isn’t a performance-tuning issue. It’s a crack in NVIDIA’s AI hardware strategy, one that leaves early adopters stranded between marketing promises and silicon reality.
The 60% Performance Gap That Wasn’t Supposed to Exist
The controversy erupted when independent tester Brandon Music (handle: lawdawgattorney) published an exhaustive 8-hour benchmark analysis of Qwen3.5-397B NVFP4 on a 4x RTX PRO 6000 setup. The findings were brutal: 50.5 tok/s sustained decode is the hard ceiling, not the 130-150 tok/s figures floating around community forums.
The culprit? Every single one of NVIDIA’s TMA Warp Specialized grouped GEMM tactics, 80 in total, fails to initialize on SM120 hardware with the error:
Failed to initialize cutlass TMA WS grouped gemm.
Error: Error Internal (cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)
This isn’t a configuration error. This is NVIDIA’s own kernel library rejecting its own hardware. The RTX PRO 6000 has FP4 tensor cores specifically designed for NVFP4 quantization. NVIDIA ships NVFP4-quantized models designed to use them. But the CUTLASS kernels that should light them up for MoE inference? They crash, forcing a fallback to Marlin W4A16 dequantization that leaves roughly half the theoretical throughput on the table.
The SMEM Bomb: Why Workstation Blackwell Isn’t Datacenter Blackwell
Here’s where it gets technically spicy. SM120 (RTX PRO 6000, RTX 5090) and SM100 (datacenter B200) share the “Blackwell” name but diverge critically in shared memory architecture. Datacenter Blackwell dies offer ~227 KiB of SMEM per SM. Workstation SM120? A hard per-block limit of 99 KiB (101,376 bytes).
The CUTLASS grouped GEMM uses StageCountAutoCarveout to calculate pipeline stages for overlapping memory loads with compute. The bug is simple but devastating: the auto-computation formula fails to account for alignas(1024) padding on shared memory tensors. It calculates that 3 pipeline stages will fit, but the padding pushes the SharedStorage footprint over 101,376 bytes. When the driver attempts cudaFuncSetAttribute, it rejects the TMA descriptor and throws kErrorInternal.
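The arithmetic of the miscount is simple enough to model in a few lines. The tensor sizes below are hypothetical stand-ins (CUTLASS’s real SharedStorage layout is more involved), but they show how a stage count that ignores alignas(1024) padding blows past the 101,376-byte budget:

```python
# Illustrative model of the StageCountAutoCarveout miscount. All per-tensor
# sizes are hypothetical, not the actual CUTLASS tile footprints.

SMEM_LIMIT = 101_376  # per-block opt-in shared memory on SM120, in bytes

def align_up(n: int, a: int) -> int:
    """Round n up to the next multiple of a (models alignas(a) padding)."""
    return (n + a - 1) // a * a

# Hypothetical tensors staged per pipeline stage: A tile, B tile, scales
tensors = [16_500, 16_500, 600]  # bytes

naive_stage  = sum(tensors)                             # what the formula counts
padded_stage = sum(align_up(t, 1024) for t in tensors)  # what lands in SMEM

stages = SMEM_LIMIT // naive_stage  # auto-carveout concludes 3 stages fit
actual = stages * padded_stage      # real SharedStorage footprint

print(f"computed stages : {stages}")
print(f"assumed usage   : {stages * naive_stage:,} B (fits)")
print(f"actual usage    : {actual:,} B "
      f"({'OVERFLOW -> kErrorInternal' if actual > SMEM_LIMIT else 'fits'})")
```

On SM100’s ~227 KiB budget the same padding slack is absorbed without anyone noticing, which is exactly why the bug only detonates on workstation parts.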
The smoking gun? SM121 (DGX Spark) works perfectly at 356 TFLOPS for NVFP4 MoE. Same architecture generation, different validation path. NVIDIA validated the datacenter tile configs but left workstation SM120 to rot.
When The “Fix” Makes It Worse: The MTP Regression
Multi-Token Prediction (MTP) should theoretically boost throughput by drafting future tokens. On SM120 with the Marlin fallback, it’s a -22% regression (50.5 tok/s down to 39.6 tok/s).
The reason exposes another layer of the NVFP4 brokenness. MTP draft heads were trained on native FP4 activations. Marlin uses W4A16 dequantization, which produces slightly different activation values. Result: 61-85% token acceptance rates instead of the expected 89%. The overhead of speculating and rejecting outweighs any benefit, turning a performance feature into a liability.
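The break-even math makes the regression unsurprising. The model and constants below are illustrative assumptions, not measured values from the benchmark, but they show how the combination of lower acceptance and costlier verification flips speculative decoding from a win to a loss:

```python
# Toy break-even model for multi-token prediction (MTP). All constants
# below are illustrative assumptions, not measured benchmark values.

def expected_accepted(p: float, k: int) -> float:
    """Expected tokens emitted per verification step, assuming each of the
    k drafted tokens is accepted independently with probability p and the
    target model always contributes one token of its own."""
    return (1 - p ** (k + 1)) / (1 - p)

def speedup(p: float, k: int, alpha: float, draft_cost: float) -> float:
    """Throughput relative to plain decoding.

    alpha      -- marginal cost of verifying one extra drafted token
                  (~0 when decode is memory-bound; rises toward 1 when a
                  dequantizing path like Marlin makes it compute-bound)
    draft_cost -- one draft-head forward, relative to a full decode step
    """
    return expected_accepted(p, k) / (1 + alpha * k + draft_cost * k)

# Native FP4 path: cheap verification, high acceptance -> MTP wins
print(f"native : {speedup(0.89, 2, 0.1, 0.1):.2f}x")
# Marlin fallback: costly verification, depressed acceptance -> MTP loses
print(f"marlin : {speedup(0.65, 2, 0.6, 0.1):.2f}x")
```

Any speedup below 1.0x means the drafting-and-rejecting overhead outweighs the accepted tokens, which is the regime the Marlin fallback lands in.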
The CUDA 13.0 Divide: compute_120f vs compute_120a
The community has partially hacked around this mess, but the solution reveals NVIDIA’s versioning chaos. Using CUDA 13.0 with compute_120f (the “f” suffix denotes family-specific features) instead of the architecture-specific compute_120a yields 39.0 tok/s with correct native FP4 output, compared to 14.6 tok/s under compute_120a.
But compute_120f requires CUDA 13.0+, while most stable distributions still ship CUDA 12.8. This creates a bifurcated ecosystem where your inference speed depends on whether you’re willing to run bleeding-edge nightlies. As past driver transitions have shown, NVIDIA tends to leave recent hardware in limbo until the rest of the ecosystem catches up.
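For builders, the difference comes down to which virtual architecture is fed to nvcc. The lines below are a sketch assuming the CUDA 13.0 toolchain accepts the family suffix in the same -gencode form it already uses for the “a” suffix; check nvcc --list-gpu-arch on your install before relying on them:

```shell
# CUDA 12.8: architecture-specific build (the slow path described above)
nvcc -gencode arch=compute_120a,code=sm_120a kernel.cu
# CUDA 13.0+: family-specific build (assumed flag form, shown for illustration)
nvcc -gencode arch=compute_120f,code=sm_120f kernel.cu
```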
The patches required to get even this far were extensive: 12 patches across FlashInfer and vLLM, modifying SM version checks, CuTe DSL architecture lists, and capability mappings. Brandon Music submitted these upstream (FlashInfer PR #2725, vLLM PR #36453), yet the CUTLASS issue #3096 filed with NVIDIA remains unanswered.
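Most of those patches amount to widening allow-lists that gate the CUTLASS path by compute capability. The sketch below is purely illustrative; the function name, set contents, and structure are hypothetical and do not match the actual vLLM or FlashInfer source:

```python
# Hypothetical sketch of the kind of capability gate the community
# patches modify. Names and the validated-SM set are illustrative,
# not the real vLLM/FlashInfer code.

CUTLASS_NVFP4_VALIDATED = {(10, 0), (12, 1)}  # SM100 (B200), SM121 (DGX Spark)

def pick_moe_backend(sm: tuple[int, int], patched: bool = False) -> str:
    """Choose a MoE GEMM backend for a given compute capability."""
    validated = set(CUTLASS_NVFP4_VALIDATED)
    if patched:
        validated.add((12, 0))   # the upstream PRs add workstation SM120
    if sm in validated:
        return "cutlass_fp4"     # native NVFP4 grouped GEMM
    return "marlin_w4a16"        # dequantizing fallback

print(pick_moe_backend((12, 0)))                # stock build
print(pick_moe_backend((12, 0), patched=True))  # with the community patches
```

The catch, of course, is that flipping the gate only helps once the underlying CUTLASS stage-count bug is fixed; otherwise the newly enabled path just crashes.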
The Economics of Broken Software
This isn’t just a technical inconvenience; it’s a financial hit to local AI labs. The RTX PRO 6000 is positioned as a $20,000 professional AI GPU. NVIDIA sells NVFP4 models for it. The inference path they designed for it doesn’t work on it. That’s not a software limitation; it’s a product defect.
The workaround requires forcing Marlin and disabling MTP:
export VLLM_MOE_FORCE_MARLIN=1
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --tensor-parallel-size 4 \
    --moe-backend marlin \
    --max-model-len 262144
This gets you 50.5 tok/s: impressive for a 397B-parameter model, and faster than most Llama 70B setups, but roughly half of what the hardware should deliver. For single-user workloads it’s usable. For anyone who bought four of these cards expecting to rival datacenter throughput, it’s a bitter pill that raises questions about the economic viability of current AI infrastructure spending.
The Broader Pattern: When Hardware Outpaces Software
This controversy fits a larger narrative of NVIDIA’s software ecosystem struggling to keep pace with its hardware releases. While open-source optimization frameworks like Unsloth scramble to support new architectures, NVIDIA’s own libraries lag behind, leaving users to debug template metaprogramming errors in CUTLASS just to get advertised features working.
The SM120 debacle also highlights the growing divergence between NVIDIA’s datacenter and consumer/workstation product lines. SM100 gets validated, optimized, and supported. SM120 gets broken kernels and radio silence. It’s a reminder that in NVIDIA’s ecosystem, not all Blackwells are created equal, and workstation users are increasingly second-class citizens in a world moving toward alternative high-performance computing architectures.
For now, if you’re running a 4x RTX PRO 6000 setup, you’re stuck with Marlin and 50 tok/s. The native FP4 performance is there in the silicon, trapped behind a software bug that NVIDIA hasn’t acknowledged, let alone fixed. Your $20,000 GPU is effectively running in compatibility mode, waiting for a kernel patch that may never come.
The Verdict: Until NVIDIA fixes the SMEM overflow in StageCountAutoCarveout and validates smaller tile shapes (128x64x128, 256x128x64) for SM120, these workstation Blackwell cards are expensive paperweights for anyone expecting to run NVFP4 MoE models at advertised speeds. The hardware is capable of 2-3x the performance, but competitive alternatives to the NVIDIA ecosystem are starting to look more attractive when the market leader can’t ship working math kernels for its own flagship products.




