
You just dropped twenty grand on NVIDIA's flagship RTX PRO 6000 Blackwell workstation, fired up Qwen3.5-397B, and watched it crawl at 55 tokens per second. Meanwhile, B200 datacenter benchmarks tease 300+ tok/s on the same model. The gap isn't just marketing fluff; it's a hard architectural wall that NVIDIA never bothered to mention in the spec sheet.
The culprit? A 129KB difference in shared memory that breaks CUTLASS at the compiler level, leaving workstation GPUs choking on MoE models while known CUTLASS instability issues plague the ecosystem.
The 99KB Wall Nobody Talked About
Datacenter Blackwell (B200/B300) ships with 228KB of shared memory per SM. Workstation Blackwell (RTX PRO 6000, RTX 5090, DGX Spark) gets 99KB. This isn't a "soft" limit or a driver quirk; it's a physical hardware constraint that breaks the default CUTLASS tile shapes for block-scaled GEMM operations.
```
Failed to initialize cutlass TMA WS grouped gemm
```
The K=128 tiles that CUTLASS generates for block-scaled operations require more than 99KB of SMEM. They compile fine, then overflow at runtime. The fallback? Slower generic kernels that leave 50-60% of your tensor cores idle.
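A rough back-of-envelope shows why the default tiles overflow. The sketch below is illustrative only: the 0.5-byte FP4 operands and FP8 scale factors match NVFP4, but the 6-stage pipeline depth and the one-scale-per-16-elements granularity are assumptions, not CUTLASS's actual accounting.

```python
# Back-of-envelope SMEM estimate for a block-scaled GEMM mainloop tile.
# Assumptions (not CUTLASS's real accounting): FP4 operands (0.5 B/elem),
# one FP8 scale factor per 16 elements, and a 6-stage pipeline.
SMEM_BUDGET = 99 * 1024  # workstation Blackwell (SM120)

def tile_smem_bytes(m, n, k, stages=6, sf_vec=16):
    a_bytes = m * k // 2                   # FP4 A tile
    b_bytes = n * k // 2                   # FP4 B tile
    sf_bytes = (m * k + n * k) // sf_vec   # FP8 scale factors for A and B
    return stages * (a_bytes + b_bytes + sf_bytes)

print(tile_smem_bytes(128, 128, 128))  # 110592 -> over the 99KB budget
print(tile_smem_bytes(128, 128, 64))   # 55296  -> fits
```

Under these assumptions a 128x128x128 tile lands at ~108KB while the K=64 variant needs ~54KB, which is the shape of the failure the error message above reports.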
The K=64 Bug Hunt
The obvious fix, K=64 tiles, should fit comfortably within 99KB. Except CUTLASS had a hardcoded assumption in the TMA scale factor layout that K ≥ 128.
Specifically, the scale factor block size (Blk_SF=4) assumes four scale factors along the K dimension, but K=64 only provides two. This creates a layout mismatch that fails during fill_tma_gmem_shape_stride with a uint64_t conversion error on nested tuples.
The issue sits in sm120_blockscaled_mma_builder.inl. The original code assumes scale factors align with the full block size, but SM120’s constrained SMEM requires folding excess scale factors into the basic block when K/SFVectorSize < Blk_SF.
The Patch: Folding Scale Factors

- Effective block size calculation: compute `EffBlk_SF = min(K/SFVectorSize, Blk_SF)` to clamp the scale factor block when K < 128.
- Basic block folding: when `EffBlk_SF > MMA_NSF`, fold the excess scale factors into `kBasicBlock` to keep TMA layouts flat and avoid the nested tuple conversion that crashes the compiler.
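In Python pseudocode, the clamp-and-fold logic looks roughly like this. `SF_VEC_SIZE = 32` and `MMA_NSF = 2` are assumed values, chosen so the numbers match the article (K=64 yielding two scale factors against Blk_SF=4); the real constants live in sm120_blockscaled_mma_builder.inl.

```python
BLK_SF = 4        # scale factors per basic block along K (from the article)
SF_VEC_SIZE = 32  # assumed: elements covered per scale factor
MMA_NSF = 2       # assumed: scale factors consumed per MMA step

def eff_blk_sf(tile_k):
    # Clamp: K=64 provides only 64/32 = 2 scale factors, not Blk_SF=4.
    return min(tile_k // SF_VEC_SIZE, BLK_SF)

def folded_basic_block(tile_k, k_basic_block=1):
    eff = eff_blk_sf(tile_k)
    # Fold excess scale factors into the basic block so the TMA layout
    # stays flat instead of producing a nested tuple.
    if eff > MMA_NSF:
        k_basic_block *= eff // MMA_NSF
    return k_basic_block

print(eff_blk_sf(128), eff_blk_sf(64))  # 4 2
```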
This enables three new tile shapes for SM120: [128,128,64], [128,256,64], and [256,128,64]. The autotuner can now select K=64 tiles that fit within the 99KB budget, unlocking the hardware’s actual throughput.
The changes propagate through FlashInfer’s kernel generation pipeline (generate_kernels.py), which filters CTA shapes for SM120 compatibility, and the dispatch headers (moe_gemm_template_dispatch_tma_ws.h) that route operations to the correct tile implementations.
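A hypothetical version of that CTA-shape filter might look like the following. The shape list and the SMEM model are illustrative assumptions; generate_kernels.py's actual logic differs.

```python
SMEM_BUDGET_SM120 = 99 * 1024

CANDIDATE_CTA_SHAPES = [
    (128, 128, 128), (128, 256, 128), (256, 128, 128),  # K=128: datacenter only
    (128, 128, 64), (128, 256, 64), (256, 128, 64),     # K=64: new for SM120
]

def smem_estimate(m, n, k, stages=6, sf_vec=16):
    # Rough FP4-operand + FP8-scale-factor model (an assumption).
    return stages * (m * k // 2 + n * k // 2 + (m + n) * k // sf_vec)

def sm120_shapes(shapes):
    # Keep only CTA shapes whose mainloop fits the 99KB budget.
    return [s for s in shapes if smem_estimate(*s) <= SMEM_BUDGET_SM120]

print(sm120_shapes(CANDIDATE_CTA_SHAPES))  # keeps only the three K=64 shapes
```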
Benchmarks vs. Reality
On 4x RTX PRO 6000 Blackwell (96GB GDDR7 each) running Qwen3.5-397B-A17B-NVFP4 with TP=4 and MTP=5, the results look dramatic:
| Users | Before (tok/s) | After (tok/s) | Improvement |
|---|---|---|---|
| 1 | 142 | 283 | +99% |
| 4 | 250 | 850 | +240% |
| 8 | 510 | 1,283 | +151% |
But there's a catch. That 283 tok/s figure comes with thinking mode enabled on trivial prompts, where the model generates <think> tags with near-100% MTP acceptance. In real-world usage (essays, code generation, detailed explanations) with thinking disabled, single-user throughput settles around 130-136 tok/s.
The K=64 patch still delivers a genuine 20-25% improvement over pre-patch baselines on identical hardware. More importantly, it unblocks multi-user scenarios where the system throughput scales to 1,624 tok/s at 16 concurrent users, territory that was previously impossible due to the SMEM bottleneck.
| Scenario | 1-User tok/s | Notes |
|---|---|---|
| Short prompt, thinking ON | 283 | MTP inflated by trivial think tokens |
| Real prompt, thinking OFF | ~130-136 | Actual usable throughput |
| Pre-patch baseline | ~110 | Same hardware, no K=64 fix |
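The inflation is easy to model. With MTP drafting k speculative tokens and a per-token acceptance probability a, each target-model forward pass emits on average 1 + a + a^2 + ... + a^k tokens. A hedged sketch (this ignores draft-model and verification overhead, so it only bounds the speedup):

```python
def mtp_tokens_per_step(accept_rate, num_spec=5):
    # 1 guaranteed token plus the speculative prefix that gets accepted.
    return 1 + sum(accept_rate ** i for i in range(1, num_spec + 1))

print(mtp_tokens_per_step(1.0))  # 6.0 -> trivial <think> runs saturate MTP
print(mtp_tokens_per_step(0.5))  # ~1.97 -> realistic prompts gain far less
```

At near-100% acceptance, MTP=5 can multiply per-step output several times over; at more realistic acceptance rates the gain collapses toward the ~130-136 tok/s figures above.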
Context length scaling shows the expected degradation as KV cache pressure mounts:
| Input Context | tok/s |
|---|---|
| ~128 tokens | 283 |
| 1K | 277 |
| 4K | 247 |
| 16K | 183 |
| 32K | 141 |
Deployment: Docker and Threadripper Quirks
For those looking to replicate this without compiling CUTLASS from source, a pre-built Docker image packages the fix:
```bash
docker pull verdictai/vllm-blackwell-k64:latest

docker run -d --name vllm --gpus all --ipc host --shm-size 32g \
  -p 9200:8000 \
  -v /path/to/sehyo-qwen35-nvfp4:/model:ro \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  verdictai/vllm-blackwell-k64:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model --served-model-name qwen3.5-397b-nvfp4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --tensor-parallel-size 4 --gpu-memory-utilization 0.85 \
    --max-model-len 262144 --enable-prefix-caching \
    --reasoning-parser qwen3 --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
```
Threadripper users face additional hardware interconnect limitations specific to AMD platforms. The AMD-Vi IOMMU causes page faults with GPU P2P transfers, requiring either NCCL_P2P_DISABLE=1 or kernel parameter iommu=pt (pass-through mode). Some motherboards require both.
Other critical environment variables for SM120 stability:
- `OMP_NUM_THREADS=6` (not 24; avoids oversubscription with TP=4)
- `CUDA_DEVICE_MAX_CONNECTIONS=32`
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- Driver 595+ from the NVIDIA CUDA repo (`sudo apt install nvidia-open`)
Why This Matters for Workstation AI
NVIDIA's decision to ship workstation Blackwell with less than half the SMEM of datacenter parts isn't just a product segmentation strategy; it's a fundamental compatibility break. The NVIDIA-supported optimization tools ecosystem largely assumes datacenter memory budgets, leaving workstation users to discover these constraints through cryptic CUTLASS errors and failed autotuning.
The FlashInfer PR #2786 (and related CUTLASS issue #3096) represents more than a performance tweak; it's a necessary compatibility layer for anyone running quantized MoE models on consumer or workstation Blackwell hardware. Without it, models like Qwen3.5-397B and DeepSeek-V3 remain effectively broken on RTX PRO 6000 and RTX 5090 GPUs, regardless of how much VRAM you have.
For developers building local AI infrastructure, this patch validates a critical assumption: workstation hardware can achieve datacenter-class throughput on modern MoE architectures, provided you're willing to rebuild the kernel stack around the 99KB constraint. The 20-25% raw performance gain is nice, but the real win is functional compatibility: finally being able to use the hardware you paid for without falling back to slow generic kernels.
The upstream fix is currently under review. Until it lands in mainline FlashInfer, the custom K=64 kernel remains the only path to unlock Blackwell’s full potential on workstation GPUs.




