4 articles found

1000 Tokens Per Second on a 1T Model? Xiaomi Just Broke Physics (or At Least the Latency Barrier)

Xiaomi’s MiMo v2.5 hits 1000 TPS on a trillion-parameter model using commodity GPUs. Here’s the deep dive on the FP4 quantization, DFlash speculative decoding, and TileRT systems alchemy that made it possible.

#distributed systems #Inference Optimization #Mixture of Experts...

Inference Optimization

Multi-Token Prediction Lands in llama.cpp: Nearly 2× Faster Generation, but Prompt Processing Is Paying the Price

MTP support is now in llama.cpp mainline, delivering up to 71% faster token generation for local models. We break down the benchmarks, the prompt processing trade-offs, and how to actually enable it.

#Inference Optimization #Local LLM #MTP...

Blackwell

Blackwell’s 99KB Cage: How One Developer Jailbroke Qwen3.5 Performance with a 64-Line Kernel Patch

Technical deep dive into unlocking 2x inference speed on RTX PRO 6000 Blackwell GPUs by fixing CUTLASS SMEM overflow bugs for MoE models.

#blackwell #cuda #CUTLASS...

Fine-tuning

The Qwen Brain Drain: Why Alibaba’s Loss Is Your Local Inference Gain

Alibaba’s Qwen team is imploding just as it released its best models yet. Here’s how to exploit the chaos using Unsloth to fine-tune Qwen3.5 on consumer hardware.

#Fine-tuning #Inference Optimization #qwen...