Tagged with

6 articles found

Xiaomi’s 300B Model Just Got a Secret Speed Hack, DFlash Is the Real Deal

Xiaomi quietly dropped MiMo-V2.5-DFlash on Hugging Face: a 311B-parameter model with block diffusion speculative decoding that could double your inference speed. Here’s what the community is finding.

#DFlash #LLM Inference #mimo...

AI Efficiency

DeepSeek DSpark: The 85% Speed Hack That Makes Your GPU Look Lazy

DeepSeek’s DSpark speculative decoding framework delivers 60-85% faster inference on V4 models. Here’s how it works, the real-world numbers, and why it matters for anyone serving LLMs.

#AI Efficiency #deepseek #dspark...

distributed systems

1000 Tokens Per Second on a 1T Model? Xiaomi Just Broke Physics (or At Least the Latency Barrier)

Xiaomi’s MiMo v2.5 hits 1000 TPS on a trillion-parameter model using commodity GPUs. Here’s the deep dive on the FP4 quantization, DFlash speculative decoding, and TileRT systems alchemy that made it possible.

#distributed systems #Inference Optimization #Mixture of Experts...

gemma 4

Gemma 4 MTP Just Landed in llama.cpp, And It’s Turning 12GB GPUs Into Speed Demons

The merge of Gemma 4 MTP support into llama.cpp b9549 enables speculative decoding that doubles local inference speeds on consumer hardware. Real benchmarks from the community reveal surprising caveats.

#gemma 4 #MTP #qat...

local AI

Llama.cpp’s MTP Beta Is Stealing vLLM’s Lunch

The new Medusa-style MTP support in llama.cpp beta isn’t just catching up; it threatens to rewrite the economics of local model serving.

#local AI #MTP #Speculative Decoding...

artificial intelligence

The Death of Cloud AI? Local 27B Models Rival Frontiers

Qwen 3.6 27B on consumer hardware is disrupting the SaaS subscription model. Here’s how, and why it’s a warning sign for cloud AI.

#artificial intelligence #local AI #qwen...