BANANDRE
NO ONE CARES ABOUT CODE


Tagged with

#Speculative Decoding

4 articles found


1000 Tokens Per Second on a 1T Model? Xiaomi Just Broke Physics (or At Least the Latency Barrier)

Xiaomi’s MiMo v2.5 hits 1000 TPS on a trillion-parameter model using commodity GPUs. Here’s the deep dive on the FP4 quantization, DFlash speculative decoding, and TileRT systems alchemy that made it possible.

#distributed systems #Inference Optimization #Mixture of Experts...

Gemma 4 MTP Just Landed in llama.cpp, And It’s Turning 12GB GPUs Into Speed Demons

The merge of Gemma 4 MTP support into llama.cpp b9549 enables speculative decoding that doubles local inference speeds on consumer hardware. Real benchmarks from the community reveal surprising caveats.

#gemma 4 #MTP #qat...

Llama.cpp’s MTP Beta Is Stealing vLLM’s Lunch

The new Medusa-style MTP support in the llama.cpp beta isn’t just catching up; it threatens to rewrite the economics of local model serving.

#local AI #MTP #Speculative Decoding...

The Death of Cloud AI? Local 27B Models Rival Frontiers

Qwen 3.6 27B on consumer hardware is disrupting the SaaS subscription model. Here’s how, and why it’s a warning sign for cloud AI.

#artificial intelligence #local AI #qwen...

2026 BANANDRE