Tagged with

3 articles found

Llama.cpp’s MTP Merge Tanks Throughput on Constrained VRAM. Here’s How a Community Fork Pushes 110 tok/s on a 12GB Card.

After llama.cpp’s MTP merge caused a 20% performance regression, ik_llama.cpp brings back 110 tok/s for local Qwen3.6 inference on constrained VRAM.

#ik_llama.cpp #MTP #qwen3.6 ...

72.9 tok/s on 24GB VRAM: How ik_llama.cpp Won the Qwen 3.6 27B Backend War

A detailed technical comparison of llama.cpp, ik_llama.cpp, BeeLlama, and vLLM for running Qwen 3.6 27B on 24GB VRAM, achieving up to 72.9 tok/s decode with specific quantizations.

#ik_llama.cpp #LLM Inference #Local LLM ...

The Fork That Finally Forked Back: llama.cpp Adopts ik_llama’s Secret Quantization Sauce

A controversial PR ports advanced IQ*_K quantization methods from the ik_llama.cpp fork into mainline llama.cpp, promising smaller models and better edge performance, but not without drama over code ownership and MIT license politics.

#ik_llama.cpp #llama.cpp #model-compression ...