Tagged with

5 articles found

NVFP4 Is Not What You Think: NVIDIA’s Qwen3.6-27B Quantization Actually Beats FP8

NVIDIA’s Qwen3.6-27B-NVFP4 squeezes a 27B model into 22GB while matching, and sometimes beating, FP8 accuracy. Here’s how the quantization magic works and why it matters for local LLM deployment.

#blackwell #Local LLM #NVFP4...

ik_llama.cpp

Llama.cpp’s MTP Merge Tanks Throughput on Constrained VRAM. Here’s How a Community Fork Pushes 110 tok/s on a 12GB Card.

After llama.cpp’s MTP merge caused a 20% performance regression, ik_llama.cpp brings back 110 tok/s for local Qwen3.6 inference on constrained VRAM.

#ik_llama.cpp #MTP #qwen3.6...

abliteration

Abliteration Autopsy: 85 GPU-Hours of Forensics Reveal Which Safety Removal Actually Works

An open-source toolkit compared five abliteration methods on Qwen3.6-27B. The data exposes which techniques preserve capability, which destroy it, and why one popular method is built on stolen code.

#abliteration #LLM Safety #model alignment...

Inference Optimization

Multi-Token Prediction Lands in llama.cpp: Nearly 2× Faster Generation, but Prompt Processing Is Paying the Price

MTP support is now in llama.cpp mainline, delivering up to 71% faster token generation for local models. We break down the benchmarks, the prompt processing trade-offs, and how to actually enable it.

#Inference Optimization #Local LLM #MTP...

AI Architecture

The 300-Agent Reality Check: Why Cloud-First AI Architectures Are Collapsing

Kimi K2.6 and Qwen3.6 are rewriting the rules of AI infrastructure. Here’s why your API-dependent stack can’t handle 4,000 coordinated agent steps, and what to build instead.

#AI Architecture #Kimi K2.6 #local LLMs...