For years, we’ve been playing the quantization lottery. You download a Q4_K_M file from your favorite quantizer, fire up llama.cpp, and pray the output doesn’t look like a Markov chain trained on Lorem Ipsum. The filename promises consistency: surely all Q4_K_M quants are created equal? Wrong. The community just got a brutal reminder that quantization is part science, part dark art, and entirely too important to leave to vibes.
The wake-up call came from a meticulous analysis of Qwen3.5-35B-A3B Q4 variants, where the same quantization label concealed performance swings that would make a crypto day trader nervous. We’re not talking about marginal differences. We’re talking about KL Divergence scores ranging from 0.0102 to 0.0524, a 5x spread in faithfulness to the original model. That’s not a rounding error; that’s a different model wearing the same name tag.
The Metrics That Matter: KLD vs. PPL
Before diving into the drama, let’s get technical. KL Divergence (KLD) measures how much your quantized model’s probability distribution has drifted from the baseline. Think of it as a GPS tracking how far your model has wandered off the original path. Lower is better, and it’s dataset-agnostic, it compares distributions directly, not outputs. Perplexity (PPL), meanwhile, measures the model’s uncertainty when predicting the next token. It’s derived from cross-entropy and gives you a sense of overall information loss.
Here’s the critical distinction: PPL can be gamed by luck. A quantized model might stumble into slightly better predictions on a specific test set by pure chance, showing artificially low perplexity. KLD doesn’t play that game. It measures the structural fidelity of your quantization, making it the more reliable metric for comparing recipes. As the benchmark author notes, “Since we are trying to see how much information we’ve lost and since PPL is noisy as it can get a better score by pure luck, KLD is better as it is not relying on the dataset but on the baseline.”
The relationship between them reveals deeper insights. High PPL with low KLD suggests the model is fundamentally sound but might need fine-tuning. High KLD with low PPL? You’ve got a distribution mismatch that’s probably going to bite you in production.
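To make the distinction concrete, here is a minimal NumPy sketch of both metrics. This is just the math, not llama.cpp’s actual implementation (which operates over full logit dumps from the baseline run): KLD compares the two models’ full next-token distributions, while PPL only looks at the probability each model gave to the token that actually occurred.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Mean per-token KL divergence D_KL(P || Q) between the baseline
    model's next-token distributions (p) and the quantized model's (q).
    Shapes: (num_tokens, vocab_size); each row sums to 1."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def perplexity(probs_of_true_token, eps=1e-12):
    """PPL = exp(mean negative log-likelihood of the actual next tokens).
    Input: 1-D array of the probability a model assigned to the token
    that really came next. Depends on the test set, unlike KLD."""
    p = np.clip(probs_of_true_token, eps, None)
    return float(np.exp(-np.mean(np.log(p))))
```

Note what falls out of the definitions: a quant identical to its baseline gets KLD exactly 0 regardless of the dataset, while its PPL can still be anything, because PPL measures difficulty of the text, not drift from the baseline.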
The Qwen3.5 Bloodwork: A Tale of Two Q4_K_Ms
The benchmarking data tells a story that should make any serious practitioner uncomfortable. Let’s look at the actual numbers from the Q4 sweep:
| Quantization | Size (GiB) | PPL Score | KLD Score |
|---|---|---|---|
| AesSedai_Qwen3.5-35B-A3B-Q4_K_M | 20.62 | 6.436887 | 0.010214 |
| bartowski_Qwen3.5-35B-A3B-Q4_K_M | 19.77 | 6.491274 | 0.018878 |
| unsloth_Qwen3.5-35B-A3B-Q4_K_M | 19.75 | 6.518045 | 0.023362 |
| lmstudio_Qwen3.5-35B-A3B-Q4_K_M | 19.71 | 6.543927 | 0.032892 |
Same label. Vastly different DNA. The AesSedai variant is over 3x more faithful than the lmstudio version, despite being barely a gigabyte larger. This isn’t just academic: when you’re running a 35B parameter MoE, that fidelity gap translates directly into output quality, reasoning consistency, and fewer of those maddening “wait, that doesn’t make sense” moments.
The secret sauce? Tensor protection strategies. AesSedai’s recipe consistently protects always-active tensors (attention mechanisms, shared experts) at Q8_0 precision while differentiating between ffn_down_exps and ffn_gate/up_exps. It’s surgical. Meanwhile, some recipes treat all tensors as equal, bulldozing critical pathways with aggressive quantization that looks good on file size but murders performance.
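A tensor-protection recipe of the kind described above boils down to mapping tensor name patterns to quant types. Here is a hypothetical sketch in that spirit; the patterns and type assignments are illustrative only, not AesSedai’s actual script, though the tensor names follow llama.cpp’s GGUF naming conventions:

```python
import re

# Hypothetical recipe: protect always-active tensors at Q8_0, give the
# more sensitive routed down-projection experts a notch more precision
# than the gate/up experts. Type choices here are assumptions.
RECIPE = [
    (r"attn_(q|k|v|output)\.weight", "Q8_0"),  # attention: always active
    (r"ffn_.*_shexp\.weight",        "Q8_0"),  # shared experts: always active
    (r"ffn_down_exps\.weight",       "Q5_K"),  # routed down-proj: sensitive
    (r"ffn_(gate|up)_exps\.weight",  "Q4_K"),  # routed gate/up: tolerant
]
DEFAULT = "Q4_K"

def quant_type_for(tensor_name: str) -> str:
    """Return the quant type this recipe assigns to a tensor name."""
    for pattern, qtype in RECIPE:
        if re.search(pattern, tensor_name):
            return qtype
    return DEFAULT
```

The point of the structure is the asymmetry: a flat recipe applies one line of this table to everything, while a surgical one spends its precision budget where every token passes through.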
The XL Debacle: When Dynamic Goes Wrong
If the variance between Q4_K_Ms was a yellow flag, the Unsloth UD-Q4_K_XL saga was a five-alarm fire. Marketed as a “dynamic” quant promising the best of both worlds, it instead delivered a masterclass in how not to apply MXFP4.
The recipe mistakenly injected MXFP4 quantization into attention tensors and expert pathways where it absolutely doesn’t belong. MXFP4 is a specialist tool, best suited for routed experts (ffn_(gate|down|up)_exps) during quantization-aware training. Applying it post-hoc to a BF16 model, especially in attention layers, is like using a sledgehammer for brain surgery.
The result? A KLD score of 0.0524, the worst in the entire sweep despite not even being the smallest file. That’s a 5x degradation compared to the best Q4_K_M. The community quickly spotted the anomaly, with perplexity scores jumping nearly 0.2 points higher than competitors. The Hugging Face discussion thread reads like a post-mortem: “I recently switched to using MXFP4, but as you noted my script most likely had some issues somewhere,” admitted the quantizer.
This is precisely why we need standardized benchmarking. Without KLD and PPL data, users would have downloaded UD-Q4_K_XL, seen the “XL” label, assumed superiority, and spent weeks wondering why their model suddenly developed a stutter. Instead, the data exposed the flaw within hours.
Running Your Own Bloodwork: The Commands
Enough theory. Here’s how to actually measure this stuff yourself, because relying on community benchmarks is a stopgap, not a solution.
First, generate baseline logits from your BF16 model:
llama-perplexity -m <bf16_model> -f wiki.test.raw --kl-divergence-base <file_name> [other parameters]
Then measure your quantized candidate:
llama-perplexity -m <quantized_model> --kl-divergence-base <file_name> --kl-divergence [other parameters]
The benchmark used wikitext2_test.txt at 512 token context with -ncmoe 22 and -ngl 999. But here’s a pro tip from the discussion: imatrix contamination is real. If your quantizer used wikitext2 in their calibration data (and many do), you’re measuring on training data, which skews results. For true validation, construct a fresh corpus from material the model wasn’t explicitly calibrated on: podcast transcripts, recent news, anything outside the usual calibration sets. One commenter offered to share a “secret corpus” for exactly this purpose.
The Efficiency Score: Size vs. Fidelity
File size isn’t everything, but it’s not nothing either. The benchmark introduced an “Efficiency Score” that calculates the distance to a theoretical perfect model (zero size, zero KLD). The formula: √(Normalized Size² + Normalized KLD²). Lower is better.
The top performers reveal interesting tradeoffs:
| Rank | Quantization | Size (GiB) | KLD Score | Eff. Score |
|---|---|---|---|---|
| 1 | AesSedai_IQ4_XS | 16.40 | 0.024036 | 0.327342 |
| 6 | bartowski_Q4_K_S | 19.04 | 0.021415 | 0.679213 |
| 15 | AesSedai_Q4_K_M | 20.62 | 0.010214 | 1.000000 |
Notice how IQ4_XS dominates the efficiency ranking despite higher KLD than Q4_K_M. It’s 4GB smaller, enough to fit another small model in memory alongside it. This is the fundamental tension: absolute fidelity versus operational efficiency. For a chatbot on a 24GB GPU, that 4GB savings might be the difference between loading a 4-bit model with full context or hitting swap.
The reference implementation used AesSedai Q4_K_M as the baseline (score = 1.0). Anything below 1.0 offers a better tradeoff, anything above sacrifices too much for the size reduction. This gives you a quantitative framework for decisions that used to be pure guesswork.
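The scoring above can be sketched in a few lines. One caveat: the benchmark doesn’t spell out its exact normalization, so this sketch assumes min-max normalization of each axis across the sweep, rescaled so the reference quant lands at exactly 1.0. It reproduces the ranking logic, not the published scores.

```python
import math

def efficiency_scores(quants, reference):
    """quants: {name: (size_gib, kld)}. Score = Euclidean distance to
    the ideal point (zero size, zero KLD) after min-max normalizing
    each axis across the sweep, rescaled so `reference` scores 1.0.
    NOTE: the benchmark's exact normalization isn't published; this
    min-max variant is an assumption for illustration."""
    sizes = [s for s, _ in quants.values()]
    klds = [k for _, k in quants.values()]
    s_min, s_max = min(sizes), max(sizes)
    k_min, k_max = min(klds), max(klds)

    def raw(size, kld):
        ns = (size - s_min) / (s_max - s_min)
        nk = (kld - k_min) / (k_max - k_min)
        return math.hypot(ns, nk)  # sqrt(ns**2 + nk**2)

    ref_raw = raw(*quants[reference])
    return {name: raw(s, k) / ref_raw for name, (s, k) in quants.items()}
```

Feed it the table’s sizes and KLD scores and anything below 1.0 beats the reference tradeoff, anything above it loses more fidelity than the size reduction is worth.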
The Community Dynamics: Transparency vs. Entitlement
Some users demand quantizers publish KLD scores by default, treating it as a basic quality assurance step. Others push back, reminding everyone that quantizers are volunteers doing “us a favor in the first place.”
The reality is more nuanced. Yes, quantizers are heroes. But as quantization techniques grow more complex in pursuit of preserved performance, the gap between “works” and “works well” widens. A simple Q4_0 from 2022 is straightforward. A dynamic MoE quant with tensor-specific recipes in 2026 is a research contribution that demands validation.
The community is essentially saying: We love your work, but we can’t trust it without data. And they’re right. When advanced quantization methods in llama.cpp get adopted upstream, the complexity increases exponentially. What used to be a simple file size vs. quality tradeoff is now a multi-dimensional optimization problem involving tensor types, calibration datasets, and architecture-specific heuristics.
Hardware Realities: Speed vs. Quality
The benchmarking focused on fidelity, but production deployments must also consider throughput. One user contributed speed benchmarks on AMD Radeon via Vulkan, revealing that the “best” quant for quality might be middling for performance:
| Quant Type | PP TOPS | TG TOPS |
|---|---|---|
| iq2_xs | 1.52 | 0.22 |
| iq4_nl | 1.50 | 0.13 |
| mxfp4 | 1.47 | 0.13 |
| q4_K | 1.40 | 0.11 |
The pattern is clear: more aggressive quantization generally means faster token generation, but at a fidelity cost. The MXFP4 format shows decent speed but trails in quality when misapplied. For MoE architectures targeting consumer devices, this tradeoff is existential: you’re already juggling sparse activation patterns and memory bandwidth constraints.
This is why evaluating models before investing in hardware is critical. You might benchmark on an RTX 4090 with 24GB, deploy on a laptop with 16GB, and discover your “fast” quant is now thrashing swap. The fidelity-speed-size triangle has no universal optimum, only context-specific sweet spots.
The MXFP4 Controversy: Right Tool, Wrong Job
Let’s address the elephant in the room. MXFP4 isn’t inherently bad. When applied where it belongs, during quantization-aware training of routed experts, it’s a powerful tool for wringing out efficiency beyond what parameter count alone allows. The problem is its misuse in post-training quantization of attention weights.
The Unsloth team acknowledged the issue: “I recently switched to using MXFP4, but as you noted my script most likely had some issues somewhere.” The bug propagated across multiple models, affecting not just Qwen3.5 but also Qwen3 Coder Next and the 122B-A10B variant. For the 122B model, users reported “garbled text or repetition”, symptoms of catastrophic distribution drift.
This highlights a systemic risk: as quantization recipes become more “intelligent” and dynamic, the potential for subtle bugs increases. A static Q4_0 is hard to screw up. A recipe that conditionally applies different quant types to different tensor classes based on activation patterns? That’s a Python script away from disaster.
The Path Forward: Standardization or Chaos
The community is at an inflection point. The current system, where each quantizer maintains their own recipes, calibration data, and quality standards, doesn’t scale. Users are left comparing apples to oranges, and as this benchmark shows, sometimes the orange is actually a potato.
Several paths forward emerge:
- Mandatory Metadata: Quantizers could embed KLD/PPL scores directly in GGUF metadata or READMEs. This requires community buy-in and tooling support from platforms like Hugging Face.
- Standardized Benchmark Suites: A community-maintained evaluation harness using fresh corpora, run on diverse hardware. This prevents imatrix contamination and provides apples-to-apples comparisons.
- Federated Testing: Users run benchmarks on their own hardware and contribute results to a central database. This scales better than relying on a few volunteers with RTX 3060s.
- Upstream Integration: As GGML joins Hugging Face, first-party quantizations could come with guaranteed quality metrics and CI validation.
The last option is most promising. When the maintainers of the format also produce the quantizations, the feedback loop tightens. Bugs like the MXFP4 injection get caught before release, not after community outcry.
Practical Takeaways: What You Should Do Today
- Stop trusting filenames blindly. A Q4_K_M from 2024 is not a Q4_K_M from 2026. Recipes evolve, and not always for the better.
- Run your own KLD benchmarks on models you depend on. It’s slow, expect hours for a full sweep on consumer hardware, but it’s the only way to know for sure. Use fresh data, not wikitext2 if you suspect contamination.
- Prioritize KLD over PPL for recipe comparison. Use PPL as a sanity check, but trust KLD for structural fidelity.
- Size isn’t everything. That 16GB IQ4_XS might serve you better than a 20GB Q4_K_M if you’re memory-constrained. Check the efficiency scores.
- Watch for MXFP4 in attention layers. If you’re using Unsloth’s dynamic quants from early 2026, verify tensor types. The issue is acknowledged and being fixed, but old files linger.
- Contribute back. If you have the hardware, run benchmarks and share them. The community needs more data, not more opinions.
The Bottom Line
Quantization fidelity benchmarking with KLD and PPL isn’t just a nice-to-have, it’s becoming a requirement for serious local LLM deployment. As practical limits of quantized models on low-end hardware become more apparent, and as model architectures grow more complex, the gap between “it runs” and “it runs well” widens.
The Qwen3.5 analysis proves that we can no longer afford to treat quantization as a solved problem. Every new model architecture, every novel quant type, every “dynamic” recipe introduces unknowns that only rigorous measurement can resolve. The “vibes-based” era is over. Welcome to the bloodwork era.
Your move, quantizers. Show us the numbers.