When a researcher gets banned from a Discord server for publishing benchmark results, you know the stakes are higher than usual. NathanDreamFast didn’t set out to start a war. He just wanted to verify a claim that sounded too good to be true: that HauhauCS’s “aggressive” abliteration technique produces “the best lossless uncensored models out there” with “no changes to datasets or capabilities.”
What followed was a week-long forensic deep-dive into the weights themselves, converting GGUFs back to safetensors, running HarmBench evaluations, calculating KL divergence, and mapping edit vectors across tensor subspaces. The results? Abliteration isn’t magic, and “lossless” is a marketing term that crumbles under statistical scrutiny.
The Forensic Setup: Reverse Engineering the Black Box
The investigation targeted five models: Qwen3.5-2B, 3.5-4B, 3.5-9B, 3.5-27B, and Qwen3-4B-Instruct-2507. The first four use Qwen’s hybrid Mamba2+Transformer architecture (24-64 layers mixing state-space models with standard attention), while the Qwen3-4B serves as a pure Transformer control. Three abliteration techniques went under the microscope: Heretic (the surgical pioneer), HauhauCS Aggressive (the broad modifier), and Huihui (the inconsistent contender).
Methodology was rigorous: lm-evaluation-harness for capability retention, HarmBench for safety evaluation, and full-vocab KL divergence on first-token logits to measure distributional drift. Hardware was serious: RTX 5090s and 4090s running vLLM at bfloat16.
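The first-token KL metric is worth pausing on, since it anchors most of the numbers below. A minimal numpy sketch of how such a measurement might look, assuming logits have already been collected from both models as (num_prompts × vocab_size) arrays; the aggregation into batch-mean, median, and max mirrors the columns reported in this write-up, but the function name and details here are illustrative, not the investigator's actual code:

```python
import numpy as np

def first_token_kl(base_logits, edited_logits):
    """Full-vocab KL(base || edited) on first-token logits, in nats.

    Both inputs: (num_prompts, vocab_size) arrays of raw logits.
    Returns (batchmean, median, max) over per-prompt divergences.
    """
    def log_softmax(x):
        # Numerically stable log-softmax along the vocab axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(np.asarray(base_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(edited_logits, dtype=np.float64))
    # Per-prompt KL: sum over vocab of p * (log p - log q).
    per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return per_prompt.mean(), np.median(per_prompt), per_prompt.max()
```

An unedited model compared against itself scores exactly zero on all three aggregates; anything above roughly 0.1 nats batch-mean is already a visible behavioral shift.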
The debate over norm-preserving ablation as a way to unlock restricted models has been raging for months, but this is the first time we’ve seen surgical forensics applied to the weights themselves rather than just prompt engineering.
The 4B Catastrophe: When Abliteration Destroys the Model
If you want to see how badly abliteration can go wrong, look no further than the Qwen3.5-4B results. Huihui’s approach, competitive on the 2B model, absolutely imploded at this scale. KL divergence hit 3.6506, two orders of magnitude above its 0.044 score on the 2B variant. For context, Heretic scored 0.0404 and HauhauCS 0.0217 on the same 4B model.
| Variant | KL Batchmean | KL Median | KL Max |
|---|---|---|---|
| Heretic | 0.0404 | 0.0197 | 0.2891 |
| HauhauCS | 0.0217 | 0.0093 | 0.1205 |
| Huihui | 3.6506 | 3.5469 | 7.3110 |
That 3.65 nats of divergence isn’t a gentle nudge; it’s a fundamental rewiring. MMLU crashed from 74.38 to 68.48 (below the 70 threshold), ARC-Challenge dropped 7.17 points, and WinoGrande shed 5.92 points. Huihui achieved a perfect 100% ASR (Attack Success Rate) on safety tests, but the model was effectively lobotomized in the process.
The arms race between AI safety engineers and jailbreak researchers has produced increasingly sophisticated techniques, but Huihui’s 4B failure suggests that simple scaling without architectural awareness is a recipe for disaster.
The Scaling Penalty: Bigger Models, Bigger Damage
HauhauCS’s marketing claims don’t survive contact with the 27B parameter reality. While the 2B model showed manageable losses (TruthfulQA down 2.17 points), the damage scales non-linearly:
- 2B: TruthfulQA -2.17, GSM8K +0.30 (actually improved)
- 4B: TruthfulQA -3.67, GSM8K -2.58
- 9B: TruthfulQA -8.0, GSM8K -2.65
- 27B: TruthfulQA -8.2, MMLU -1.9, HellaSwag -1.4
On the 27B, the most safety-aligned base model in the study with a 99.5% refusal rate (398/400 HarmBench items), HauhauCS’s broad-spectrum modification hit hardest. The technique modified 195 tensors across 8 tensor types, compared to Heretic’s surgical 89 tensors across 3 types. The result? HauhauCS achieved 100% ASR but with the worst capability retention in the study.
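Those tensor counts come from diffing the edited weights against the base checkpoint. A sketch of how that tally can be reproduced, assuming both state dicts have been loaded as name-to-array mappings; the grouping convention (second-to-last name component, e.g. `o_proj` from `model.layers.0.self_attn.o_proj.weight`) is my assumption about what "tensor type" means here:

```python
import numpy as np

def modified_tensor_report(base_sd, edited_sd, atol=1e-6):
    """Count tensors that differ between two state dicts, grouped by
    tensor type. Inputs are dicts mapping parameter name -> array.
    `atol` is a hypothetical tolerance for 'unchanged'."""
    by_type = {}
    for name, w in base_sd.items():
        w2 = edited_sd.get(name)
        if w2 is None or np.allclose(w, w2, atol=atol):
            continue
        # Assumed naming convention: ...<type>.weight
        ttype = name.split(".")[-2]
        by_type[ttype] = by_type.get(ttype, 0) + 1
    return sum(by_type.values()), by_type
```

Run against Heretic’s 27B output, a report like this would show edits clustered in 3 types; against HauhauCS’s, spread across 8.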
Heretic, by contrast, uniquely improved GSM8K by 7.7 points on the 27B while maintaining 99.8% ASR. The data suggest that norm-preserving, biprojected abliteration approaches preserve the model’s reasoning capabilities while still neutralizing refusal behaviors.
The LoRA Fingerprint and Derivation Claims
Perhaps the most damning forensic finding concerns HauhauCS’s methodology on the pure Transformer Qwen3-4B. Weight analysis revealed exactly 253 modified tensors, matching the count from a standard PEFT LoRA configuration targeting all 7 linear projections across 36 layers plus embeddings (7×36+1=253). Of those 253, only ~50 carried real edits. The remaining 203 were GGUF save noise from near-zero LoRA adapters baked in during merge.
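The fingerprint rests on two checks: the tensor-count arithmetic, and separating substantive edits from near-zero adapter noise. A sketch of both, where the norm threshold is a hypothetical cutoff of mine, not a value from the original analysis:

```python
import numpy as np

# The fingerprint arithmetic: a standard PEFT LoRA config targeting all
# 7 linear projections (q/k/v/o plus gate/up/down) across 36 layers,
# plus the embedding table, touches exactly 7 * 36 + 1 = 253 tensors.
EXPECTED_LORA_TENSORS = 7 * 36 + 1  # 253

def split_real_vs_noise(deltas, threshold=1e-4):
    """Partition per-tensor edit deltas (name -> base-minus-edited
    array) into substantive edits and near-zero 'save noise' by
    Frobenius norm. `threshold` is an illustrative cutoff."""
    real, noise = [], []
    for name, d in deltas.items():
        (real if np.linalg.norm(d) > threshold else noise).append(name)
    return real, noise
```

Applied to the Qwen3-4B diff, a split like this is what yields the ~50 real edits versus ~203 tensors of merge residue.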
More telling: HauhauCS’s edit vectors showed median cosine similarity of 0.966 with Heretic’s edits on this model, with a regression slope of 1.06. Forensic provenance analysis estimated an 80%+ probability of Heretic derivation. When two independent techniques produce nearly identical edit directions across shared tensors, that’s convergence. When one shows LoRA artifacts and identical directional vectors to an open-source tool released months prior, that’s derivation.
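The two numbers doing the heavy lifting here, cosine similarity and regression slope between edit vectors, are cheap to compute once you have per-tensor weight deltas from both techniques. A sketch under the assumption that each delta is flattened to a single vector; the no-intercept OLS slope is my reading of "regression slope," since the exact fit used wasn't specified:

```python
import numpy as np

def edit_similarity(delta_a, delta_b):
    """Cosine similarity and least-squares slope between two flattened
    edit vectors (weight deltas for the same tensor under two
    abliteration techniques)."""
    a = np.ravel(delta_a).astype(np.float64)
    b = np.ravel(delta_b).astype(np.float64)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    slope = (a @ b) / (a @ a)  # OLS slope of b ~ slope * a, no intercept
    return cos, slope
```

Taking the median cosine over all shared tensors is what produces a headline number like 0.966; a slope near 1.06 then says one set of edits is essentially the other, rescaled by about 6%.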
The ecosystem of uncensored agentic models stripped of safety guardrails has always had a whiff of the Wild West, but this level of forensic transparency is unprecedented.
Architecture Matters: Mamba2’s A_log Target
The hybrid Mamba2+Transformer architecture introduces dynamics absent in pure Transformers. HauhauCS uniquely targets linear_attn.A_log, the Mamba2 state matrix log parameter, which has no Transformer equivalent. On the 2B and 4B models, HauhauCS modified 13 and 21 of these tensors respectively, while Heretic and Huihui ignored them entirely.
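Spotting this architectural targeting is a matter of filtering the diff by parameter name. A trivial sketch, assuming the `linear_attn.A_log` naming from the analysis above carries through to the loaded state dict:

```python
def find_mamba_state_tensors(tensor_names):
    """Pick out Mamba2 state-matrix log parameters from a list of
    parameter names. These have no pure-Transformer equivalent, so
    any edits here are unique to the hybrid architecture."""
    return [n for n in tensor_names if n.endswith("linear_attn.A_log")]
```

Intersecting this list with the set of modified tensors is how you arrive at counts like 13 (2B) and 21 (4B) for HauhauCS, and zero for Heretic and Huihui.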
This architectural specificity matters because it explains why abliteration behavior varies wildly across model families. The unauthorized “brain transplant” attempts we’ve seen recently, hybridizing proprietary reasoning into open models, rely on consistent architectural assumptions that simply don’t hold once state-space models enter the mix.
The Consistency Winner: Heretic’s Surgical Approach
Across all five models, Heretic demonstrated the most consistent performance. While HauhauCS’s KL divergence ballooned from 0.0201 (2B) to 0.2564 (27B), and Huihui oscillated between catastrophic (3.65 on 4B) and moderate (0.143 on 9B), Heretic maintained relatively stable distributional shifts between 0.0266 and 0.0825.
The 9B model revealed a stunning convergence: Heretic and Huihui achieved 100% subspace alignment with median cosine similarity of 1.0 across all 42 overlapping tensors. The two techniques independently converged on identical edit directions, strong evidence that there’s a “correct” way to abliterate specific model architectures, and Heretic found it first.
Heretic’s approach modifies fewer tensors (20-89 depending on model size) but with higher per-tensor magnitude, concentrating edits in later layers (peaking around layers 13-25) while skipping early layers entirely. This surgical precision preserves capabilities while still achieving 98-100% ASR across all tested models.
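The layer-depth profile described above falls out of a simple per-layer aggregation of edit magnitudes. A sketch, assuming the usual `model.layers.<i>.` naming and state dicts loaded as name-to-array mappings:

```python
import re
import numpy as np

def per_layer_edit_magnitude(base_sd, edited_sd):
    """Frobenius norm of the weight delta, summed per layer index,
    to locate where an abliteration concentrates its edits."""
    mags = {}
    for name, w in base_sd.items():
        m = re.search(r"layers\.(\d+)\.", name)
        if m is None or name not in edited_sd:
            continue
        layer = int(m.group(1))
        mags[layer] = mags.get(layer, 0.0) + float(np.linalg.norm(edited_sd[name] - w))
    return dict(sorted(mags.items()))
```

For a Heretic-style edit, a profile like this would show zeros through the early layers and a bulge around layers 13-25; a broad-spectrum edit spreads mass across the whole depth.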
The GGUF Obfuscation Problem
Community sentiment suggests HauhauCS releases models exclusively in GGUF format to complicate benchmarking, requiring tools like ungguf to convert back to analyzable safetensors. While GGUF is efficient for inference, it introduces quantization artifacts and obscures weight provenance. When NathanDreamFast developed conversion tools specifically to audit these claims, he found himself banned from the project’s Discord.
This opacity contrasts sharply with Heretic’s approach, which typically includes methodology documentation, refusal rates, and KL divergence metrics on model cards. The uncensored-AI gap between corporate models and unfiltered variants isn’t just about safety filters; it’s about transparency in how those filters are removed.
What This Means for the Ecosystem
The forensic data reveals three uncomfortable truths for practitioners running uncensored agentic models in production:
- Heretic offers the best capability retention at every scale tested, with the most stable distributional shifts.
- HauhauCS trades too much performance for marginal safety gains, and its “lossless” claim doesn’t survive the 9B and 27B numbers.
- Huihui is a lottery ticket that can catastrophically fail at certain scales, as the 4B collapse shows.
The uncensored model wars aren’t ending anytime soon. But at least now we have the forensic tools to separate surgical precision from marketing hype, and the data to prove when “lossless” really means “just slightly broken.”