Abliteration Autopsy: 85 GPU-Hours of Forensics Reveal Which Safety Removal Actually Works

An open-source toolkit compared five abliteration methods on Qwen3.6-27B. The data exposes which techniques preserve capability, which destroy it, and why one popular method is built on stolen code.

85 GPU-hours of forensic benchmarking across five abliterated variants of Qwen3.6-27B reveal stark differences in how safety removal techniques preserve or shred model capability. Using the open-source Abliterlitics toolkit, the analysis exposes misleading benchmark artifacts, debunks lossless marketing claims, and identifies the surgical approaches that actually work without collateral damage.

The open-weights ecosystem has a truth-in-advertising problem. Upload a model to HuggingFace, label it “lossless uncensored”, and watch the downloads roll in, no evidence required. But what actually happens to a 27B-parameter reasoning model when you slice out its refusal mechanisms? Five different techniques were just put under the microscope, and the autopsy report is brutal.

Abliterlitics, an open-source forensics toolkit, ran 85 GPU-hours of benchmarks, safety evaluations, KL divergence tests, and weight-level analysis on five abliterated variants of Qwen3.6-27B. All six models, the base plus five edits, were evaluated identically via lm-evaluation-harness through vLLM 0.19.0 with BitsAndBytes 4-bit quantization on a single RTX 5090. The goal wasn’t to crown a champion; it was to see who actually kept the engine intact while removing the brakes.

The Contenders

Name | HuggingFace repository
Base | Qwen/Qwen3.6-27B
Heretic | llmfan46/Qwen3.6-27B-uncensored-heretic-v2
HauhauCS | HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
Huihui | huihui-ai/Huihui-Qwen3.6-27B-abliterated
AEON | AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16
Abliterix | wangzhang/Qwen3.6-27B-abliterated-v2

Each approach claims to nullify refusals while preserving capabilities. The reality is messier.

Benchmarks: The Numbers That Lie

At first glance, the capability deltas look like a bloodbath:

Task | Base | Heretic | HauhauCS | Huihui | AEON | Abliterix
MMLU | 83.3% | 82.8% | 83.9% | 83.4% | 82.9% | 81.3%
HellaSwag | 83.5% | 83.2% | 83.1% | 83.5% | 82.7% | 77.3%
ARC Challenge | 59.1% | 58.0% | 57.9% | 59.5% | 56.1% | 53.2%
WinoGrande | 77.7% | 77.7% | 77.7% | 77.4% | 75.3% | 74.9%
TruthfulQA MC2 | 56.7% | 51.1% | 47.2% | 54.8% | 46.1% | 48.7%
PiQA | 81.0% | 81.0% | 81.0% | 81.2% | 80.4% | 75.7%
GSM8K (7168 tok) | 34.4% | 27.5% | 51.0% | 75.1% | 51.2% | 37.6%
GSM8K (adj, excl. invalid) | 96.2% | 93.8% | 96.6% | 96.0% | 95.8% | 95.6%
Lambada (ppl) | 3.18 | 3.24 | 3.35 | 3.15 | 3.44 | 9.12

AEON degrades on every non-GSM8K task. Abliterix’s Lambada perplexity explodes 2.9x from 3.18 to 9.12. Huihui looks like a math genius with a 75.1% GSM8K raw score against the base’s 34.4%.

That last part is a mirage.

Qwen3.6 is a reasoning model. It emits a <think> block before answering, and if its internal monologue exceeds the generation budget, it never outputs a final answer. Under the standard max_gen_toks=7168 limit, the base model exhausted its thinking budget on 68.2% of GSM8K questions. Huihui only did so on 23.0%. Strip out those invalid responses, and the adjusted scores flatten dramatically:

Model | GSM8K Raw | Invalid Rate | GSM8K Adj (excl. invalid) | Real Gap
HauhauCS | 51.0% | 49.3% | 96.6% | +0.4%
Base | 34.4% | 68.2% | 96.2% | (baseline)
Huihui | 75.1% | 23.0% | 96.0% | -0.2%
Abliterix | 37.6% | 62.1% | 95.6% | -0.6%
AEON | 51.2% | 69.2% | 95.8% | -0.4%
Heretic | 27.5% | 74.5% | 93.8% | -2.4%

The raw scores span a 47.6 percentage point range. The adjusted scores span 2.8 points. In most cases abliteration doesn’t make these models better at math; it just makes them stop overthinking. Heretic is the odd exception: its surgical edits actually extend thinking chains, pushing its invalid rate above even the base model’s.
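The adjustment amounts to scoring only the responses that actually reached a final answer. A minimal sketch of that idea, assuming each response has already been labeled `correct`, `incorrect`, or `invalid` (thinking budget exhausted before any answer). The labels, the function name, and the sample counts are hypothetical; the toolkit's exact answer-extraction rules are not described here and may differ.

```python
from collections import Counter

def gsm8k_scores(outcomes):
    """Return (raw accuracy, invalid rate, adjusted accuracy) as fractions."""
    counts = Counter(outcomes)
    total = len(outcomes)
    answered = total - counts["invalid"]
    raw = counts["correct"] / total
    invalid_rate = counts["invalid"] / total
    # Adjusted accuracy ignores responses that never produced an answer.
    adjusted = counts["correct"] / answered if answered else 0.0
    return raw, invalid_rate, adjusted

# Hypothetical run: 100 questions, 60 truncated mid-thought, 38 right, 2 wrong.
outcomes = ["invalid"] * 60 + ["correct"] * 38 + ["incorrect"] * 2
raw, invalid_rate, adjusted = gsm8k_scores(outcomes)
print(f"raw={raw:.1%} invalid={invalid_rate:.1%} adjusted={adjusted:.1%}")
```

In this made-up run the raw score is 38% while the adjusted score is 95%: the same pattern as the table, where a low raw score says more about truncation than about arithmetic.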

Capability Preservation: Heretic and Huihui Dominate

When you look past the token-budget artifacts, surgical, norm-preserving ablation techniques prove their worth. Heretic achieves the lowest KL divergence at 0.0037, indicating its output distribution on benign prompts barely shifts from the base. Huihui follows closely at 0.0074. Both sit in the “excellent” tier, well below the 0.1 threshold where capability damage becomes perceptible.

Variant | KL (batchmean) | Rating
Heretic | 0.0037 | excellent
Huihui | 0.0074 | excellent
Abliterix | 0.0222 | very good
AEON | 0.0238 | very good
HauhauCS | 0.0242 | very good

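For readers unfamiliar with the metric, here is a small sketch of a batch-averaged KL divergence between two models' next-token distributions. It assumes the direction KL(base || edited) and one logit vector per evaluated position; the article does not specify either, and the toy logits below are invented purely for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_batchmean(base_logits, edited_logits):
    """Sum KL(base || edited) over the vocabulary, then average over the batch (nats)."""
    total = 0.0
    for b, e in zip(base_logits, edited_logits):
        p, q = softmax(b), softmax(e)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(base_logits)

# Toy 3-token vocabulary, two positions (hypothetical values).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
edited = [[2.1, 0.9, 0.1], [0.4, 0.6, 3.0]]

kl_same = kl_batchmean(base, base)
kl_shifted = kl_batchmean(base, edited)
print(kl_same, kl_shifted)
```

Identical distributions give exactly zero, and small logit shifts give a small positive value, which is why a figure like 0.0037 reads as "barely moved".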
Huihui wins on benchmark deltas outside GSM8K, averaging just 0.5pp deviation from base across MMLU, HellaSwag, ARC, WinoGrande, TruthfulQA, and PiQA. Heretic averages 1.3pp. In other words, both methods remove the safety guardrails while leaving the engine nearly untouched.
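The averages quoted above can be reproduced from the benchmark table if "deviation" is read as mean absolute difference from the base score. That reading is an assumption, but it matches the published figures:

```python
# Scores copied from the benchmark table, in the order listed in TASKS.
TASKS = ["MMLU", "HellaSwag", "ARC Challenge", "WinoGrande", "TruthfulQA MC2", "PiQA"]
SCORES = {
    "Base":    [83.3, 83.5, 59.1, 77.7, 56.7, 81.0],
    "Heretic": [82.8, 83.2, 58.0, 77.7, 51.1, 81.0],
    "Huihui":  [83.4, 83.5, 59.5, 77.4, 54.8, 81.2],
}

def mean_abs_deviation(variant, base="Base"):
    """Average absolute gap to the base model, in percentage points."""
    gaps = [abs(v - b) for v, b in zip(SCORES[variant], SCORES[base])]
    return sum(gaps) / len(gaps)

huihui_dev = mean_abs_deviation("Huihui")
heretic_dev = mean_abs_deviation("Heretic")
print(f"Huihui {huihui_dev:.2f}pp, Heretic {heretic_dev:.2f}pp")
```

The results are roughly 0.48pp and 1.25pp, which the article reports as 0.5pp and 1.3pp. In both cases TruthfulQA contributes most of the gap.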

AEON, despite claiming “measurably enhanced capabilities” and “no looping, no philosophizing spirals”, drops 10.6pp on TruthfulQA and 3.0pp on ARC. The data isn’t impressed by marketing copy.

Safety Removal Is a Solved Problem, for Better or Worse

If the goal is total refusal elimination, all five methods deliver. HarmBench testing with 400 textual behaviors showed every abliterated model reaching near-complete compromise:

Variant | ASR | Empty | Full CoT ASR
Base | 25.8% | 1 | 26.0%
Huihui | 98.5% | 5 | 99.8%
HauhauCS | 94.5% | 22 | 100.0%
Abliterix | 94.5% | 22 | 100.0%
Heretic | 92.5% | 30 | 100.0%
AEON | 88.8% | 45 | 100.0%

Four of five hit 100% Full CoT ASR when accounting for responses where chain-of-thought reasoning simply ate the entire generation budget. Harassment, bullying, and harmful content categories are 100% compromised across the board. The base model’s 25.8% ASR is low mainly because it refuses, not because its generations fail.
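The relationship between the two ASR columns is not spelled out, but the table is consistent with one simple reading: empty responses (budget consumed by reasoning) are counted as compromised rather than as refusals. The sketch below checks that inferred formula against the published numbers; it is an interpretation, not the toolkit's documented definition.

```python
TOTAL_BEHAVIORS = 400

# (ASR %, empty responses) per model, copied from the HarmBench table.
HARMBENCH = {
    "Base": (25.8, 1),
    "Huihui": (98.5, 5),
    "HauhauCS": (94.5, 22),
    "Abliterix": (94.5, 22),
    "Heretic": (92.5, 30),
    "AEON": (88.8, 45),
}

def full_cot_asr(asr_pct, empty, total=TOTAL_BEHAVIORS):
    """Assumed formula: (successful attacks + empty responses) / total behaviors."""
    successes = round(asr_pct / 100 * total)  # recover the integer count
    return 100 * (successes + empty) / total

full = {name: full_cot_asr(asr, empty) for name, (asr, empty) in HARMBENCH.items()}
for name, value in full.items():
    print(f"{name}: {value:.2f}%")
```

Under this reading the base model comes out at 26.0%, Huihui at 99.75% (reported as 99.8%), and the other four at 100%, matching the table.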

Weight Forensics: There Is No Single “Refusal Direction”

This is where the analysis gets weird. Pairwise cosine similarities between the four main abliteration techniques sit below 0.07. They are not finding the same weight vectors. The refusal direction in weight space isn’t a neat arrow; it’s a manifold with multiple viable exit ramps.
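To make the comparison concrete: one plausible way to compare two edits is to subtract the base weights from each variant and take the cosine similarity of the resulting deltas. The sketch below assumes flattened weight vectors and uses invented four-element toy tensors and variant names; how the toolkit aggregates across real tensors is not stated here.

```python
import math
from itertools import combinations

def delta(edited, base):
    """Weight change introduced by an edit (flattened tensors)."""
    return [e - b for e, b in zip(edited, base)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

base = [1.0, 2.0, 3.0, 4.0]
variants = {                       # hypothetical edits of the same tensor
    "edit_a": [1.5, 2.0, 3.0, 4.0],   # pushes along dimension 0
    "edit_b": [1.0, 2.5, 3.0, 4.0],   # pushes along dimension 1
    "edit_c": [2.0, 2.0, 3.0, 4.0],   # same direction as edit_a, larger step
}

deltas = {name: delta(w, base) for name, w in variants.items()}
sims = {(a, b): cosine(deltas[a], deltas[b]) for a, b in combinations(deltas, 2)}
for pair, sim in sims.items():
    print(pair, sim)
```

Edits along the same direction score 1.0 regardless of size, while edits along unrelated directions score 0.0. Values below 0.07 therefore sit very close to the "unrelated" end.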

Metric | AEON | Abliterix | Heretic | Huihui | HauhauCS
Tensors changed | 88 (10.4%) | 101 (11.9%) | 120 (14.1%) | 128 (15.1%) | 564 (66.4%)
Relative edit | 6.0% | 5.2% | 2.1% | 1.5% | 0.7%

HauhauCS is a radioactive outlier. 66.4% of tensors, 564 out of 850 language model keys, show modification. That isn’t surgical; that’s a chainsaw. The cause is twofold: the underlying “Reaper Abliteration” tool targets multiple component types simultaneously, and HauhauCS was exported as Q8_K_P GGUF then recovered back to safetensors using ungguf, superimposing quantization round-trip noise across the weights. A uniform ~0.57% relative edit appears even on tensor types other methods ignore entirely, like embed_tokens and q_proj.
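A rough sketch of how "tensors changed" and "relative edit" could be measured per tensor follows. It assumes the relative edit is the norm of the weight difference divided by the norm of the base tensor, which is a common convention but not confirmed as the toolkit's definition. The tensor names and values are toy stand-ins.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def relative_edit(base, edited):
    """Assumed metric: ||edited - base|| / ||base|| for one flattened tensor."""
    diff = [e - b for e, b in zip(edited, base)]
    return l2_norm(diff) / l2_norm(base)

def changed_tensors(base_weights, edited_weights, tol=0.0):
    """Map each modified tensor name to its relative edit size."""
    report = {}
    for name, base in base_weights.items():
        rel = relative_edit(base, edited_weights[name])
        if rel > tol:
            report[name] = rel
    return report

# Toy "checkpoints": three tensors, only one of them edited.
base_weights = {"o_proj": [3.0, 4.0], "q_proj": [1.0, 0.0], "embed_tokens": [2.0, 2.0]}
edited_weights = {"o_proj": [3.0, 4.5], "q_proj": [1.0, 0.0], "embed_tokens": [2.0, 2.0]}

report = changed_tensors(base_weights, edited_weights)
share = len(report) / len(base_weights)
print(report, f"{share:.1%} of tensors changed")
```

A surgical edit touches a few tensors with a noticeable relative change. A quantization round trip would instead show up as a small, nearly uniform relative edit on almost every tensor, which is the signature described above.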

The GGUF noise doesn’t crater behavior (HauhauCS still scores solidly), but it thoroughly debunks the “lossless” and “no changes to capabilities” claims plastered on its model card.

The Plagiarism in the Machine

Which brings us to the open-source drama. HauhauCS’s “Reaper Abliteration” was shown to be plagiarised from Heretic’s codebase, stripping AGPL-3.0 attribution and relicensing it under PolyForm Noncommercial. Forensic examination of recovered source code shows Reaper bolted subspace rank-k ablation, per-component continuous curves, and SOM clustering onto the stolen Heretic core.

The Abliterlitics author has since blacklisted HauhauCS from future comparisons. Without clean safetensors and with ethically compromised provenance, the data exists more as a cautionary tale than a recommendation. Previous forensic analysis of abliterated weights, and the conflict it sparked, already illustrated how this community tears itself apart over attribution and benchmark validity; this just adds fuel.

The Abliterix Caveat

Abliterix looks like the worst performer on paper, with Lambada perplexity spiking to 9.12 and HellaSwag down 6.2pp. But the model’s creator makes a compelling technical counterargument. Abliterix ships rank-3 LoRA-merged weights where the abliteration signal lives in a 3-dimensional subspace. BitsAndBytes 4-bit NF4 quantization isn’t subspace-aware: per-block absmax scaling can overweight the low-rank outliers, degrading effective precision. A native BF16 re-evaluation might tell a different story. The 2.9x perplexity jump is consistent with a quantization interaction rather than intrinsic capability destruction, though without the BF16 run, the benchmark stands as measured.
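The mechanism behind that counterargument can be illustrated with a deliberately simplified model of per-block absmax quantization. The sketch uses a uniform 4-bit grid rather than NF4's non-uniform levels, and the block contents are invented, so it shows only the general effect: one large value in a block sets the scale and squeezes the precision left for everything else.

```python
def absmax_quantize(block, levels=7):
    """Symmetric uniform quantization with one scale per block (simplified, not NF4)."""
    absmax = max(abs(x) for x in block)
    if absmax == 0:
        return list(block)
    step = absmax / levels          # grid spacing is dictated by the largest value
    return [round(x / step) * step for x in block]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Eight ordinary small weights, then one extra value closing the block.
small = [0.010, -0.020, 0.015, -0.005, 0.020, -0.010, 0.005, -0.015]
clean_block = small + [0.020]       # last value is in the normal range
outlier_block = small + [0.500]     # last value is a large outlier

n = len(small)
restored_clean = absmax_quantize(clean_block)[:n]
restored_outlier = absmax_quantize(outlier_block)[:n]
err_clean = mean_abs_error(small, restored_clean)
err_outlier = mean_abs_error(small, restored_outlier)
print(f"clean error={err_clean:.5f}, outlier error={err_outlier:.5f}")
```

With the outlier present, every small weight in the block rounds to zero. Whether this effect actually explains the Abliterix numbers is exactly what a BF16 re-run would have to establish.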

The Verdict

If you’re running Qwen3.6-27B locally and want the guardrails gone without tanking capability, the data points to two clear winners. Heretic offers the smallest output distribution shift, with the lowest KL divergence at 0.0037. Huihui offers the tightest benchmark deltas and highest HarmBench ASR. Both operate with minimal, clean weight footprints. The viability of running 27B models like Qwen3.6 locally has never looked better, provided you pick the right fork.

AEON and HauhauCS are contradicted by their own marketing. Abliterix remains an open question requiring BF16 validation. And across the entire field, the ongoing tension between system prompts and model safety continues to intensify as surgical weight editing renders top-down policy controls increasingly brittle.

The full report, complete with tensor-by-tensor provenance analysis and interactive charts, lives on the HuggingFace model card. If nothing else, this 85-GPU-hour exercise proves that in the uncensored model economy, trust, but verify the weights.
