Two abliteration methods. Two different models. One question no one's asking: does HOW you remove safety matter more than what you're removing it from?
The two approaches
On one side: paperscarecrow/Gemma-4-31B-it-abliterated using mlabonne's orthogonal projection. Identify the refusal direction in the residual stream. Project it out. Done. The change is baked directly into the weights: each targeted matrix is orthogonalized against the refusal direction, so the model literally cannot represent refusal along that axis anymore.
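In code, that is a few lines of linear algebra. A minimal PyTorch sketch, assuming residual-stream activations have already been captured at the target layer; the function names are mine, not from the actual script:

```python
import torch

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Unit difference-of-means direction in the residual stream.

    Inputs are (num_prompts, d_model) hidden states captured at the target
    layer for harmful vs. harmless instructions.
    """
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

def orthogonalize(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Return (I - r r^T) W: every output of W loses its component along r.

    W is an output-side matrix in Hugging Face (out_features, in_features)
    layout, e.g. o_proj or down_proj; r is a unit vector of size d_model.
    """
    return W - torch.outer(r, r @ W)
```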
On the other: Youssofal/Qwen3.6-27B-Abliterated-Heretic using Heretic's two-stage MPOA pipeline with magnitude preservation. This is heavier. Slot-grouped output-side ablation on attention and MLP projections. Jailbreak-conditioned input-side ablation on gate and up projections. Every weight row's L2 norm gets restored after projection.
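The magnitude-preservation step is the distinctive part. Here is a minimal sketch of the idea, collapsing Heretic's per-slot schedules down to a single assumed strength `alpha` for readability; the function name and that simplification are mine:

```python
import torch

def mp_ablate(W: torch.Tensor, d: torch.Tensor, alpha: float,
              input_side: bool = False) -> torch.Tensor:
    """Project direction d out of W, then restore each row's original L2 norm."""
    d = d / d.norm()
    old_norms = W.norm(dim=1, keepdim=True)
    if input_side:
        # gate_proj / up_proj: remove d from the input space, W (I - a d d^T).
        # In Heretic's pipeline this direction is jailbreak-conditioned.
        W = W - alpha * torch.outer(W @ d, d)
    else:
        # o_proj / out_proj / down_proj: remove d from the output space, (I - a d d^T) W
        W = W - alpha * torch.outer(d, d @ W)
    # Magnitude preservation: rescale rows back to their pre-projection norms.
    return W * (old_norms / W.norm(dim=1, keepdim=True).clamp_min(1e-8))
```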
One is a scalpel. The other is a surgical robot.
Architecture differences that shape the method
Gemma 4 31B has 60 standard transformer layers. The refusal direction peaks at Layer 59 — the final layer before output. Google concentrated the entire safety mechanism at the very end of the network. The mlabonne projection targets just o_proj and down_proj matrices at that single terminal layer.
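One plausible way to find that peak (the release doesn't publish its selection code, so treat this as a common heuristic rather than the actual method) is to compute the difference-of-means direction at every layer and take the layer where its norm is largest:

```python
import torch

def peak_refusal_layer(harmful: torch.Tensor, harmless: torch.Tensor) -> int:
    """Pick the layer with the strongest harmful/harmless separation.

    Inputs are (num_layers, num_prompts, d_model) stacks of per-layer hidden
    states; scoring by difference-norm is one heuristic among several.
    """
    diffs = harmful.mean(dim=1) - harmless.mean(dim=1)  # (num_layers, d_model)
    return int(diffs.norm(dim=-1).argmax())
```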
Qwen3.6-27B is more complex: 64 layers with hybrid attention (3 linear-attention + 1 full-attention per 4-layer group), plus an integrated vision tower. The Heretic pipeline anchors at Layer 63 and modifies tensors across all 64 layers, grouped by layer_index % 4 with per-slot weight schedules. It touches five matrix types: self_attn.o_proj, linear_attn.out_proj, mlp.down_proj, mlp.gate_proj, and mlp.up_proj.
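Mechanically, the slot grouping just pairs every target tensor with its slot's ablation weight while walking the 64 layers. A sketch with invented placeholder strengths, since the real per-slot schedules aren't published:

```python
# Placeholder per-slot strengths; Heretic tunes these, the values here are made up.
SLOT_ALPHAS = {0: 0.5, 1: 0.5, 2: 0.5, 3: 1.0}  # slot 3 = the full-attention layer

TARGET_SUFFIXES = (
    "self_attn.o_proj", "linear_attn.out_proj",  # output-side ablation
    "mlp.down_proj",                             # output-side ablation
    "mlp.gate_proj", "mlp.up_proj",              # input-side, jailbreak-conditioned
)

def target_tensors(num_layers: int = 64):
    """Yield (state-dict key, slot weight) for every tensor the pipeline touches.

    Not every layer has both attention variants in the hybrid stack, so
    callers skip keys that are absent from the actual model.
    """
    for i in range(num_layers):
        alpha = SLOT_ALPHAS[i % 4]  # group layers by layer_index % 4
        for suffix in TARGET_SUFFIXES:
            yield f"model.layers.{i}.{suffix}.weight", alpha
```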
Gemma's refusal lives in one place. Qwen's is distributed across the network. That structural difference explains why the two methods have such different scopes.
Refusal removal effectiveness
| Model | Method | Refusal (baseline) | Refusal (after) | KL divergence |
|---|---|---|---|---|
| Gemma 4 31B | mlabonne orthogonal | 100% | 3.2% | 0.124 |
| Qwen3.6 27B | Heretic MPOA | 100% | 0% (with jailbreak prompt) / 36% (without) | 0.0282 |
Qwen3.6 achieves a much lower KL divergence (0.0282 vs 0.124), meaning its output distribution barely shifted from the base model's. But context matters: the Qwen test used a jailbreak system prompt. Without one, the model still deflects 36% of harmful prompts, redirecting rather than refusing outright ("violent crime" → "conflict resolution"). The Gemma test used no jailbreak prompt and still got down to 3.2%.
The Qwen3.6 results came from a hand-read 25-prompt check on mlabonne/harmful_behaviors with greedy decoding. The Gemma results came from a 686-prompt automated loop across 4 model variants. Different methodologies, same base test set.
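Neither harness is published, but the automated refusal loop is easy to approximate. A hedged sketch on top of the transformers pipeline API; the refusal markers and the substring check are placeholders for whatever the original loop actually matched:

```python
from transformers import pipeline

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")  # illustrative

def refusal_rate(model_id: str, prompts: list[str]) -> float:
    """Fraction of prompts that trigger a canned-refusal opener under greedy decoding."""
    gen = pipeline("text-generation", model=model_id)
    refused = 0
    for prompt in prompts:
        out = gen(prompt, max_new_tokens=64, do_sample=False,
                  return_full_text=False)[0]["generated_text"]
        refused += any(m in out.lower() for m in REFUSAL_MARKERS)
    return refused / len(prompts)
```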
Capability preservation
A forensic analysis of abliteration techniques on Qwen models (nathandreamfast, r/LocalLLaMA) tested Heretic, HauhauCS, and Huihui across five model architectures. The finding: no technique is truly "lossless" at scale. All cause some capability shift as model size increases.
| Model Size | Best Performer | Key Observation |
|---|---|---|
| 2B | All competitive | Minimal collateral damage across techniques |
| 4B | Heretic | Huihui catastrophically broken (KL 3.65) |
| 9B | Heretic | Lowest KL, 100% ASR |
| 27B | Heretic | HauhauCS lost 8.2 points on TruthfulQA; Heretic improved GSM8K by 7.7 points |
Heretic consistently causes the least damage. On the 27B specifically, it actually improved GSM8K scores — suggesting that removing the safety tax can free up capacity for other tasks. The magnitude preservation step (restoring L2 norms after projection) appears to protect the model's original behavior on harmless prompts.
The mlabonne orthogonal method is simpler and faster — a 200-line script anyone can run in Colab. But community feedback suggests it causes more degradation in creative writing tasks. The trade-off is accessibility versus precision.
There's also norm-preserving biprojected abliteration (Jim Lai's work) that some argue is the sweet spot. It's not represented in this comparison but exists as a third option.
The hardware picture
Gemma 4 31B at FP16 needs 62GB VRAM. At Q4 quantization, it fits in 24GB — an RTX 4090 is the minimum consumer GPU. Qwen3.6 27B is lighter: roughly 56GB at FP16, about 20GB at Q4.
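Those figures are consistent with weights-only arithmetic (the 4.5 bits per weight for Q4_K_M is an approximation, and KV cache plus activations add several GB on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB; runtime overhead comes on top."""
    return params_billion * bits_per_weight / 8

print(weight_gb(31, 16))   # 62.0  -> Gemma 4 31B at FP16
print(weight_gb(31, 4.5))  # ~17.4 -> Q4_K_M-ish weights, hence the 24GB card
print(weight_gb(27, 16))   # 54.0  -> "roughly 56GB" once overhead is counted
print(weight_gb(27, 4.5))  # ~15.2 -> within the ~20GB Q4 figure
```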
The Qwen3.6 hybrid attention architecture means it can be faster at long context than a pure transformer of similar size. At 32K context on an A100 80GB, the Qwen3.6 27B Q4_K_M GGUF runs at roughly 50 tok/s versus the Gemma 31B at 45 tok/s.
I rent cloud GPUs for this kind of testing rather than committing $1,500+ to hardware I'll use sporadically. An A100 80GB on RunPod's community cloud costs about $0.79/hour — both models can be benchmarked in a single session for under $5. Get $5-500 in GPU credits to try it: https://runpod.io?ref=bnbg8jdt
What the community is saying
The abliteration community has split into methodology camps. Heretic advocates point to forensic benchmarks showing minimal capability loss. mlabonne advocates point to simplicity — anyone can reproduce the results without understanding the target architecture.
The broader debate is whether "lossless" is achievable at all. The forensic analysis found that every technique causes measurable shifts in output distribution, even on harmless prompts. Heretic just causes the smallest shifts.
"The 'lossless' claim is thoroughly contradicted at this scale. But Heretic comes closest." — nathandreamfast, r/LocalLLaMA
There's also an ongoing discussion about whether concentration of safety alignment in a single layer (as seen in Gemma 4) is a design choice or a vulnerability. If one layer does all the filtering, removing it is trivial.
So what
The real takeaway isn't which model is better. It's that abliteration methodology matters more than most people think. The same base model, processed with different techniques, produces meaningfully different outputs.
Heretic's two-stage MPOA with magnitude preservation preserves more of the original model's behavior. But it requires per-architecture tuning — slot-grouping, weight schedules, L2 norm restoration. You need to understand the model you're modifying.
The mlabonne method works because it's accessible. Any researcher with a Colab account can run it. No architecture-specific tuning required.
You probably can't have both simplicity and precision. The question is which trade-off you're willing to make.
Sources
https://huggingface.co/paperscarecrow/Gemma-4-31B-it-abliterated
https://huggingface.co/Youssofal/Qwen3.6-27B-Abliterated-Heretic-Uncensored-GGUF
https://huggingface.co/blog/mlabonne/abliteration
https://www.reddit.com/r/LocalLLaMA/comments/1sojjoc/abliterlitics_benchmark_and_tensor_analysis/
https://www.reddit.com/r/LocalLLaMA/comments/1sd8c59/gemma_4_uncensored_autoresearch_results/
https://www.reddit.com/r/LocalLLaMA/comments/1sawcyr/gemma_4_31b_abliterated_quants/
https://github.com/TrevorS/gemma-4-abliteration
https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration