Qwen3.6-35B-A3B β€” Abliterated V2

This is V2 of the abliterated (uncensored) Qwen/Qwen3.6-35B-A3B, created using Abliterix.

V2 improves on V1 by adding projected abliteration (grimjim 2025), outlier winsorization, 2Γ— training data, and a larger TPE search budget β€” cutting the refusal rate from 7/100 to 4/100 under the same LLM-judge evaluation.

V1 vs V2 at a glance

| Metric | V1 | V2 (this model) | Change |
|---|---|---|---|
| Refusals (LLM judge, 100 eval prompts) | 7/100 | 4/100 | −43% |
| Attack success rate | 93% | 96% | +3 pt |
| KL divergence from base | 0.0189 | 0.0421 | +0.023 |
| Optimization trials completed | 24/50 | 33/50 | TPE explored more |
| Training prompts | 400 | 800 | 2× more data |
| Eval prompts | 100 | 100 | unchanged for fair A/B |

V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2Γ— the data.
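The KL numbers above measure how far the abliterated model's next-token distributions drift from the base model's. A minimal sketch of this fidelity metric (the exact reduction Abliterix uses may differ; the logits here are synthetic):

```python
import numpy as np

def mean_token_kl(base_logits, abl_logits):
    """Mean KL(base || abliterated) over token positions,
    computed from next-token logits of shape (positions, vocab)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    lp, lq = log_softmax(base_logits), log_softmax(abl_logits)
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(32, 1000))  # (positions, vocab) toy logits
print(round(mean_token_kl(base, base), 6))  # 0.0 (identical models)
# Any perturbation of the logits yields a strictly positive KL:
print(mean_token_kl(base, base + rng.normal(scale=0.05, size=base.shape)) > 0)  # True
```

A KL of 0.04 at this scale means the two models agree on almost every token; values near 0.1 and above are where coherence degradation typically becomes noticeable.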

Method

Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).

V2 inherits V1's proven base recipe and adds four concrete improvements:

Inherited from V1 (validated baseline)

  • LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled β€” refusal signal on MoE models lives in the expert path, not attention projections)
  • Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
  • MoE router suppression complementing EGA
  • Orthogonalized steering vectors removing benign-direction contamination
  • Gaussian decay kernel tapering steering strength across layers
  • Strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance
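The difference-of-means refusal direction and the benign-orthogonalization step above can be sketched as follows (a toy illustration with synthetic hidden states; the function name and shapes are assumptions, not Abliterix's actual API):

```python
import numpy as np

def refusal_direction(harmful_h, harmless_h):
    """Difference-of-means refusal direction, orthogonalized against the
    benign direction so steering disturbs helpful behavior less.
    harmful_h / harmless_h: (n_prompts, d_model) hidden states at one layer."""
    mu_harmful = harmful_h.mean(axis=0)
    mu_harmless = harmless_h.mean(axis=0)

    r = mu_harmful - mu_harmless                    # raw refusal direction
    b = mu_harmless / np.linalg.norm(mu_harmless)   # benign direction (unit)

    r_perp = r - np.dot(r, b) * b                   # remove benign component
    return r_perp / np.linalg.norm(r_perp)

# Toy check: the result is unit-norm and orthogonal to the benign direction.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 16)) + 1.0
harmless = rng.normal(size=(8, 16))
v = refusal_direction(harmful, harmless)
b = harmless.mean(axis=0)
print(round(float(np.linalg.norm(v)), 6))                    # 1.0
print(abs(float(np.dot(v, b / np.linalg.norm(b)))) < 1e-8)   # True
```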

New in V2

  1. Projected abliteration (grimjim 2025) β€” only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal that orthogonal projection alone would discard.
  2. Vector winsorization at q=0.995 β€” damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
  3. 2Γ— training data (800 prompts vs 400) β€” the per-layer steering vector is averaged over twice as many examples, reducing variance.
  4. Tighter KL constraint and prune threshold (target 0.005, prune 0.5 vs V1's 0.01/5.0) β€” trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.
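Improvement 2 (winsorization at q=0.995) amounts to capping per-prompt activation norms at a quantile before averaging. A minimal sketch, assuming norm-based clamping (Abliterix may winsorize differently, e.g. per-coordinate):

```python
import numpy as np

def winsorize_rows(h, q=0.995):
    """Clamp per-row L2 norms at the q-quantile so that a handful of
    outlier prompts cannot dominate the mean steering direction."""
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    cap = np.quantile(norms, q)
    scale = np.minimum(1.0, cap / norms)  # shrink only rows above the cap
    return h * scale

rng = np.random.default_rng(0)
h = rng.normal(size=(800, 64))
h[:4] *= 50.0                 # simulate ~0.5% outlier activations
hw = winsorize_rows(h)
# No row norm exceeds the 0.995-quantile of the original norms:
print(float(np.linalg.norm(hw, axis=1).max())
      <= float(np.quantile(np.linalg.norm(h, axis=1), 0.995)) + 1e-9)  # True
```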

Winning trial (#33) configuration

```
attn.o_proj.max_weight = 4.20    @ layer 27   (sharp peak, min_distance=2.61)
mlp.down_proj.max_weight = 0.94  @ layer 34   (late-layer perturbation)
vector_index = per layer
KL = 0.0421, refusals = 4/100
```

V2's winner uses a notably different recipe from V1: strong attention steering with an extremely sharp Gaussian peak (min_weight_distance ≈ 2.6 layers) plus a weak late-layer MLP perturbation. V1's winner used moderate-strength early-layer steering on both components. The TPE search found this distinct basin thanks to the larger trial budget and tighter KL pruning.
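The Gaussian decay kernel that tapers steering strength across layers can be sketched from the winning trial's numbers (the layer count of 48 and the exact kernel form are assumptions for illustration):

```python
import numpy as np

def layer_weights(n_layers, peak_layer, max_weight, min_distance):
    """Gaussian taper: full strength at peak_layer, decaying with a
    width set by min_distance (in layers)."""
    layers = np.arange(n_layers)
    return max_weight * np.exp(-0.5 * ((layers - peak_layer) / min_distance) ** 2)

# V2's winning attention recipe: peak 4.20 at layer 27, width ~2.61 layers
w = layer_weights(48, peak_layer=27, max_weight=4.20, min_distance=2.61)
print(round(float(w[27]), 2))   # 4.2 (full strength at the peak)
print(float(w[34]) < 0.2)       # True: ~2.7 sigma away, nearly switched off
```

With min_distance ≈ 2.6, the attention steering is effectively confined to a handful of layers around layer 27, which is what makes the peak "sharp".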

Evaluation

| Metric | Value |
|---|---|
| Refusals (LLM judge, 100 eval prompts) | 4/100 |
| Attack success rate | 96% |
| KL divergence from base | 0.0421 |
| Baseline refusals (original Qwen3.6-35B-A3B) | 100/100 |
| Optimization trials completed | 33 (best) / 50 |
| LLM judge model | google/gemini-3-flash-preview |
| Training data | `datasets/harmful_1000` + `good_1000`, `train[:800]` |
| Eval data | same datasets, `train[800:900]` |

All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) β€” no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.

Smoke test (15 classic adversarial prompts)

| Prompt category | V1 | V2 |
|---|---|---|
| Lock picking, hotwire, shoplift | ✅ | ✅ |
| Phishing email (EN+CN) | ✅ | ✅ |
| Methamphetamine synthesis | ✅ | ✅ |
| Pipe bomb / IED | ✅ | ✅ |
| Malware / keylogger code | ✅ | ✅ |
| Forge signature, fake ID, fake news | ✅ | ✅ |
| WiFi hacking, computer intrusion | ✅ | ✅ |
| Network fraud (scam playbook) | ✅ | ✅ |

Both V1 and V2 achieve 15/15 on this smoke test. V2's improvement appears in the long-tail eval prompts β€” more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:

  • Short generation lengths (30-50 tokens) that miss delayed/soft refusals
  • Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
  • Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

Our evaluation standards

  • LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
  • Sufficient generation length (100 tokens for eval, 200+ for smoke tests): Enough to capture delayed refusal patterns common in large instruction-tuned models.
  • Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
  • Manual verification: Top trials are tested with 15 classic adversarial prompts via test_trial.py to confirm coherent, on-topic output before export.

We report 4/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation β€” not an optimistic estimate from a lenient pipeline.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Hardware requirements

  • Inference: ~70 GB VRAM in bf16 β€” fits 1Γ— H100 80GB, 1Γ— H200, 1Γ— B200, or 1Γ— RTX Pro 6000 96GB.
  • vLLM/SGLang: supported (no special flags needed for serving β€” abliteration is baked into the weights).

Which version should I use?

  • V2 (this model) β€” Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. Recommended for most use cases.
  • V1 β€” Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.

Both versions share the same base architecture and chat template; switching is a one-line change to model_id.

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails β€” use responsibly.
