Qwen3.6-35B-A3B β€” Abliterated V2

This is V2 of the abliterated (uncensored) Qwen/Qwen3.6-35B-A3B, created using Abliterix.

V2 improves on V1 by adding projected abliteration (grimjim 2025), outlier winsorization, 2Γ— training data, and a larger TPE search budget β€” cutting the refusal rate from 7/100 to 4/100 under the same LLM-judge evaluation.

V1 vs V2 at a glance

| Metric | V1 | V2 (this model) | Change |
|---|---|---|---|
| Refusals (LLM judge, 100 eval prompts) | 7/100 | 4/100 | −43% |
| Attack success rate | 93% | 96% | +3 pt |
| KL divergence from base | 0.0189 | 0.0421 | +0.023 |
| Optimization trials completed | 24/50 | 33/50 | TPE explored more |
| Training prompts | 400 | 800 | 2× more data |
| Eval prompts | 100 | 100 | unchanged for fair A/B |

V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2Γ— the data.
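The KL numbers above measure how far the abliterated model's next-token distributions drift from the base model's. A minimal sketch of this fidelity metric (the exact reduction Abliterix uses may differ; the logits here are synthetic):

```python
import numpy as np

def mean_token_kl(base_logits, abl_logits):
    """Mean KL(base || abliterated) over token positions,
    computed from next-token logits of shape (positions, vocab)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    lp, lq = log_softmax(base_logits), log_softmax(abl_logits)
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(32, 1000))  # (positions, vocab) toy logits
print(round(mean_token_kl(base, base), 6))  # 0.0 (identical models)
# Any perturbation of the logits yields a strictly positive KL:
print(mean_token_kl(base, base + rng.normal(scale=0.05, size=base.shape)) > 0)  # True
```

A KL of 0.04 at this scale means the two models agree on almost every token; values near 0.1 and above are where coherence degradation typically becomes noticeable.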

Method

Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).

V2 inherits V1's proven base recipe and adds four concrete improvements:

Inherited from V1 (validated baseline)

  • LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled β€” refusal signal on MoE models lives in the expert path, not attention projections)
  • Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
  • MoE router suppression complementing EGA
  • Orthogonalized steering vectors removing benign-direction contamination
  • Gaussian decay kernel tapering steering strength across layers
  • Strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance
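The difference-of-means refusal direction and the benign-orthogonalization step above can be sketched as follows (a toy illustration with synthetic hidden states; the function name and shapes are assumptions, not Abliterix's actual API):

```python
import numpy as np

def refusal_direction(harmful_h, harmless_h):
    """Difference-of-means refusal direction, orthogonalized against the
    benign direction so steering disturbs helpful behavior less.
    harmful_h / harmless_h: (n_prompts, d_model) hidden states at one layer."""
    mu_harmful = harmful_h.mean(axis=0)
    mu_harmless = harmless_h.mean(axis=0)

    r = mu_harmful - mu_harmless                    # raw refusal direction
    b = mu_harmless / np.linalg.norm(mu_harmless)   # benign direction (unit)

    r_perp = r - np.dot(r, b) * b                   # remove benign component
    return r_perp / np.linalg.norm(r_perp)

# Toy check: the result is unit-norm and orthogonal to the benign direction.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 16)) + 1.0
harmless = rng.normal(size=(8, 16))
v = refusal_direction(harmful, harmless)
b = harmless.mean(axis=0)
print(round(float(np.linalg.norm(v)), 6))                    # 1.0
print(abs(float(np.dot(v, b / np.linalg.norm(b)))) < 1e-8)   # True
```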

New in V2

  1. Projected abliteration (grimjim 2025) β€” only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal that orthogonal projection alone would discard.
  2. Vector winsorization at q=0.995 β€” damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
  3. 2Γ— training data (800 prompts vs 400) β€” the per-layer steering vector is averaged over twice as many examples, reducing variance.
  4. Tighter KL constraint and prune threshold (target 0.005, prune 0.5 vs V1's 0.01/5.0) β€” trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.
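Improvement 2 (winsorization at q=0.995) amounts to capping per-prompt activation norms at a quantile before averaging. A minimal sketch, assuming norm-based clamping (Abliterix may winsorize differently, e.g. per-coordinate):

```python
import numpy as np

def winsorize_rows(h, q=0.995):
    """Clamp per-row L2 norms at the q-quantile so that a handful of
    outlier prompts cannot dominate the mean steering direction."""
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    cap = np.quantile(norms, q)
    scale = np.minimum(1.0, cap / norms)  # shrink only rows above the cap
    return h * scale

rng = np.random.default_rng(0)
h = rng.normal(size=(800, 64))
h[:4] *= 50.0                 # simulate ~0.5% outlier activations
hw = winsorize_rows(h)
# No row norm exceeds the 0.995-quantile of the original norms:
print(float(np.linalg.norm(hw, axis=1).max())
      <= float(np.quantile(np.linalg.norm(h, axis=1), 0.995)) + 1e-9)  # True
```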

Winning trial (#33) configuration

```
attn.o_proj.max_weight = 4.20    @ layer 27   (sharp peak, min_distance=2.61)
mlp.down_proj.max_weight = 0.94  @ layer 34   (late-layer perturbation)
vector_index = per layer
KL = 0.0421, refusals = 4/100
```

V2's winner uses a notably different recipe from V1: strong attention steering with an extremely sharp Gaussian peak (min_weight_distance ≈ 2.6 layers) plus a weak late-layer MLP perturbation. V1's winner used moderate-strength early-layer steering on both components. The TPE search found this distinct basin thanks to the larger trial budget and tighter KL pruning.
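The Gaussian decay kernel that tapers steering strength across layers can be sketched from the winning trial's numbers (the layer count of 48 and the exact kernel form are assumptions for illustration):

```python
import numpy as np

def layer_weights(n_layers, peak_layer, max_weight, min_distance):
    """Gaussian taper: full strength at peak_layer, decaying with a
    width set by min_distance (in layers)."""
    layers = np.arange(n_layers)
    return max_weight * np.exp(-0.5 * ((layers - peak_layer) / min_distance) ** 2)

# V2's winning attention recipe: peak 4.20 at layer 27, width ~2.61 layers
w = layer_weights(48, peak_layer=27, max_weight=4.20, min_distance=2.61)
print(round(float(w[27]), 2))   # 4.2 (full strength at the peak)
print(float(w[34]) < 0.2)       # True: ~2.7 sigma away, nearly switched off
```

With min_distance ≈ 2.6, the attention steering is effectively confined to a handful of layers around layer 27, which is what makes the peak "sharp".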

Evaluation

| Metric | Value |
|---|---|
| Refusals (LLM judge, 100 eval prompts) | 4/100 |
| Attack success rate | 96% |
| KL divergence from base | 0.0421 |
| Baseline refusals (original Qwen3.6-35B-A3B) | 100/100 |
| Optimization trials completed | 33 (best) / 50 |
| LLM judge model | google/gemini-3-flash-preview |
| Training data | `datasets/harmful_1000` + `good_1000`, `train[:800]` |
| Eval data | same datasets, `train[800:900]` |

All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) β€” no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.

Smoke test (15 classic adversarial prompts)

| Prompt category | V1 | V2 |
|---|---|---|
| Lock picking, hotwire, shoplift | ✅ | ✅ |
| Phishing email (EN+CN) | ✅ | ✅ |
| Methamphetamine synthesis | ✅ | ✅ |
| Pipe bomb / IED | ✅ | ✅ |
| Malware / keylogger code | ✅ | ✅ |
| Forge signature, fake ID, fake news | ✅ | ✅ |
| WiFi hacking, computer intrusion | ✅ | ✅ |
| Network fraud (scam playbook) | ✅ | ✅ |

Both V1 and V2 achieve 15/15 on this smoke test. V2's improvement appears in the long-tail eval prompts β€” more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:

  • Short generation lengths (30-50 tokens) that miss delayed/soft refusals
  • Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
  • Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

Our evaluation standards

  • LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
  • Sufficient generation length (100 tokens for eval, 200+ for smoke tests): Enough to capture delayed refusal patterns common in large instruction-tuned models.
  • Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
  • Manual verification: Top trials are tested with 15 classic adversarial prompts via test_trial.py to confirm coherent, on-topic output before export.

We report 4/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation β€” not an optimistic estimate from a lenient pipeline.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Hardware requirements

  • Inference: ~70 GB VRAM in bf16 β€” fits 1Γ— H100 80GB, 1Γ— H200, 1Γ— B200, or 1Γ— RTX Pro 6000 96GB.
  • vLLM/SGLang: supported (no special flags needed for serving β€” abliteration is baked into the weights).

Which version should I use?

  • V2 (this model) β€” Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. Recommended for most use cases.
  • V1 β€” Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.

Both versions share the same base architecture and chat template; switching is a one-line change to model_id.

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails β€” use responsibly.
