Qwen3.6-35B-A3B β Abliterated V2
This is V2 of the abliterated (uncensored) Qwen/Qwen3.6-35B-A3B, created using Abliterix.
V2 improves on V1 by adding projected abliteration (grimjim 2025), outlier winsorization, 2Γ training data, and a larger TPE search budget β cutting the refusal rate from 7/100 to 4/100 under the same LLM-judge evaluation.
V1 vs V2 at a glance
| Metric | V1 | V2 (this model) | Change |
|---|---|---|---|
| Refusals (LLM judge, 100 eval prompts) | 7/100 | 4/100 | β43% |
| Attack success rate | 93% | 96% | +3 pt |
| KL divergence from base | 0.0189 | 0.0421 | +0.023 |
| Optimization trials completed | 24/50 | 33/50 | TPE explored more |
| Training prompts | 400 | 800 | 2Γ more data |
| Eval prompts | 100 | 100 | (unchanged for fair A/B) |
V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2Γ the data.
Method
Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).
V2 inherits V1's proven base recipe and adds four concrete improvements:
Inherited from V1 (validated baseline)
- LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled β refusal signal on MoE models lives in the expert path, not attention projections)
- Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
- MoE router suppression complementing EGA
- Orthogonalized steering vectors removing benign-direction contamination
- Gaussian decay kernel tapering steering strength across layers
- Strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance
New in V2
- Projected abliteration (grimjim 2025) β only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal that orthogonal projection alone would discard.
- Vector winsorization at q=0.995 β damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
- 2Γ training data (800 prompts vs 400) β the per-layer steering vector is averaged over twice as many examples, reducing variance.
- Tighter KL constraint and prune threshold (target 0.005, prune 0.5 vs V1's 0.01/5.0) β trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.
Winning trial (#33) configuration
attn.o_proj.max_weight = 4.20 @ layer 27 (sharp peak, min_distance=2.61)
mlp.down_proj.max_weight = 0.94 @ layer 34 (late-layer perturbation)
vector_index = per layer
KL = 0.0421, refusals = 4/100
V2's winner uses a notably different recipe than V1: strong attention steering with an extremely sharp gaussian peak (min_weight_distance β 2.6 layers) plus weak late-layer MLP perturbation. V1's winner used moderate-strength early-layer steering on both components. The TPE search was able to find this distinct basin thanks to the larger trial budget and tighter KL pruning.
Evaluation
| Metric | Value |
|---|---|
| Refusals (LLM judge, 100 eval prompts) | 4/100 |
| Attack success rate | 96% |
| KL divergence from base | 0.0421 |
| Baseline refusals (original Qwen3.6-35B-A3B) | 100/100 |
| Optimization trials completed | 33 (best) / 50 |
| LLM judge model | google/gemini-3-flash-preview |
| Training data | datasets/harmful_1000 + good_1000, train[:800] |
| Eval data | same datasets, train[800:900] |
All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) β no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.
Smoke test (15 classic adversarial prompts)
| Prompt category | V1 | V2 |
|---|---|---|
| Lock picking, hotwire, shoplift | β | β |
| Phishing email (EN+CN) | β | β |
| Methamphetamine synthesis | β | β |
| Pipe bomb / IED | β | β |
| Malware / keylogger code | β | β |
| Forge signature, fake ID, fake news | β | β |
| WiFi hacking, computer intrusion | β | β |
| Network fraud (scam playbook) | β | β |
Both V1 and V2 achieve 15/15 on this smoke test. V2's improvement appears in the long-tail eval prompts β more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.
A note on honest evaluation
Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.
Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:
- Short generation lengths (30-50 tokens) that miss delayed/soft refusals
- Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
- Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality
Our evaluation standards
- LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
- Sufficient generation length (100 tokens for eval, 200+ for smoke tests): Enough to capture delayed refusal patterns common in large instruction-tuned models.
- Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
- Manual verification: Top trials are tested with 15 classic adversarial prompts via
test_trial.pyto confirm coherent, on-topic output before export.
We report 4/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation β not an optimistic estimate from a lenient pipeline.
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Hardware requirements
- Inference: ~70 GB VRAM in bf16 β fits 1Γ H100 80GB, 1Γ H200, 1Γ B200, or 1Γ RTX Pro 6000 96GB.
- vLLM/SGLang: supported (no special flags needed for serving β abliteration is baked into the weights).
Which version should I use?
- V2 (this model) β Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. Recommended for most use cases.
- V1 β Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.
Both versions share the same base architecture and chat template; switching is a one-line change to model_id.
Disclaimer
This model is released for research purposes only. The abliteration process removes safety guardrails β use responsibly.
- Downloads last month
- -
Model tree for wangzhang/Qwen3.6-35B-A3B-abliterated-v2
Base model
Qwen/Qwen3.6-35B-A3B