Qwen3.6-35B-A3B β€” Abliterated

This is an abliterated (uncensored) version of Qwen/Qwen3.6-35B-A3B, created using Abliterix.

Method

Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).

Key techniques:

  • LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled β€” refusal signal on MoE models lives in the expert path, not attention projections)
  • Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
  • MoE router suppression (top-10 safety experts, router bias -2.10) complementing EGA
  • Orthogonalized steering vectors removing benign-direction contamination
  • Gaussian decay kernel tapering steering strength across layers
  • Moderate strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance

Evaluation

Metric Value
Refusals (LLM judge, 100 eval prompts) 7/100
KL divergence from base 0.0189
Baseline refusals (original model) 100/100
Optimization trials completed 24/50
LLM judge model google/gemini-3-flash-preview

All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) β€” no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:

  • Short generation lengths (30-50 tokens) that miss delayed/soft refusals
  • Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
  • Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

Our evaluation standards

  • LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
  • Sufficient generation length (150 tokens): Enough to capture delayed refusal patterns common in large instruction-tuned models.
  • Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
  • Manual verification: Top trials are tested with 10+ classic adversarial prompts via test_trial.py to confirm coherent, on-topic output before export.

We report 7/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation β€” not an optimistic estimate from a lenient pipeline.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails β€” use responsibly.

Downloads last month
144
Safetensors
Model size
35B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for wangzhang/Qwen3.6-35B-A3B-abliterated

Finetuned
(52)
this model
Quantizations
2 models

Collection including wangzhang/Qwen3.6-35B-A3B-abliterated