HLWQ Models (collection)
Hadamard-Lloyd Weight Quantization · arXiv:2603.29078 · formerly PolarQuant
CompressedTensors INT4 quantization of Jackrong/Qwopus3.5-27B-v3 (abliterated) via HLWQ (Hadamard-Lloyd Weight Quantization)
Native vLLM. Marlin kernel. Zero plugin. 168 tok/s on A100.
| Metric | Value |
|---|---|
| 📦 Format | CompressedTensors INT4 symmetric (gs=128) |
| 💾 Model size | 15.2 GB (3 shards) |
| 📉 Compression | 72% (54 → 15.2 GB) |
| ⚡ Kernel | Marlin (fused dequant+matmul) |
| 🏗️ Architecture | Qwen3.5 hybrid — 64 layers (48 GDN + 16 Full Attention) |
| 🔢 Parameters | 27B dense |
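As a sanity check on the reported size, a back-of-envelope estimate. It assumes one 16-bit scale per 128-weight group; the precision of embeddings and norms is not stated on this card, so the gap to 15.2 GB is only plausibly explained by those tensors staying at higher precision:

```python
# Rough size estimate for INT4 symmetric quantization, group size 128.
# Assumes a 16-bit scale per group (an assumption, not stated on the card).
params = 27e9
group_size = 128

bits_per_weight = 4 + 16 / group_size        # 4-bit value + amortized scale
quantized_gb = params * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight:.3f} bits/weight")                      # 4.125
print(f"~{quantized_gb:.1f} GB for the quantized linear weights")
```

That lands at ~13.9 GB for the quantized weights alone, consistent with a 15.2 GB total once higher-precision tensors are included.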
```shell
pip install vllm

vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5 \
  --language-model-only --enforce-eager
```
No plugin. No `pip install polarquant`. No custom code.
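Once the server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal stdlib client sketch (port 8000 is the vLLM default; adjust `base_url` if you pass `--port`):

```python
import json
import urllib.request

# Model name must match the served repo id.
MODEL = "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5"

def build_payload(prompt, max_tokens=100):
    # Request body per the OpenAI chat-completions schema.
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8000/v1"):
    # POST to the server's /chat/completions endpoint and
    # return the assistant's reply text.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("Hello!"))
```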
| GPU | tok/s | VRAM |
|---|---|---|
| A100 80GB | 168 | 15.2 GB |
| RTX PRO 6000 96GB | 18 | 15.2 GB |
| RTX 4090 24GB | ~10 | 15.2 GB |
| Method | PPL (WikiText-2) | Delta |
|---|---|---|
| BF16 baseline | 6.37 | — |
| HLWQ → INT4 (ours) | 6.56 | +0.19 |
| Direct INT4 (naive) | 6.68 | +0.31 |
HLWQ produces better INT4 weights than direct quantization — 0.12 PPL improvement from Hadamard rotation + Lloyd-Max preprocessing.
| GPU | VRAM | Status |
|---|---|---|
| RTX 4060 Ti | 16 GB | ⚠️ Tight (15.2 GB model + KV cache) |
| RTX 4090 | 24 GB | ✅ Comfortable |
| A100 / H100 | 80 GB | ✅ Full speed |
| RTX PRO 6000 | 96 GB | ✅ Full speed |
| Spec | Value |
|---|---|
| Parameters | 27B (dense) |
| Layers | 64 (48 GDN + 16 Full Attention) |
| Hidden dim | 3584 |
| Head dim | 128 |
| Context | 131,072 tokens |
| Vision | Multimodal ViT (skipped with --language-model-only) |
Standard INT4 quantizes weights directly — outliers cause high error. HLWQ adds a preprocessing step before INT4:
```
BF16 weights
     │
     ▼
[1] Hadamard rotation → distributes energy uniformly
     │   (eliminates outliers, weights become Gaussian)
     │
     ▼
[2] Lloyd-Max Q5 → MSE-optimal 5-bit quantization
     │   (best possible codebook for Gaussian distribution)
     │
     ▼
[3] Dequant → BF16 → INT4 symmetric (gs=128)
     │   (cleaner weights = better INT4)
     │
     ▼
CompressedTensors (Marlin kernel) → vLLM serve
```
Same speed as GPTQ/AWQ, better quality.
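The effect of step [1] can be demonstrated in isolation. The sketch below is illustrative only, not the released implementation (it omits the Lloyd-Max Q5 stage): it quantizes a weight vector containing outliers to symmetric INT4, with and without a Hadamard rotation, and compares the reconstruction error:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def int4_symmetric(w, gs=128):
    # Per-group symmetric quantization: one scale per gs weights,
    # integer levels in [-7, 7], immediately dequantized.
    groups = w.reshape(-1, gs)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(groups / scale), -7, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
n = 1024
w = rng.normal(size=n)
w[rng.choice(n, 8, replace=False)] *= 20   # inject a few outliers

H = hadamard(n)
direct = int4_symmetric(w)
rotated = H.T @ int4_symmetric(H @ w)      # quantize in rotated basis, undo rotation

mse_direct = np.mean((w - direct) ** 2)
mse_hlwq = np.mean((w - rotated) ** 2)
print(f"direct INT4 MSE:  {mse_direct:.4f}")
print(f"rotated INT4 MSE: {mse_hlwq:.4f}")  # lower: outliers are spread out
```

Because the Hadamard matrix is orthonormal, the rotation is exact in infinite precision; the entire error difference comes from the per-group scales no longer being dominated by outliers.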
```shell
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    device_map="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
| Flag | Why |
|---|---|
| `--language-model-only` | Qwen3.5 is multimodal — skip vision encoder (text weights only) |
| `--enforce-eager` | Required on Blackwell GPUs (cc 12.0). Optional on Ampere/Hopper |
```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```
| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | `pip install polarquant` |
| 🏠 Base model | Jackrong/Qwopus3.5-27B-v3 |
| 🔀 vLLM plugin | polarengine-vllm |