HLWQ Models (collection)
Hadamard-Lloyd Weight Quantization · arXiv:2603.29078 · formerly PolarQuant
CompressedTensors INT4 quantization of Jackrong/Qwopus3.5-27B-v3 (abliterated) via HLWQ (Hadamard-Lloyd Weight Quantization)
Native vLLM. Marlin kernel. Zero plugin. 168 tok/s on A100.
| Metric | Value |
|---|---|
| 📦 Format | CompressedTensors INT4 symmetric (gs=128) |
| 💾 Model size | 15.2 GB (3 shards) |
| 📉 Compression | 72% (54 → 15.2 GB) |
| ⚡ Kernel | Marlin (fused dequant+matmul) |
| 🏗️ Architecture | Qwen3.5 hybrid — 64 layers (48 GDN + 16 Full Attention) |
| 🔢 Parameters | 27B dense |
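As a sanity check on the reported size, a back-of-envelope estimate. It assumes one 16-bit scale per 128-weight group; the precision of embeddings and norms is not stated on this card, so the gap to 15.2 GB is only plausibly explained by those tensors staying at higher precision:

```python
# Rough size estimate for INT4 symmetric quantization, group size 128.
# Assumes a 16-bit scale per group (an assumption, not stated on the card).
params = 27e9
group_size = 128

bits_per_weight = 4 + 16 / group_size        # 4-bit value + amortized scale
quantized_gb = params * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight:.3f} bits/weight")                      # 4.125
print(f"~{quantized_gb:.1f} GB for the quantized linear weights")
```

That lands at ~13.9 GB for the quantized weights alone, consistent with a 15.2 GB total once higher-precision tensors are included.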
```shell
pip install vllm

vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5 \
  --language-model-only --enforce-eager
```
No plugin. No `pip install polarquant`. No custom code.
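Once the server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal stdlib client sketch (port 8000 is the vLLM default; adjust `base_url` if you pass `--port`):

```python
import json
import urllib.request

# Model name must match the served repo id.
MODEL = "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5"

def build_payload(prompt, max_tokens=100):
    # Request body per the OpenAI chat-completions schema.
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8000/v1"):
    # POST to the server's /chat/completions endpoint and
    # return the assistant's reply text.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("Hello!"))
```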
| GPU | tok/s | VRAM |
|---|---|---|
| A100 80GB | 168 | 15.2 GB |
| RTX PRO 6000 96GB | 18 | 15.2 GB |
| RTX 4090 24GB | ~10 | 15.2 GB |
| Method | PPL (WikiText-2) | Delta |
|---|---|---|
| BF16 baseline | 6.37 | — |
| HLWQ → INT4 (ours) | 6.56 | +0.19 |
| Direct INT4 (naive) | 6.68 | +0.31 |
HLWQ produces better INT4 weights than direct quantization — 0.12 PPL improvement from Hadamard rotation + Lloyd-Max preprocessing.
| GPU | VRAM | Status |
|---|---|---|
| RTX 4060 Ti | 16 GB | ⚠️ Tight (15.2 GB model + KV cache) |
| RTX 4090 | 24 GB | ✅ Comfortable |
| A100 / H100 | 80 GB | ✅ Full speed |
| RTX PRO 6000 | 96 GB | ✅ Full speed |
| Spec | Value |
|---|---|
| Parameters | 27B (dense) |
| Layers | 64 (48 GDN + 16 Full Attention) |
| Hidden dim | 3584 |
| Head dim | 128 |
| Context | 131,072 tokens |
| Vision | Multimodal ViT (skipped with --language-model-only) |
Standard INT4 quantizes weights directly — outliers cause high error. HLWQ adds a preprocessing step before INT4:
```
BF16 weights
     │
     ▼
[1] Hadamard rotation → distributes energy uniformly
     │   (eliminates outliers, weights become Gaussian)
     │
     ▼
[2] Lloyd-Max Q5 → MSE-optimal 5-bit quantization
     │   (best possible codebook for Gaussian distribution)
     │
     ▼
[3] Dequant → BF16 → INT4 symmetric (gs=128)
     │   (cleaner weights = better INT4)
     │
     ▼
CompressedTensors (Marlin kernel) → vLLM serve
```
Same speed as GPTQ/AWQ, better quality.
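The effect of step [1] can be demonstrated in isolation. The sketch below is illustrative only, not the released implementation (it omits the Lloyd-Max Q5 stage): it quantizes a weight vector containing outliers to symmetric INT4, with and without a Hadamard rotation, and compares the reconstruction error:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def int4_symmetric(w, gs=128):
    # Per-group symmetric quantization: one scale per gs weights,
    # integer levels in [-7, 7], immediately dequantized.
    groups = w.reshape(-1, gs)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(groups / scale), -7, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
n = 1024
w = rng.normal(size=n)
w[rng.choice(n, 8, replace=False)] *= 20   # inject a few outliers

H = hadamard(n)
direct = int4_symmetric(w)
rotated = H.T @ int4_symmetric(H @ w)      # quantize in rotated basis, undo rotation

mse_direct = np.mean((w - direct) ** 2)
mse_hlwq = np.mean((w - rotated) ** 2)
print(f"direct INT4 MSE:  {mse_direct:.4f}")
print(f"rotated INT4 MSE: {mse_hlwq:.4f}")  # lower: outliers are spread out
```

Because the Hadamard matrix is orthonormal, the rotation is exact in infinite precision; the entire error difference comes from the per-group scales no longer being dominated by outliers.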
```shell
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    device_map="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
| Flag | Why |
|---|---|
| `--language-model-only` | Qwen3.5 is multimodal — skip vision encoder (text weights only) |
| `--enforce-eager` | Required on Blackwell GPUs (cc 12.0). Optional on Ampere/Hopper |
```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```
| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | `pip install polarquant` |
| 🏠 Base model | Jackrong/Qwopus3.5-27B-v3 |
| 🔀 vLLM plugin | polarengine-vllm |