⚡ Huihui-Qwopus3.5-27B-v3-abliterated — HLWQ CT INT4

CompressedTensors INT4 of Jackrong/Qwopus3.5-27B-v3 (abliterated) via HLWQ (Hadamard-Lloyd Weight Quantization)

Native vLLM. Marlin kernel. Zero plugins. 168 tok/s on an A100.

📊 Compression

| Metric | Value |
|---|---|
| 📦 Format | CompressedTensors INT4 symmetric (gs=128) |
| 💾 Model size | 15.2 GB (3 shards) |
| 📉 Compression | 72% (54 → 15.2 GB) |
| ⚡ Kernel | Marlin (fused dequant+matmul) |
| 🏗️ Architecture | Qwen3.5 hybrid, 64 layers (48 GDN + 16 Full Attention) |
| 🔢 Parameters | 27B dense |
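The reported size is consistent with back-of-envelope INT4 math. This is a sketch only: the split between quantized weights and higher-precision tensors is an assumption, not read from the checkpoint.

```python
params = 27e9                         # 27B weights
int4_gb = params * 0.5 / 1e9          # 4 bits = 0.5 bytes per weight
scales_gb = params / 128 * 2 / 1e9    # one BF16 scale per 128-weight group (gs=128)
print(round(int4_gb + scales_gb, 1))  # ≈ 13.9 GB
```

The remaining ~1.3 GB up to 15.2 GB plausibly comes from tensors kept in BF16 (embeddings, norms, scales metadata), but that breakdown is a guess.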

🏎️ Quick Start — One Command

```bash
pip install vllm
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5 \
  --language-model-only --enforce-eager
```

No plugin. No `pip install polarquant`. No custom code. (The plugin is only needed for the Transformers path below.)

📊 Speed

| GPU | tok/s | VRAM |
|---|---|---|
| A100 80GB | 168 | 15.2 GB |
| RTX PRO 6000 96GB | 18 | 15.2 GB |
| RTX 4090 24GB | ~10 | 15.2 GB |

🎯 Quality

| Method | PPL (WikiText-2) | ΔPPL |
|---|---|---|
| BF16 baseline | 6.37 | – |
| HLWQ → INT4 (ours) | 6.56 | +0.19 |
| Direct INT4 (naive) | 6.68 | +0.31 |

HLWQ produces better INT4 weights than direct quantization: the Hadamard rotation + Lloyd-Max preprocessing recovers 0.12 PPL relative to the naive baseline.
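For intuition, the PPL deltas translate into a small per-token log-likelihood gap. This is plain arithmetic on the table's numbers, nothing model-specific:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token),
# so the PPL ratio gives the extra NLL introduced by quantization.
extra_nll = math.log(6.56) - math.log(6.37)
print(round(extra_nll, 4))  # ~0.029 extra nats per token
```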

💻 GPU Compatibility

| GPU | VRAM | Status |
|---|---|---|
| RTX 4060 Ti | 16 GB | ⚠️ Tight (15.2 GB model + KV cache) |
| RTX 4090 | 24 GB | ✅ Comfortable |
| A100 / H100 | 80 GB | ✅ Full speed |
| RTX PRO 6000 | 96 GB | ✅ Full speed |

🧬 Architecture

| Spec | Value |
|---|---|
| Parameters | 27B (dense) |
| Layers | 64 (48 GDN + 16 Full Attention) |
| Hidden dim | 3584 |
| Head dim | 128 |
| Context | 131,072 tokens |
| Vision | Multimodal ViT (skipped with `--language-model-only`) |

🔬 How HLWQ Works

Standard INT4 quantizes weights directly — outliers cause high error. HLWQ adds a preprocessing step before INT4:

```
BF16 weights
    │
    ▼
[1] Hadamard rotation → distributes energy uniformly
    │  (eliminates outliers, weights become Gaussian)
    │
    ▼
[2] Lloyd-Max Q5 → MSE-optimal 5-bit quantization
    │  (best possible codebook for Gaussian distribution)
    │
    ▼
[3] Dequant → BF16 → INT4 symmetric (gs=128)
    │  (cleaner weights = better INT4)
    │
    ▼
CompressedTensors (Marlin kernel) → vLLM serve
```

Same speed as GPTQ/AWQ, better quality.
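The effect of step [1] can be demonstrated in a few lines of NumPy: a single outlier weight inflates its group's scale and drowns out its neighbors, while quantizing in a Hadamard-rotated basis spreads that energy across all coordinates. This is an illustrative sketch of the rotation idea only — plain round-to-nearest at toy sizes, not Lloyd-Max and not the actual HLWQ implementation:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

def quant_int4_symmetric(w, group_size=128):
    # symmetric INT4: levels in [-7, 7], one scale per group
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scale), -7, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
n = 256
w = rng.normal(size=n)
w[3] = 20.0  # inject an outlier that blows up the group scale

H = hadamard(n)
direct = quant_int4_symmetric(w)
# quantize in the rotated basis, then rotate back (H is orthonormal)
rotated = H.T @ quant_int4_symmetric(H @ w)

err_direct = np.mean((w - direct) ** 2)
err_rotated = np.mean((w - rotated) ** 2)
print(f"direct MSE:  {err_direct:.4f}")
print(f"rotated MSE: {err_rotated:.4f}")
```

Because the rotation is orthonormal, quantization error in the rotated basis maps back one-to-one, so the MSE gap directly reflects how much the outlier hurt the direct quantizer.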

🔧 Usage with Transformers

```bash
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers the HLWQ scheme with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    trust_remote_code=True
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

⚠️ Important Flags

| Flag | Why |
|---|---|
| `--language-model-only` | Qwen3.5 is multimodal; skips the vision encoder and loads text weights only |
| `--enforce-eager` | Required on Blackwell GPUs (cc 12.0); optional on Ampere/Hopper |

📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

🔗 Links

| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | `pip install polarquant` |
| 🏠 Base model | Jackrong/Qwopus3.5-27B-v3 |
| 🔀 vLLM plugin | polarengine-vllm |