---
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
language:
- en
- zh
- ko
- ja
tags:
- hlwq
- quantized
- compressed-tensors
- int4
- marlin
- vllm
- qwen3.5
- hybrid-attention
- gated-deltanet
- abliterated
library_name: transformers
pipeline_tag: text-generation
---

# ⚡ Huihui-Qwopus3.5-27B-v3-abliterated — HLWQ CT INT4

**CompressedTensors INT4** of [Jackrong/Qwopus3.5-27B-v3](https://huggingface.co/Jackrong/Qwopus3.5-27B-v3) (abliterated), produced with HLWQ (Hadamard-Lloyd Weight Quantization).

> Native vLLM. Marlin kernel. No plugin. **168 tok/s on A100.**

## 📊 Compression

![Compression](assets/compression.png)

| Metric | Value |
|--------|-------|
| 📦 Format | CompressedTensors INT4 symmetric (gs=128) |
| 💾 Model size | **15.2 GB** (3 shards) |
| 📉 Compression | **72%** (54 GB → 15.2 GB) |
| ⚡ Kernel | Marlin (fused dequant + matmul) |
| 🏗️ Architecture | Qwen3.5 hybrid — 64 layers (48 GDN + 16 Full Attention) |
| 🔢 Parameters | 27B dense |

## 🏎️ Quick Start — One Command

```bash
pip install vllm
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5 \
  --language-model-only --enforce-eager
```

No plugin. No `pip install polarquant`. No custom code.

## 📊 Speed

![Speed Benchmark](assets/speed.png)

| GPU | tok/s | VRAM |
|-----|-------|------|
| A100 80GB | **168** | 15.2 GB |
| RTX PRO 6000 96GB | **18** | 15.2 GB |
| RTX 4090 24GB | ~10 | 15.2 GB |

## 🎯 Quality

![Quality PPL](assets/quality_ppl.png)

| Method | PPL (WikiText-2) | Delta |
|--------|------------------|-------|
| BF16 baseline | 6.37 | — |
| **HLWQ → INT4 (ours)** | **6.56** | **+0.19** |
| Direct INT4 (naive) | 6.68 | +0.31 |

**HLWQ produces better INT4 weights** than direct quantization: a 0.12 PPL improvement from the Hadamard rotation + Lloyd-Max preprocessing.
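For reference, the "Direct INT4 (naive)" baseline in the table above (symmetric, per-group scales, gs=128) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic Gaussian weights; the function names are mine, not from the released quantization code.

```python
import numpy as np

def quantize_int4_symmetric(w, group_size=128):
    """Naive symmetric INT4: one scale per group of 128 weights.

    Codes live in [-8, 7]; each scale maps the group's max |w| to 7.
    """
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # guard all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float64) * scale).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                         # toy Gaussian "weights"
q, scale = quantize_int4_symmetric(w)
mse = np.mean((w - dequantize(q, scale)) ** 2)    # round-trip error
```

A single outlier in a group inflates that group's scale, and with it the rounding error of the other 127 weights in the group; that is the failure mode the Hadamard preprocessing described below is meant to prevent.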
## 💻 GPU Compatibility

![GPU Compatibility](assets/gpu_compat.png)

| GPU | VRAM | Status |
|-----|------|--------|
| RTX 4060 Ti | 16 GB | ⚠️ Tight (15.2 GB model + KV cache) |
| RTX 4090 | 24 GB | ✅ Comfortable |
| A100 / H100 | 80 GB | ✅ Full speed |
| RTX PRO 6000 | 96 GB | ✅ Full speed |

## 🧬 Architecture

![Architecture](assets/architecture.png)

| Spec | Value |
|------|-------|
| Parameters | 27B (dense) |
| Layers | 64 (48 GDN + 16 Full Attention) |
| Hidden dim | 3584 |
| Head dim | 128 |
| Context | 131,072 tokens |
| Vision | Multimodal ViT (skipped with `--language-model-only`) |

## 🔬 How HLWQ Works

Standard INT4 quantizes weights directly, so outliers cause high error. HLWQ adds a **preprocessing step** before INT4:

```
BF16 weights
  │
  ▼
[1] Hadamard rotation → distributes energy uniformly
  │   (eliminates outliers, weights become Gaussian)
  ▼
[2] Lloyd-Max Q5 → MSE-optimal 5-bit quantization
  │   (best possible codebook for a Gaussian distribution)
  ▼
[3] Dequant → BF16 → INT4 symmetric (gs=128)
  │   (cleaner weights = better INT4)
  ▼
CompressedTensors (Marlin kernel) → vLLM serve
```

Same speed as GPTQ/AWQ, better quality.

## 🔧 Usage with Transformers

```bash
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## ⚠️ Important Flags

| Flag | Why |
|------|-----|
| `--language-model-only` | Qwen3.5 is multimodal — skip the vision encoder (text weights only) |
| `--enforce-eager` | Required on Blackwell GPUs (cc 12.0); optional on Ampere/Hopper |

## 📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

## 🔗 Links

| Resource | Link |
|----------|------|
| 📄 Paper | [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) |
| 🔧 Code | [GitHub](https://github.com/caiovicentino/eoq-quantization) |
| 📦 PyPI | [`pip install polarquant`](https://pypi.org/project/polarquant/) |
| 🏠 Base model | [Jackrong/Qwopus3.5-27B-v3](https://huggingface.co/Jackrong/Qwopus3.5-27B-v3) |
| 🔀 vLLM plugin | [polarengine-vllm](https://github.com/caiovicentino/polarengine-vllm) |
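The three HLWQ stages described under "How HLWQ Works" can be sketched end to end on synthetic weights. This is a toy NumPy reconstruction under stated assumptions (Sylvester Hadamard construction, plain Lloyd iterations, 128-element rotation blocks); it is not the released implementation, but it shows why the rotate-then-quantize order helps when outliers are present.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n = power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def lloyd_max(x, levels=32, iters=25):
    """Plain Lloyd iterations: approach the MSE-optimal scalar codebook for x."""
    cb = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                cb[k] = members.mean()          # centroid update
    idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
    return cb[idx]                              # quantize + dequantize

def int4_sym(x, gs=128):
    """Symmetric INT4 with per-group scales; returns the dequantized weights."""
    g = x.reshape(-1, gs)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    return (np.clip(np.round(g / scale), -8, 7) * scale).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=16384)
w[::1024] = 30.0                 # inject outliers: naive INT4's failure mode

# Naive path: quantize directly; each outlier blows up its group's scale.
err_naive = np.mean((w - int4_sym(w)) ** 2)

# HLWQ path: rotate 128-weight blocks, Lloyd-Max Q5, INT4, rotate back.
H = hadamard(128)
w_rot = (w.reshape(-1, 128) @ H).ravel()      # [1] spreads outlier energy
w_lm = lloyd_max(w_rot)                       # [2] 5-bit (32-level) cleanup
w_q = int4_sym(w_lm)                          # [3] final INT4 symmetric, gs=128
w_hat = (w_q.reshape(-1, 128) @ H.T).ravel()  # undo rotation (H orthonormal)
err_hlwq = np.mean((w - w_hat) ** 2)
```

On synthetic outlier-heavy weights like these, the rotated pipeline should land well below the naive round-trip error, mirroring the +0.19 vs +0.31 PPL gap reported in the Quality table.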