---
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
language:
- en
- zh
- ko
- ja
tags:
- hlwq
- quantized
- compressed-tensors
- int4
- marlin
- vllm
- qwen3.5
- hybrid-attention
- gated-deltanet
- abliterated
library_name: transformers
pipeline_tag: text-generation
---

# ⚡ Huihui-Qwopus3.5-27B-v3-abliterated — HLWQ CT INT4

**CompressedTensors INT4** of [Jackrong/Qwopus3.5-27B-v3](https://huggingface.co/Jackrong/Qwopus3.5-27B-v3) (abliterated), produced with HLWQ (Hadamard-Lloyd Weight Quantization).

> Native vLLM. Marlin kernel. No plugin. **168 tok/s on A100.**

## 📊 Compression

![Compression](assets/compression.png)

| Metric | Value |
|--------|-------|
| 📦 Format | CompressedTensors INT4 symmetric (gs=128) |
| 💾 Model size | **15.2 GB** (3 shards) |
| 📉 Compression | **72%** (54 GB → 15.2 GB) |
| ⚡ Kernel | Marlin (fused dequant + matmul) |
| 🏗️ Architecture | Qwen3.5 hybrid — 64 layers (48 GDN + 16 Full Attention) |
| 🔢 Parameters | 27B dense |

## 🏎️ Quick Start — One Command

```bash
pip install vllm
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5 \
  --language-model-only --enforce-eager
```

No plugin. No `pip install polarquant`. No custom code.

## 📊 Speed

![Speed Benchmark](assets/speed.png)

| GPU | tok/s | VRAM |
|-----|-------|------|
| A100 80GB | **168** | 15.2 GB |
| RTX PRO 6000 96GB | **18** | 15.2 GB |
| RTX 4090 24GB | ~10 | 15.2 GB |

## 🎯 Quality

![Quality PPL](assets/quality_ppl.png)

| Method | PPL (WikiText-2) | Delta |
|--------|------------------|-------|
| BF16 baseline | 6.37 | — |
| **HLWQ → INT4 (ours)** | **6.56** | **+0.19** |
| Direct INT4 (naive) | 6.68 | +0.31 |

**HLWQ produces better INT4 weights** than direct quantization: a 0.12 PPL improvement from the Hadamard rotation + Lloyd-Max preprocessing.
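For reference, the "Direct INT4 (naive)" baseline in the table above (symmetric, per-group scales, gs=128) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic Gaussian weights; the function names are mine, not from the released quantization code.

```python
import numpy as np

def quantize_int4_symmetric(w, group_size=128):
    """Naive symmetric INT4: one scale per group of 128 weights.

    Codes live in [-8, 7]; each scale maps the group's max |w| to 7.
    """
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # guard all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float64) * scale).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                         # toy Gaussian "weights"
q, scale = quantize_int4_symmetric(w)
mse = np.mean((w - dequantize(q, scale)) ** 2)    # round-trip error
```

A single outlier in a group inflates that group's scale, and with it the rounding error of the other 127 weights in the group; that is the failure mode the Hadamard preprocessing described below is meant to prevent.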
## 💻 GPU Compatibility

![GPU Compatibility](assets/gpu_compat.png)

| GPU | VRAM | Status |
|-----|------|--------|
| RTX 4060 Ti | 16 GB | ⚠️ Tight (15.2 GB model + KV cache) |
| RTX 4090 | 24 GB | ✅ Comfortable |
| A100 / H100 | 80 GB | ✅ Full speed |
| RTX PRO 6000 | 96 GB | ✅ Full speed |

## 🧬 Architecture

![Architecture](assets/architecture.png)

| Spec | Value |
|------|-------|
| Parameters | 27B (dense) |
| Layers | 64 (48 GDN + 16 Full Attention) |
| Hidden dim | 3584 |
| Head dim | 128 |
| Context | 131,072 tokens |
| Vision | Multimodal ViT (skipped with `--language-model-only`) |

## 🔬 How HLWQ Works

Standard INT4 quantizes weights directly, so outliers cause high error. HLWQ adds a **preprocessing step** before INT4:

```
BF16 weights
  │
  ▼
[1] Hadamard rotation → distributes energy uniformly
  │   (eliminates outliers, weights become Gaussian)
  ▼
[2] Lloyd-Max Q5 → MSE-optimal 5-bit quantization
  │   (best possible codebook for a Gaussian distribution)
  ▼
[3] Dequant → BF16 → INT4 symmetric (gs=128)
  │   (cleaner weights = better INT4)
  ▼
CompressedTensors (Marlin kernel) → vLLM serve
```

Same speed as GPTQ/AWQ, better quality.

## 🔧 Usage with Transformers

```bash
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-HLWQ-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## ⚠️ Important Flags

| Flag | Why |
|------|-----|
| `--language-model-only` | Qwen3.5 is multimodal — skip the vision encoder (text weights only) |
| `--enforce-eager` | Required on Blackwell GPUs (cc 12.0); optional on Ampere/Hopper |

## 📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

## 🔗 Links

| Resource | Link |
|----------|------|
| 📄 Paper | [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) |
| 🔧 Code | [GitHub](https://github.com/caiovicentino/eoq-quantization) |
| 📦 PyPI | [`pip install polarquant`](https://pypi.org/project/polarquant/) |
| 🏠 Base model | [Jackrong/Qwopus3.5-27B-v3](https://huggingface.co/Jackrong/Qwopus3.5-27B-v3) |
| 🔀 vLLM plugin | [polarengine-vllm](https://github.com/caiovicentino/polarengine-vllm) |
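The three HLWQ stages described under "How HLWQ Works" can be sketched end to end on synthetic weights. This is a toy NumPy reconstruction under stated assumptions (Sylvester Hadamard construction, plain Lloyd iterations, 128-element rotation blocks); it is not the released implementation, but it shows why the rotate-then-quantize order helps when outliers are present.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n = power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def lloyd_max(x, levels=32, iters=25):
    """Plain Lloyd iterations: approach the MSE-optimal scalar codebook for x."""
    cb = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                cb[k] = members.mean()          # centroid update
    idx = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
    return cb[idx]                              # quantize + dequantize

def int4_sym(x, gs=128):
    """Symmetric INT4 with per-group scales; returns the dequantized weights."""
    g = x.reshape(-1, gs)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    return (np.clip(np.round(g / scale), -8, 7) * scale).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=16384)
w[::1024] = 30.0                 # inject outliers: naive INT4's failure mode

# Naive path: quantize directly; each outlier blows up its group's scale.
err_naive = np.mean((w - int4_sym(w)) ** 2)

# HLWQ path: rotate 128-weight blocks, Lloyd-Max Q5, INT4, rotate back.
H = hadamard(128)
w_rot = (w.reshape(-1, 128) @ H).ravel()      # [1] spreads outlier energy
w_lm = lloyd_max(w_rot)                       # [2] 5-bit (32-level) cleanup
w_q = int4_sym(w_lm)                          # [3] final INT4 symmetric, gs=128
w_hat = (w_q.reshape(-1, 128) @ H.T).ravel()  # undo rotation (H orthonormal)
err_hlwq = np.mean((w - w_hat) ** 2)
```

On synthetic outlier-heavy weights like these, the rotated pipeline should land well below the naive round-trip error, mirroring the +0.19 vs +0.31 PPL gap reported in the Quality table.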