Note: These models are optimized for use within an agentic harness (e.g., Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce harness dependency.
Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
Qwen3.5-9B-Harmonic
Quantizations of DJLougen/Qwen3.5-9B-Harmonic, a Qwen3.5-9B model fine-tuned with Unsloth and Hugging Face's TRL library.
Benchmark Results
Code Generation
Evaluated using EvalPlus with greedy decoding (temperature=0), served via LM Studio local inference.
| Benchmark | DJL-Qwen3.5-9B-Harmonic | Qwopus3.5-9B-v3 | Qwen3.5-9B (base) |
|---|---|---|---|
| HumanEval (pass@1) | 87.2% | 87.8% | 82.9% |
| HumanEval+ (pass@1) | 81.7% | 82.9% | 77.4% |
Comparison models: Jackrong/Qwopus3.5-9B-v3, unsloth/Qwen3.5-9B.
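For reference, a minimal sketch of how greedy completions can be gathered from an LM Studio local server for this kind of evaluation. The endpoint URL and model identifier below are assumptions, not taken from this card, and the exact EvalPlus invocation used for the numbers above is not recorded here; completions would then be scored with EvalPlus's own tooling.

```python
# Hypothetical sketch: greedy (temperature=0) completions from a local
# LM Studio server for HumanEval-style prompts, to be scored with EvalPlus.
# The endpoint URL and model name below are assumptions, not from this card.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def complete(prompt: str) -> str:
    """Ask the locally served model to finish a HumanEval function stub."""
    resp = client.chat.completions.create(
        model="qwen3.5-9b-harmonic",   # assumed LM Studio model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                  # greedy decoding, as in the table above
        max_tokens=1024,
    )
    return resp.choices[0].message.content

print(complete('Complete this Python function:\n\ndef add(a, b):\n    """Return the sum of a and b."""\n'))
```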
MMLU-Pro (Knowledge & Reasoning)
Evaluated on 280 questions (40 per category) from the TIGER-Lab/MMLU-Pro test split using Q8_0 quantization served via Ollama. Chain-of-thought prompting with greedy decoding (temperature=0).
| Benchmark | DJL-Qwen3.5-9B-Harmonic | Qwopus3.5-9B-v3 |
|---|---|---|
| MMLU-Pro (overall) | 80.36% (225/280) | 81.79% (229/280) |
Per-category breakdown:
| Category | Accuracy |
|---|---|
| Math | 92.5% (37/40) |
| Biology | 90.0% (36/40) |
| Chemistry | 87.5% (35/40) |
| Physics | 87.5% (35/40) |
| Computer Science | 75.0% (30/40) |
| Health | 70.0% (28/40) |
| Other | 60.0% (24/40) |
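A rough sketch of the evaluation loop described above (40 questions per category from the TIGER-Lab/MMLU-Pro test split, greedy decoding against a local Ollama server). The Ollama model tag, prompt wording, and answer-extraction logic are simplified assumptions, not the exact harness used for the numbers above.

```python
# Illustrative sketch only: sample 40 MMLU-Pro questions per category and query
# a local Ollama server at temperature 0. Model tag and prompt wording are assumptions.
import re
import requests
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Take the first 40 questions of each category (a simplification).
by_cat = defaultdict(list)
for row in ds:
    if len(by_cat[row["category"]]) < 40:
        by_cat[row["category"]].append(row)

def ask(question: str, options: list[str]) -> str:
    letters = "ABCDEFGHIJ"[: len(options)]
    choices = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = (f"{question}\n{choices}\n\n"
              "Think step by step, then end with 'Answer: <letter>'.")
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "qwen3.5-9b-harmonic:q8_0",   # assumed Ollama tag
            "messages": [{"role": "user", "content": prompt}],
            "options": {"temperature": 0},          # greedy decoding
            "stream": False,
        },
    )
    text = resp.json()["message"]["content"]
    match = re.search(r"Answer:\s*([A-J])", text)
    return match.group(1) if match else ""

# Example: score a single category.
cat = "math"
correct = sum(ask(r["question"], r["options"]) == r["answer"] for r in by_cat[cat])
print(f"{cat}: {correct}/{len(by_cat[cat])}")
```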
Training Data
Fine-tuned on only ~800 rows of self-generated Claude responses. I analyzed the data behind the Jackrong Qwopus models to understand what worked, then generated my own training data from scratch, structured so that quality could be quantitatively checked. The reasoning traces follow a phased structure, and the distribution of reasoning effort across those phases was shaped using ideas from harmonic and Fourier analysis. The idea is that you can treat reasoning depth allocation like a signal processing problem, where different phases of thought get weighted according to frequency-domain characteristics of the problem structure. ~800 rows turned out to be more than enough when the shape of the reasoning is right.
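The exact recipe is not published here, but the general idea can be illustrated with a toy sketch: take some per-step "complexity" signal for a problem, look at how its spectral energy splits between low and high frequencies, and use that split to decide how many tokens each reasoning phase gets. Everything in the snippet (the phase names, the signal, the budget, the weighting scheme) is a made-up stand-in, not the actual pipeline.

```python
# Toy illustration of frequency-domain reasoning-budget allocation.
# None of these names or numbers come from the actual training pipeline.
import numpy as np

PHASES = ["restate", "plan", "derive", "verify"]   # hypothetical phase labels

def phase_weights(signal: np.ndarray) -> np.ndarray:
    """Split a token budget across phases based on the spectrum of a
    per-step complexity signal (e.g. sub-problem difficulty scores)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2      # power spectrum
    low = spectrum[: len(spectrum) // 2].sum()       # broad, global structure
    high = spectrum[len(spectrum) // 2 :].sum()      # fine, local structure
    high_ratio = high / (low + high + 1e-9)
    # More high-frequency structure -> shift effort toward derivation/verification.
    base = np.array([0.15, 0.25, 0.35, 0.25])
    shift = np.array([-0.05, -0.05, 0.05, 0.05]) * high_ratio
    w = base + shift
    return w / w.sum()

difficulty = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0])  # made-up scores
budget = 2048                                                     # total reasoning tokens
for phase, w in zip(PHASES, phase_weights(difficulty)):
    print(f"{phase:>8}: ~{int(w * budget)} tokens")
```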
Base Model
- Architecture: Qwen3.5 (`Qwen3_5ForConditionalGeneration`), hybrid linear + full attention with Mamba SSM components
- Base: unsloth/Qwen3.5-9B
- Parameters: ~9B
- Hidden Size: 4,096
- Layers: 32 (24 linear attention + 8 full attention, every 4th layer)
- Attention Heads: 16 (4 KV heads, GQA)
- Head Dim: 256
- Intermediate Size: 12,288
- Activation: SiLU
- Context Length: 262,144 tokens
- Vocab Size: 248,320
- Precision: bfloat16
- License: Apache 2.0
- Vision: Qwen3.5 vision encoder (1152-dim, 27-layer, patch size 16)
- Chat Template: ChatML (`<|im_start|>` / `<|im_end|>`)
- Multimodal Tokens: image (248056), video (248057)
- RoPE: Multimodal RoPE (mRoPE) with interleaved sections [11, 11, 10], theta = 10,000,000
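Since the note at the top warns that behavior is unreliable without a system prompt, here is a minimal sketch of building a ChatML prompt by hand for raw inference. The system text is just a placeholder.

```python
# Minimal ChatML prompt construction (the system text is a placeholder).
def chatml(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml("You are a helpful assistant.", "Explain GQA in one sentence."))
```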
Layer Architecture Detail
Qwen3.5-9B uses a hybrid attention pattern mixing standard full attention with linear attention layers that include a Mamba-style SSM component (conv kernel dim = 4, SSM dtype float32):
- Full attention layers: every 4th layer (layers 3, 7, 11, 15, 19, 23, 27, 31)
- Linear attention layers: all remaining layers (24 of 32)
- Linear attention config: 16 key heads (dim 128), 32 value heads (dim 128)
- Partial rotary factor: 0.25
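The layer split can be reproduced directly from the numbers above; a small sanity-check sketch:

```python
# Reconstruct the hybrid layer pattern from the config described above:
# 32 layers total, with every 4th layer (0-indexed 3, 7, ..., 31) using full attention.
full_attn = [i for i in range(32) if i % 4 == 3]
linear_attn = [i for i in range(32) if i % 4 != 3]

print("full attention layers:  ", full_attn)         # [3, 7, 11, 15, 19, 23, 27, 31]
print("linear attention layers:", len(linear_attn))  # 24
```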
Quantizations
All quantizations produced with llama.cpp. IQ quants use an importance matrix computed from WikiText-2 calibration data.
427 tensors per file.
| Quant | Size | BPW | Notes |
|---|---|---|---|
| F16 | 16.69 GB | 16.00 | Full precision. Use if you have the VRAM/RAM. |
| Q8_0 | 8.87 GB | 8.51 | Near-lossless. |
| Q6_K | 6.85 GB | 6.57 | Excellent quality. |
| Q5_K_M | 6.02 GB | 5.77 | Great balance of quality and size. |
| Q5_K_S | 5.87 GB | 5.63 | Slightly smaller Q5. |
| Q4_K_M | 5.24 GB | 5.03 | Popular sweet spot — minimal quality loss. |
| IQ4_NL | 5.05 GB | 4.84 | Importance-matrix 4-bit — can outperform standard Q4 at similar size. |
| Q4_K_S | 4.98 GB | 4.78 | Smaller Q4. |
| IQ4_XS | 4.84 GB | 4.64 | Importance-matrix 4-bit, extra small. |
| Q3_K_M | 4.31 GB | 4.13 | Moderate quality at 3-bit. |
| IQ3_M | 4.11 GB | 3.94 | Importance-matrix 3-bit — good quality for the size. |
| IQ3_S | 4.07 GB | 3.90 | Importance-matrix 3-bit, small. |
| Q3_K_S | 3.97 GB | 3.80 | Smaller 3-bit. |
| IQ3_XXS | 3.67 GB | 3.52 | Importance-matrix 3-bit, extra extra small. |
| IQ2_M | 3.36 GB | 3.22 | Extreme compression with imatrix. Quality degrades. |
| IQ2_S | 3.19 GB | 3.06 | Extreme compression. |
| IQ2_XXS | 2.89 GB | 2.77 | Very aggressive compression. |
| IQ1_M | 2.68 GB | 2.57 | Maximum compression. Expect significant quality loss. |
| IQ1_S | 2.55 GB | 2.45 | Maximum compression. Expect significant quality loss. |
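For reference, the llama.cpp imatrix workflow looks roughly like the following. The file paths, the WikiText-2 text file, and the example quant type are assumptions; the exact commands used to produce these files are not recorded here.

```python
# Rough sketch of the llama.cpp imatrix quantization flow; all paths are assumptions.
import subprocess

# 1. Compute an importance matrix from WikiText-2 calibration text.
subprocess.run([
    "llama-imatrix",
    "-m", "Qwen3.5-9B-Harmonic-F16.gguf",   # full-precision GGUF
    "-f", "wikitext-2-raw/wiki.train.raw",  # calibration data
    "-o", "imatrix.dat",
], check=True)

# 2. Quantize with the importance matrix (here to IQ4_XS as an example).
subprocess.run([
    "llama-quantize",
    "--imatrix", "imatrix.dat",
    "Qwen3.5-9B-Harmonic-F16.gguf",
    "Qwen3.5-9B-Harmonic-IQ4_XS.gguf",
    "IQ4_XS",
], check=True)
```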
Usage
llama.cpp
llama-cli -m Qwen3.5-9B-Harmonic-Q4_K_M.gguf -p "You are a helpful assistant." -cnv
LM Studio / Ollama / KoboldCpp
Download any GGUF file and load it directly.
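If you prefer Python bindings, here is a minimal sketch with llama-cpp-python. The package is not mentioned elsewhere in this card, so treat the file name, context size, and sampling settings as assumptions.

```python
# Sketch using llama-cpp-python; file name and context size are assumptions.
from llama_cpp import Llama

llm = Llama(model_path="Qwen3.5-9B-Harmonic-Q4_K_M.gguf", n_ctx=8192)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what GQA is in two sentences."},
    ],
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```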
Credits
- Original Model: DJLougen/Qwen3.5-9B-Harmonic
- Base Model: Qwen Team — unsloth/Qwen3.5-9B
- Fine-Tuning Framework: Unsloth
- Quantization Tooling: llama.cpp