Note: These models are optimized for use within an agentic harness (e.g., Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce harness dependency.

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

ko-fi.com/djlougen


Qwen3.5-9B-Harmonic

Quantizations of DJLougen/Qwen3.5-9B-Harmonic, a Qwen3.5-9B model fine-tuned with Unsloth and Hugging Face's TRL library.

Benchmark Results

Code Generation

Evaluated using EvalPlus with greedy decoding (temperature=0). Served via LM Studio local inference.

Benchmark             DJL-Qwen3.5-9B-Harmonic   Qwopus3.5-9B-v3   Qwen3.5-9B (base)
HumanEval (pass@1)    87.2%                     87.8%             82.9%
HumanEval+ (pass@1)   81.7%                     82.9%             77.4%

Comparison models: Jackrong/Qwopus3.5-9B-v3, unsloth/Qwen3.5-9B.

MMLU-Pro (Knowledge & Reasoning)

Evaluated on 280 questions (40 per category) from the TIGER-Lab/MMLU-Pro test split using Q8_0 quantization served via Ollama. Chain-of-thought prompting with greedy decoding (temperature=0).

Benchmark             DJL-Qwen3.5-9B-Harmonic   Qwopus3.5-9B-v3
MMLU-Pro (overall)    80.36% (225/280)          81.79% (229/280)

Per-category breakdown (DJL-Qwen3.5-9B-Harmonic):

Category           Accuracy
Math               92.5% (37/40)
Biology            90.0% (36/40)
Chemistry          87.5% (35/40)
Physics            87.5% (35/40)
Computer Science   75.0% (30/40)
Health             70.0% (28/40)
Other              60.0% (24/40)
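As a sanity check, the per-category counts sum to the reported overall score; a few lines of Python reproduce the arithmetic from the table:

```python
# Correct-answer counts per category from the table above (40 questions each).
counts = {
    "Math": 37, "Biology": 36, "Chemistry": 35, "Physics": 35,
    "Computer Science": 30, "Health": 28, "Other": 24,
}

total = sum(counts.values())   # 225
questions = 40 * len(counts)   # 280
print(f"{total}/{questions} = {total / questions:.2%}")  # 225/280 = 80.36%
```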

Training Data

Fine-tuned on only ~800 rows of self-generated Claude responses. I analyzed the data behind the Jackrong Qwopus models to understand what worked, then generated my own training data from scratch, structured so that quality could be checked quantitatively. The reasoning traces follow a phased structure, and the distribution of reasoning effort across those phases was shaped using ideas from harmonic and Fourier analysis. The idea is that reasoning-depth allocation can be treated like a signal-processing problem, where different phases of thought are weighted according to frequency-domain characteristics of the problem structure. ~800 rows turned out to be more than enough when the shape of the reasoning is right.
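To make the "reasoning depth as a signal-processing problem" idea concrete, here is a toy sketch. Everything in it (the phase names, the 1/k harmonic weighting) is my illustrative assumption, not the actual data-generation pipeline: it allocates a total reasoning-token budget across ordered phases using harmonic-series weights, so earlier phases get proportionally more depth.

```python
# Toy sketch only: split a reasoning-token budget across phases using
# harmonic-series weights 1/k. The phase names and the weighting scheme are
# illustrative, not the actual pipeline used for the training data.

def harmonic_budget(phases, total_tokens):
    """Allocate total_tokens across phases proportionally to weights 1/k."""
    weights = [1 / k for k in range(1, len(phases) + 1)]
    norm = sum(weights)
    return {p: round(total_tokens * w / norm) for p, w in zip(phases, weights)}

phases = ["frame", "decompose", "solve", "verify"]  # hypothetical phase names
print(harmonic_budget(phases, 1000))
# {'frame': 480, 'decompose': 240, 'solve': 160, 'verify': 120}
```

The point of a scheme like this is that it makes the shape of a reasoning trace measurable, so data quality can be checked quantitatively rather than by eye.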

Base Model

  • Architecture: Qwen3.5 (Qwen3_5ForConditionalGeneration) — hybrid linear + full attention with Mamba SSM components
  • Base: unsloth/Qwen3.5-9B
  • Parameters: ~9B
  • Hidden Size: 4,096
  • Layers: 32 (24 linear attention + 8 full attention, every 4th layer)
  • Attention Heads: 16 (4 KV heads, GQA)
  • Head Dim: 256
  • Intermediate Size: 12,288
  • Activation: SiLU
  • Context Length: 262,144 tokens
  • Vocab Size: 248,320
  • Precision: bfloat16
  • License: Apache 2.0
  • Vision: Qwen3.5 vision encoder (1152-dim, 27-layer, patch size 16)
  • Chat Template: ChatML (<|im_start|> / <|im_end|>)
  • Multimodal Tokens: image (248056), video (248057)
  • RoPE: Multimodal RoPE (mRoPE) with interleaved sections [11, 11, 10], theta = 10,000,000

Layer Architecture Detail

Qwen3.5-9B uses a hybrid attention pattern mixing standard full attention with linear attention layers that include a Mamba-style SSM component (conv kernel dim = 4, SSM dtype float32):

  • Full attention layers: every 4th layer (layers 3, 7, 11, 15, 19, 23, 27, 31)
  • Linear attention layers: all remaining layers (24 of 32)
  • Linear attention config: 16 key heads (dim 128), 32 value heads (dim 128)
  • Partial rotary factor: 0.25
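The numbers above are internally consistent, which a short sketch can confirm from the config values (layer indices are 0-based):

```python
# Recompute the hybrid layer pattern and rotary dimensions from the
# config values listed above (indices are 0-based).
num_layers = 32
full_attn = [i for i in range(num_layers) if i % 4 == 3]  # every 4th layer
linear_attn = [i for i in range(num_layers) if i % 4 != 3]

head_dim = 256
partial_rotary_factor = 0.25
rotary_dim = int(head_dim * partial_rotary_factor)  # 64
mrope_sections = [11, 11, 10]  # cover half the rotary dim (64 // 2 = 32)

print(full_attn)         # [3, 7, 11, 15, 19, 23, 27, 31]
print(len(linear_attn))  # 24
print(rotary_dim, sum(mrope_sections))  # 64 32
```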

Quantizations

All quantizations produced with llama.cpp. IQ quants use an importance matrix computed from WikiText-2 calibration data.

427 tensors per file.

Quant     Size       BPW     Notes
F16       16.69 GB   16.00   Full precision. Use if you have the VRAM/RAM.
Q8_0      8.87 GB    8.51    Near-lossless.
Q6_K      6.85 GB    6.57    Excellent quality.
Q5_K_M    6.02 GB    5.77    Great balance of quality and size.
Q5_K_S    5.87 GB    5.63    Slightly smaller Q5.
Q4_K_M    5.24 GB    5.03    Popular sweet spot; minimal quality loss.
IQ4_NL    5.05 GB    4.84    Importance-matrix 4-bit; can outperform standard Q4 at similar size.
Q4_K_S    4.98 GB    4.78    Smaller Q4.
IQ4_XS    4.84 GB    4.64    Importance-matrix 4-bit, extra small.
Q3_K_M    4.31 GB    4.13    Moderate quality at 3-bit.
IQ3_M     4.11 GB    3.94    Importance-matrix 3-bit; good quality for the size.
IQ3_S     4.07 GB    3.90    Importance-matrix 3-bit, small.
Q3_K_S    3.97 GB    3.80    Smaller 3-bit.
IQ3_XXS   3.67 GB    3.52    Importance-matrix 3-bit, extra extra small.
IQ2_M     3.36 GB    3.22    Extreme compression with imatrix. Quality degrades.
IQ2_S     3.19 GB    3.06    Extreme compression.
IQ2_XXS   2.89 GB    2.77    Very aggressive compression.
IQ1_M     2.68 GB    2.57    Maximum compression. Expect significant quality loss.
IQ1_S     2.55 GB    2.45    Maximum compression. Expect significant quality loss.
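The BPW column can be derived from the F16 file size: F16 stores 16 bits per weight, so a quant's effective bits per weight is roughly (size / F16 size) × 16. Small deviations in the table come from metadata overhead and rounding:

```python
# Effective bits per weight, scaled against the 16-bit reference file.
F16_GB = 16.69  # F16 file size from the table above

def bpw(size_gb):
    return size_gb / F16_GB * 16

print(round(bpw(8.87), 2))  # 8.5  (table lists 8.51 for Q8_0)
print(round(bpw(5.24), 2))  # 5.02 (table lists 5.03 for Q4_K_M)
```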

Usage

llama.cpp

llama-cli -m Qwen3.5-9B-Harmonic-Q4_K_M.gguf -cnv -p "You are a helpful assistant."

In conversation mode (-cnv), the text passed via -p is used as the system prompt, which this model expects (see the note at the top of this card).

LM Studio / Ollama / KoboldCpp

Download any GGUF file and load it directly.
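For Ollama specifically, a local GGUF can be imported with a minimal Modelfile. The filename and system prompt below are illustrative; the ChatML chat template is normally picked up from the GGUF metadata automatically:

```
# Modelfile: import the local GGUF into Ollama
FROM ./Qwen3.5-9B-Harmonic-Q4_K_M.gguf
SYSTEM "You are a helpful assistant."
```

Then build and run it with: ollama create qwen3.5-harmonic -f Modelfile, followed by ollama run qwen3.5-harmonic.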

Credits

Thanks to Jackrong, whose Qwopus models informed the training-data methodology, and to the Unsloth and Hugging Face TRL teams for the fine-tuning tooling.
