Note: These models are optimized for use within an agentic harness (e.g., Hermes Agent) and may behave unexpectedly in raw inference without a system prompt. Capability benchmarks are strong, but conversational behavior outside a structured harness is not reliable. I am currently working on v2 to address this and reduce harness dependency.

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. It's a hobby that got out of hand. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

ko-fi.com/djlougen


Qwen3.5-9B-Harmonic

Quantizations of DJLougen/Qwen3.5-9B-Harmonic, a Qwen3.5-9B model fine-tuned with Unsloth and Hugging Face's TRL library.

Benchmark Results

Code Generation

Evaluated using EvalPlus with greedy decoding (temperature=0). Served via LM Studio local inference.

Benchmark             DJL-Qwen3.5-9B-Harmonic   Qwopus3.5-9B-v3   Qwen3.5-9B (base)
HumanEval (pass@1)    87.2%                     87.8%             82.9%
HumanEval+ (pass@1)   81.7%                     82.9%             77.4%

Comparison models: Jackrong/Qwopus3.5-9B-v3, unsloth/Qwen3.5-9B.

MMLU-Pro (Knowledge & Reasoning)

Evaluated on 280 questions (40 per category) from the TIGER-Lab/MMLU-Pro test split using Q8_0 quantization served via Ollama. Chain-of-thought prompting with greedy decoding (temperature=0).

Benchmark             DJL-Qwen3.5-9B-Harmonic   Qwopus3.5-9B-v3
MMLU-Pro (overall)    80.36% (225/280)          81.79% (229/280)

Per-category breakdown (DJL-Qwen3.5-9B-Harmonic):

Category           Accuracy
Math               92.5% (37/40)
Biology            90.0% (36/40)
Chemistry          87.5% (35/40)
Physics            87.5% (35/40)
Computer Science   75.0% (30/40)
Health             70.0% (28/40)
Other              60.0% (24/40)
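As a sanity check, the per-category counts sum to the reported overall score; a few lines of Python reproduce the arithmetic from the table:

```python
# Correct-answer counts per category from the table above (40 questions each).
counts = {
    "Math": 37, "Biology": 36, "Chemistry": 35, "Physics": 35,
    "Computer Science": 30, "Health": 28, "Other": 24,
}

total = sum(counts.values())   # 225
questions = 40 * len(counts)   # 280
print(f"{total}/{questions} = {total / questions:.2%}")  # 225/280 = 80.36%
```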

Training Data

Fine-tuned on only ~800 rows of self-generated Claude responses. I analyzed the data behind the Jackrong Qwopus models to understand what worked, then generated my own training data from scratch, structured so that quality could be checked quantitatively. The reasoning traces follow a phased structure, and the distribution of reasoning effort across those phases was shaped using ideas from harmonic and Fourier analysis. The idea is that reasoning-depth allocation can be treated like a signal-processing problem, where different phases of thought are weighted according to frequency-domain characteristics of the problem structure. ~800 rows turned out to be more than enough when the shape of the reasoning is right.
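To make the "reasoning depth as a signal-processing problem" idea concrete, here is a toy sketch. Everything in it (the phase names, the 1/k harmonic weighting) is my illustrative assumption, not the actual data-generation pipeline: it allocates a total reasoning-token budget across ordered phases using harmonic-series weights, so earlier phases get proportionally more depth.

```python
# Toy sketch only: split a reasoning-token budget across phases using
# harmonic-series weights 1/k. The phase names and the weighting scheme are
# illustrative, not the actual pipeline used for the training data.

def harmonic_budget(phases, total_tokens):
    """Allocate total_tokens across phases proportionally to weights 1/k."""
    weights = [1 / k for k in range(1, len(phases) + 1)]
    norm = sum(weights)
    return {p: round(total_tokens * w / norm) for p, w in zip(phases, weights)}

phases = ["frame", "decompose", "solve", "verify"]  # hypothetical phase names
print(harmonic_budget(phases, 1000))
# {'frame': 480, 'decompose': 240, 'solve': 160, 'verify': 120}
```

The point of a scheme like this is that it makes the shape of a reasoning trace measurable, so data quality can be checked quantitatively rather than by eye.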

Base Model

  • Architecture: Qwen3.5 (Qwen3_5ForConditionalGeneration) — hybrid linear + full attention with Mamba SSM components
  • Base: unsloth/Qwen3.5-9B
  • Parameters: ~9B
  • Hidden Size: 4,096
  • Layers: 32 (24 linear attention + 8 full attention, every 4th layer)
  • Attention Heads: 16 (4 KV heads, GQA)
  • Head Dim: 256
  • Intermediate Size: 12,288
  • Activation: SiLU
  • Context Length: 262,144 tokens
  • Vocab Size: 248,320
  • Precision: bfloat16
  • License: Apache 2.0
  • Vision: Qwen3.5 vision encoder (1152-dim, 27-layer, patch size 16)
  • Chat Template: ChatML (<|im_start|> / <|im_end|>)
  • Multimodal Tokens: image (248056), video (248057)
  • RoPE: Multimodal RoPE (mRoPE) with interleaved sections [11, 11, 10], theta = 10,000,000

Layer Architecture Detail

Qwen3.5-9B uses a hybrid attention pattern mixing standard full attention with linear attention layers that include a Mamba-style SSM component (conv kernel dim = 4, SSM dtype float32):

  • Full attention layers: every 4th layer (layers 3, 7, 11, 15, 19, 23, 27, 31)
  • Linear attention layers: all remaining layers (24 of 32)
  • Linear attention config: 16 key heads (dim 128), 32 value heads (dim 128)
  • Partial rotary factor: 0.25
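The numbers above are internally consistent, which a short sketch can confirm from the config values (layer indices are 0-based):

```python
# Recompute the hybrid layer pattern and rotary dimensions from the
# config values listed above (indices are 0-based).
num_layers = 32
full_attn = [i for i in range(num_layers) if i % 4 == 3]  # every 4th layer
linear_attn = [i for i in range(num_layers) if i % 4 != 3]

head_dim = 256
partial_rotary_factor = 0.25
rotary_dim = int(head_dim * partial_rotary_factor)  # 64
mrope_sections = [11, 11, 10]  # cover half the rotary dim (64 // 2 = 32)

print(full_attn)         # [3, 7, 11, 15, 19, 23, 27, 31]
print(len(linear_attn))  # 24
print(rotary_dim, sum(mrope_sections))  # 64 32
```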

Quantizations

All quantizations produced with llama.cpp. IQ quants use an importance matrix computed from WikiText-2 calibration data.

427 tensors per file.

Quant     Size       BPW     Notes
F16       16.69 GB   16.00   Full precision. Use if you have the VRAM/RAM.
Q8_0      8.87 GB    8.51    Near-lossless.
Q6_K      6.85 GB    6.57    Excellent quality.
Q5_K_M    6.02 GB    5.77    Great balance of quality and size.
Q5_K_S    5.87 GB    5.63    Slightly smaller Q5.
Q4_K_M    5.24 GB    5.03    Popular sweet spot; minimal quality loss.
IQ4_NL    5.05 GB    4.84    Importance-matrix 4-bit; can outperform standard Q4 at similar size.
Q4_K_S    4.98 GB    4.78    Smaller Q4.
IQ4_XS    4.84 GB    4.64    Importance-matrix 4-bit, extra small.
Q3_K_M    4.31 GB    4.13    Moderate quality at 3-bit.
IQ3_M     4.11 GB    3.94    Importance-matrix 3-bit; good quality for the size.
IQ3_S     4.07 GB    3.90    Importance-matrix 3-bit, small.
Q3_K_S    3.97 GB    3.80    Smaller 3-bit.
IQ3_XXS   3.67 GB    3.52    Importance-matrix 3-bit, extra extra small.
IQ2_M     3.36 GB    3.22    Extreme compression with imatrix. Quality degrades.
IQ2_S     3.19 GB    3.06    Extreme compression.
IQ2_XXS   2.89 GB    2.77    Very aggressive compression.
IQ1_M     2.68 GB    2.57    Maximum compression. Expect significant quality loss.
IQ1_S     2.55 GB    2.45    Maximum compression. Expect significant quality loss.
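The BPW column can be derived from the F16 file size: F16 stores 16 bits per weight, so a quant's effective bits per weight is roughly (size / F16 size) × 16. Small deviations in the table come from metadata overhead and rounding:

```python
# Effective bits per weight, scaled against the 16-bit reference file.
F16_GB = 16.69  # F16 file size from the table above

def bpw(size_gb):
    return size_gb / F16_GB * 16

print(round(bpw(8.87), 2))  # 8.5  (table lists 8.51 for Q8_0)
print(round(bpw(5.24), 2))  # 5.02 (table lists 5.03 for Q4_K_M)
```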

Usage

llama.cpp

llama-cli -m Qwen3.5-9B-Harmonic-Q4_K_M.gguf -cnv -p "You are a helpful assistant."

In conversation mode (-cnv), the text passed via -p is used as the system prompt, which this model expects (see the note at the top of this card).

LM Studio / Ollama / KoboldCpp

Download any GGUF file and load it directly.
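For Ollama specifically, a local GGUF can be imported with a minimal Modelfile. The filename and system prompt below are illustrative; the ChatML chat template is normally picked up from the GGUF metadata automatically:

```
# Modelfile: import the local GGUF into Ollama
FROM ./Qwen3.5-9B-Harmonic-Q4_K_M.gguf
SYSTEM "You are a helpful assistant."
```

Then build and run it with: ollama create qwen3.5-harmonic -f Modelfile, followed by ollama run qwen3.5-harmonic.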

Credits

Thanks to Jackrong, whose Qwopus models informed the training-data methodology, and to the Unsloth and Hugging Face TRL teams for the fine-tuning tooling.
