Qwen3.6-35B-A3B – Strix Halo Optimised GGUFs

Dynamic mixed-precision GGUF quantizations of Qwen/Qwen3.6-35B-A3B, produced and benchmarked on a Framework Desktop with AMD Ryzen AI MAX+ 395 (Radeon 8060S, gfx1151, 128 GB UMA) running Vulkan via llama.cpp.

Variants

| File | Size | Prefill (t/s) | Decode (t/s) | Notes |
|---|---|---|---|---|
| Qwen3.6-35B-A3B-Q8_0.gguf | 35 GB | 975 | 52.7 | near-lossless reference |
| Qwen3.6-35B-A3B-Q6_K.gguf | 27 GB | 830 | 62.2 | |
| Qwen3.6-35B-A3B-Q5_K_M.gguf | 24 GB | 943 | 64.1 | |
| Qwen3.6-35B-A3B-Q4_K_M.gguf | 20 GB | 1021 | 70.2 | production sweet spot |
| Qwen3.6-35B-A3B-Q4_0.gguf | 19 GB | 1061 | 76.5 | fastest decode |
| Qwen3.6-35B-A3B-IQ4_NL.gguf | 19 GB | 891 | 73.1 | |
| Qwen3.6-35B-A3B-DYNAMIC.gguf | 19 GB | 1100 | 64.0 | fastest prefill; mixed per-tensor quant |

All numbers were measured with a 4096-token prompt (pp=4096) and 128 generated tokens (tg=128), using `-fa 1 -ctk q8_0 -ctv q8_0 -ub 2048 -b 2048` on a single Vulkan gfx1151 device.
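As a sanity check, decode speed on this class of hardware is roughly memory-bandwidth-bound. A back-of-envelope ceiling can be estimated as bandwidth divided by bytes read per token; the constants below (~3B active parameters per token from the "A3B" suffix, ~4.5 effective bits/weight for the 4-bit variants, ~256 GB/s UMA bandwidth for Strix Halo) are assumptions for illustration, not measured values:

```python
# Rough decode ceiling for a memory-bound MoE. Assumed, not measured:
# ~3B active params/token, ~4.5 bits/weight, ~256 GB/s UMA bandwidth.
# Real decode also streams KV cache and activations, so this is an
# upper bound, not a prediction.
ACTIVE_PARAMS = 3e9
BITS_PER_WEIGHT = 4.5
BANDWIDTH_GBS = 256

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8   # ~1.69 GB/token
ceiling_tps = BANDWIDTH_GBS * 1e9 / bytes_per_token     # ~150 t/s
print(f"~{ceiling_tps:.0f} t/s ceiling")
```

The measured 76.5 t/s on Q4_0 would put the Vulkan backend at roughly half of that theoretical ceiling, which is in the usual range for real workloads.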

Dynamic mix recipe

DYNAMIC.gguf uses a per-tensor quantization map chosen for the hybrid Gated DeltaNet + Gated Attention architecture:

  • attn_k / attn_q / attn_v → Q8_0 (retrieval-critical)
  • attn_output → Q5_K
  • ffn_gate_inp (router) → Q8_0 (routing-critical)
  • ffn_gate_exps / ffn_up_exps / ffn_down_exps (256 routed experts) → IQ4_NL
  • ffn_gate_shexp / ffn_up_shexp / ffn_down_shexp (shared expert) → Q6_K
  • token_embd / output → Q8_0
  • everything else → Q4_K_M (fallback)
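The recipe above amounts to an ordered name-to-type mapping over GGUF tensor names, first match wins. A minimal sketch of that selection logic (the regex patterns and function below are illustrative, not llama.cpp internals or the exact tool invocation used to build the file):

```python
import re

# Ordered (pattern, quant) rules mirroring the DYNAMIC recipe.
# Patterns are illustrative regexes over GGUF tensor names.
RULES = [
    (r"attn_(k|q|v)\.",              "Q8_0"),    # retrieval-critical projections
    (r"attn_output\.",               "Q5_K"),
    (r"ffn_gate_inp\.",              "Q8_0"),    # expert router
    (r"ffn_(gate|up|down)_exps\.",   "IQ4_NL"),  # 256 routed experts
    (r"ffn_(gate|up|down)_shexp\.",  "Q6_K"),    # shared expert
    (r"^(token_embd|output)\.",      "Q8_0"),
]
FALLBACK = "Q4_K_M"

def quant_for(tensor_name: str) -> str:
    """Return the quant type for a tensor name; first matching rule wins."""
    for pattern, quant in RULES:
        if re.search(pattern, tensor_name):
            return quant
    return FALLBACK
```

For example, `quant_for("blk.0.attn_q.weight")` selects Q8_0, while a norm weight like `blk.0.attn_norm.weight` falls through to Q4_K_M.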

Usage

```
llama-bench -m Qwen3.6-35B-A3B-DYNAMIC.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 \
  -ub 2048 -b 2048 -p 4096 -n 128
```
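For interactive use rather than benchmarking, the same cache and offload flags carry over to `llama-server`; the invocation below is a sketch, and the context size and port are illustrative choices, not tested settings:

```shell
# Serve the DYNAMIC quant over llama.cpp's OpenAI-compatible HTTP API.
# -ngl 99 offloads all layers; -fa 1 and the q8_0 KV cache match the
# benchmark configuration above. -c 16384 is an assumed context size.
llama-server -m Qwen3.6-35B-A3B-DYNAMIC.gguf -ngl 99 -fa 1 \
  -ctk q8_0 -ctv q8_0 -c 16384 --port 8080
```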

Benchmark context

Part of a research series on pushing Qwen3.5/3.6 throughput on AMD Strix Halo. Methodology, scripts, and live results are on the benchmark site linked from the GitHub repo.

License

Apache 2.0 (inherited from base model).
