Qwen3.6-35B-A3B-APEX-IQuality-GGUF

Qwen3.6-35B-A3B quantized with APEX imatrix built from real Hermes agent session traces.

Most GGUF quants calibrate on generic text. This one calibrates on actual agentic workloads: tool calls, multi-turn reasoning, code generation, and task completions from production Hermes agent sessions. If you run local agents, the importance weights that guided quantization reflect your actual inference distribution.

  • 21 GB on disk. Runs on M1 Max (64GB) with large context headroom.
  • ~42 tok/s generation, ~620 tok/s prompt processing on Apple Silicon (Metal).
  • Vision + video support included (mmproj).

Model Description

Built from Qwen/Qwen3.6-35B-A3B, a hybrid MoE model with 256 experts, 8 active per token, and 34.66B total parameters. It combines attention layers with Gated Delta Net SSM layers (full attention every 4th layer) and was trained to 262K context.

This GGUF applies APEX quantization with a custom imatrix built from Hermes agent session traces (real multi-turn conversations including tool calls, reasoning chains, code generation, and agentic task completions) layered on a general calibration base, rather than generic wikitext alone. Verified for use with Hermes Agent on Apple Silicon (M1 Max).

Credits & Attribution

  • Base Model: Qwen/Qwen3.6-35B-A3B

    • Original Qwen3.6 MoE release by Qwen team
  • Calibration Dataset: Combined imatrix calibration

    • bartowski's calibration dataset v3: high-quality general calibration base
    • Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, scientific queries)
    • Combined and extracted using extract_hermes_traces.py with Qwen3.6 chat template
  • APEX Quantization: mudler/apex-quant

  • TurboQuant backend: Custom Metal kernels for M-series Apple Silicon

    • 3.5× faster TQ4_1S kernel, MoE 256-expert kernel instantiations
  • This release: @luffydenolan

    • Built Hermes agent trace calibration dataset
    • Applied APEX iQuality quantization using Hermes imatrix
    • Local testing and verification on M1 Max (64GB)

Methodology

  1. Started from Qwen/Qwen3.6-35B-A3B base (34.66B params, ~65GB f16, ~34GB Q8_0)
  2. Built custom imatrix calibration combining:
    • bartowski's calibration dataset v3: general-purpose high-quality base
    • Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, task completions)
  3. Generated imatrix importance weights using llama-imatrix on Q8_0:
    • -c 512, --chunks 200, --threads 10, -ngl 99
  4. Applied APEX quantization guided by imatrix weights → iQuality output (21GB)
  5. Tested locally on M1 Max (64GB) with TurboQuant Metal backend
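The imatrix generation step can be sketched as the command line below, assembled in Python. The flags come from step 3 above; the model and calibration file paths are hypothetical placeholders, not files shipped with this release.

```python
# Sketch: assemble the llama-imatrix invocation from the flags listed in step 3.
# Both paths are placeholders -- substitute your own Q8_0 GGUF and calibration text.
model_path = "Qwen3.6-35B-A3B-Q8_0.gguf"            # hypothetical local path
calib_path = "hermes_traces_plus_bartowski_v3.txt"  # hypothetical combined calibration file

cmd = [
    "llama-imatrix",
    "-m", model_path,
    "-f", calib_path,
    "-o", "imatrix.dat",
    "-c", "512",        # calibration context length per chunk
    "--chunks", "200",  # number of calibration chunks
    "--threads", "10",
    "-ngl", "99",       # offload all layers to GPU
]
print(" ".join(cmd))
```

Running this with `subprocess.run(cmd)` (binary on PATH assumed) produces the `imatrix.dat` consumed by the APEX quantization step.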

Architecture Notes

Qwen3.6-35B-A3B is a hybrid MoE + SSM model:

  • 40 layers total, full attention every 4th layer (10 attention, 30 SSM/MoE)
  • 256 experts per MoE layer, 8 activated per token (~3B active params per forward pass)
  • SSM: Gated Delta Net, inner size 4096, state size 128, 16 groups
  • GQA: 16 attention heads, 2 KV heads (8× GQA), head dim 256
  • Context trained to 262K tokens (rope freq base 10M)
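The layer and head counts above can be sanity-checked with a few lines of arithmetic. The 1-based "every 4th layer" indexing is an assumption about the layout; the head/dim numbers are taken directly from the list.

```python
# Layer layout: 40 layers, full attention on every 4th (assuming 1-based counting),
# Gated Delta Net SSM layers elsewhere.
n_layers = 40
attn_layers = [i for i in range(1, n_layers + 1) if i % 4 == 0]
ssm_layers = [i for i in range(1, n_layers + 1) if i % 4 != 0]
assert len(attn_layers) == 10 and len(ssm_layers) == 30

# GQA: 16 query heads share 2 KV heads -> 8x grouping.
n_heads, n_kv_heads, head_dim = 16, 2, 256
gqa_ratio = n_heads // n_kv_heads  # 8

# Per-token KV footprint of ONE attention layer at f16 (2 bytes/element), K + V:
kv_bytes_per_token = 2 * n_kv_heads * head_dim * 2
print(gqa_ratio, kv_bytes_per_token)  # 8 2048
```

Only 10 of the 40 layers carry a KV cache at all, which is why the cache stays small even at 132K context (the SSM layers keep a fixed-size recurrent state instead).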

Files Included

  • Qwen3.6-35B-A3B-apex-iquality.gguf: main model weights (21 GB)
  • mmproj-F16-Qwen3.6-35B-A3B.gguf: multimodal projection layer, vision + video (858 MB)

Usage

llama.cpp server (text only)

llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

llama.cpp server (vision/video)

llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  --mmproj mmproj-F16-Qwen3.6-35B-A3B.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

Recommended sampling parameters

| Parameter        | Value  | Notes                                           |
|------------------|--------|-------------------------------------------------|
| temperature      | 0.7    |                                                 |
| top-k            | 20     | Qwen3.x default                                 |
| top-p            | 0.95   |                                                 |
| min-p            | 0.0    |                                                 |
| presence-penalty | 0.0    |                                                 |
| repeat-penalty   | 1.0    |                                                 |
| cache-type-k     | q8_0   | Best attention accuracy on M-series             |
| cache-type-v     | turbo4 | Good compression, less sensitive than K         |
| ctx-size         | 132000 | ~half of trained 262K; fits 64GB with headroom  |
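If you run the server commands above, the same sampling parameters can be passed per-request through the OpenAI-compatible endpoint. A minimal sketch, assuming llama-server's default bind of 127.0.0.1:8080 and that the server accepts llama.cpp's extra sampling fields (top_k, min_p, repeat_penalty) alongside the standard OpenAI ones:

```python
import json
from urllib import request

# Build a chat request carrying the recommended sampling parameters.
payload = {
    "messages": [{"role": "user", "content": "List three uses of an imatrix."}],
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.95,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,
}
req = request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",  # assumed default host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```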

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-apex-iquality.gguf",
    n_ctx=132000,
    n_gpu_layers=99,
    n_threads=4,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])

Performance

Tested on Apple M1 Max (64GB unified memory), llama.cpp via Metal + TurboQuant backend, 4 threads.

Throughput (llama-bench, 3 runs, solo; no other GPU load)

| Test                       | Tokens/sec     |
|----------------------------|----------------|
| Prompt processing (pp128)  | 360.60 ± 13.95 |
| Prompt processing (pp512)  | 620.68 ± 2.74  |
| Prompt processing (pp1024) | 611.77 ± 6.17  |
| Token generation (tg128)   | 42.24 ± 0.04   |
| Token generation (tg512)   | 41.87 ± 0.22   |

  • Model size: 21 GiB on disk, loaded to GPU via Metal
  • Params: 34.66B total (~3B active per token)
  • Quant: APEX iQuality (imatrix, Hermes traces)
  • Backend: Metal (GPU) + TurboQuant, Apple M1 Max
  • KV cache: q8_0 K / turbo4 V (10 attention layers only, 132K ctx)
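The measured rates translate into back-of-envelope latencies for a worst-case request. This assumes the pp512/tg128 rates hold at long context, which is optimistic (throughput typically degrades as the cache fills):

```python
# Estimate end-to-end latency from the benchmarked rates above.
pp_rate = 620.68  # tok/s prompt processing (pp512)
tg_rate = 42.24   # tok/s token generation (tg128)
prompt_tokens, reply_tokens = 132_000, 1_000

prefill_s = prompt_tokens / pp_rate  # time to ingest a full 132K prompt
decode_s = reply_tokens / tg_rate    # time to generate a 1K-token reply
print(f"prefill ~{prefill_s / 60:.1f} min, decode ~{decode_s:.0f} s")
```

So a completely full context costs roughly three and a half minutes of prefill, while typical agent turns (a few thousand prompt tokens) stay in the seconds range.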

TurboQuant Build (Apple Silicon)

turbo4 KV cache requires TheTom's TurboQuant fork of llama.cpp.

https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

Memory at 132K context (M1 Max 64GB)

| Component                                    | Size     |
|----------------------------------------------|----------|
| Model weights (APEX iQuality)                | ~21 GB   |
| KV cache (q8_0 K / turbo4 V, 10 attn layers) | ~2 GB    |
| SSM recurrent state                          | ~0.5 GB  |
| Metal buffers + overhead                     | ~3 GB    |
| Total RSS                                    | ~26.6 GB |

~37 GB headroom remaining on 64GB for OS + other processes.
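The budget above is a straightforward sum; a quick sketch using the table's rounded component sizes (which is why it lands at 26.5 GB rather than the measured ~26.6 GB RSS):

```python
# Reproduce the memory budget from the table (values in GB, rounded).
components = {
    "model_weights": 21.0,   # APEX iQuality GGUF
    "kv_cache": 2.0,         # q8_0 K / turbo4 V, 10 attention layers, 132K ctx
    "ssm_state": 0.5,        # fixed-size recurrent state for the SSM layers
    "metal_overhead": 3.0,   # Metal buffers + runtime overhead
}
total = sum(components.values())
headroom = 64 - total
print(total, headroom)  # 26.5 37.5
```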

License

Apache 2.0 (inherited from base model)

Citation

@misc{qwen36-35b-a3b-apex-iquality-gguf,
  title = {Qwen3.6-35B-A3B-APEX-IQuality-GGUF},
  author = {luffydenolan},
  year = {2026},
  url = {https://huggingface.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF}
}
