# Qwen3.6-35B-A3B-APEX-IQuality-GGUF
Qwen3.6-35B-A3B quantized with APEX imatrix built from real Hermes agent session traces.
Most GGUF quants calibrate on generic text. This one calibrates on actual agentic workloads: tool calls, multi-turn reasoning, code generation, and task completions from production Hermes agent sessions. If you run local agents, the quantization importance weights reflect your actual inference distribution.
- 21 GB on disk. Runs on M1 Max (64GB) with large context headroom.
- ~42 tok/s generation, ~620 tok/s prompt processing on Apple Silicon (Metal).
- Vision + video support included (mmproj).
## Model Description
Built from Qwen/Qwen3.6-35B-A3B, a 35B hybrid MoE model with 256 experts (8 active per token) and 34.66B total parameters. It combines attention layers with Gated Delta Net SSM layers (full attention every 4th layer) and is trained to 262K context.

This GGUF applies APEX quantization with a custom imatrix built entirely from Hermes agent session traces: real multi-turn conversations including tool calls, reasoning chains, code generation, and agentic task completions. No generic wikitext. Verified with Hermes Agent on Apple Silicon (M1 Max).
## Credits & Attribution

**Base Model:** Qwen/Qwen3.6-35B-A3B
- Original Qwen3.6 MoE release by the Qwen team

**Calibration Dataset:** combined imatrix calibration
- bartowski's calibration dataset v3: high-quality general calibration base
- Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, scientific queries)
- Combined and extracted using `extract_hermes_traces.py` with the Qwen3.6 chat template
**APEX Quantization:** mudler/apex-quant
- Reference: mudler/Qwen3.5-35B-A3B-APEX-GGUF

**TurboQuant backend:** custom Metal kernels for M-series Apple Silicon
- 3.5× faster TQ4_1S kernel, MoE 256-expert kernel instantiations

**This release:** @luffydenolan
- Built the Hermes agent trace calibration dataset
- Applied APEX iQuality quantization using the Hermes imatrix
- Local testing and verification on M1 Max (64GB)
## Methodology

- Started from the `Qwen/Qwen3.6-35B-A3B` base (34.66B params, ~65 GB f16, ~34 GB Q8_0)
- Built a custom imatrix calibration set combining:
  - bartowski's calibration dataset v3: general-purpose, high-quality base
  - Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, task completions)
- Generated imatrix importance weights using `llama-imatrix` on the Q8_0 model: `-c 512`, `--chunks 200`, `--threads 10`, `-ngl 99`
- Applied APEX quantization guided by the imatrix weights, producing the iQuality output (21 GB)
- Tested locally on M1 Max (64GB) with the TurboQuant Metal backend
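The imatrix step above can be sketched as a single `llama-imatrix` invocation using the flags listed; the file names (`Qwen3.6-35B-A3B-Q8_0.gguf`, `calibration_hermes_combined.txt`, `imatrix-hermes.dat`) are illustrative placeholders, not the exact paths used for this release:

```shell
# Generate importance-matrix weights from the combined calibration text.
# Flags mirror the methodology above; file names are placeholders.
llama-imatrix \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -f calibration_hermes_combined.txt \
  -o imatrix-hermes.dat \
  -c 512 \
  --chunks 200 \
  --threads 10 \
  -ngl 99
```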
## Architecture Notes
Qwen3.6-35B-A3B is a hybrid MoE + SSM model:
- 40 layers total, full attention every 4th layer (10 attention, 30 SSM/MoE)
- 256 experts per MoE layer, 8 activated per token (~3B active params per forward pass)
- SSM: Gated Delta Net, inner size 4096, state size 128, 16 groups
- GQA: 16 attention heads, 2 KV heads (8× GQA ratio), head dim 256
- Context trained to 262K tokens (rope freq base 10M)
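As a quick sanity check on the layout above, a small sketch (figures taken from this card; the byte count assumes an unquantized fp16 cache for comparison only — the quantized q8_0/turbo4 cache used here is smaller):

```python
# Sanity-check the hybrid layer layout and per-token KV footprint described above.
total_layers = 40
attn_layers = total_layers // 4          # full attention every 4th layer
ssm_layers = total_layers - attn_layers  # remaining Gated Delta Net layers

n_kv_heads, head_dim = 2, 256
ctx = 132_000

# K and V values stored per token, per attention layer (GQA: only KV heads count)
kv_values_per_token = 2 * n_kv_heads * head_dim   # 1024

# fp16 reference footprint for the full 132K context (quantized caches shrink this)
fp16_kv_gib = attn_layers * ctx * kv_values_per_token * 2 / 2**30

print(attn_layers, ssm_layers)        # 10 30
print(f"{fp16_kv_gib:.2f} GiB fp16 KV at 132K ctx")
```

Only 10 of 40 layers carry a KV cache at all, which is why a 132K context stays cheap on this architecture.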
## Files Included

- `Qwen3.6-35B-A3B-apex-iquality.gguf`: main model weights (21 GB)
- `mmproj-F16-Qwen3.6-35B-A3B.gguf`: multimodal projection layer, vision + video (858 MB)
## Usage

### llama.cpp server (text only)

```bash
llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
```
### llama.cpp server (vision/video)

```bash
llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  --mmproj mmproj-F16-Qwen3.6-35B-A3B.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
```
### Recommended sampling parameters
| Parameter | Value | Notes |
|---|---|---|
| temperature | 0.7 | |
| top-k | 20 | Qwen3.x default |
| top-p | 0.95 | |
| min-p | 0.0 | |
| presence-penalty | 0.0 | |
| repeat-penalty | 1.0 | |
| cache-type-k | q8_0 | Best attention accuracy on M-series |
| cache-type-v | turbo4 | Good compression, less sensitive than K |
| ctx-size | 132000 | ~half of the trained 262K; fits 64GB with headroom |
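The table values map directly onto llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch, assuming a default local launch on port 8080 (the URL and prompt are illustrative):

```python
import json
import urllib.request

# Request body using the recommended sampling parameters from the table above.
payload = {
    "messages": [{"role": "user", "content": "List three uses of an imatrix."}],
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.95,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,
}

def chat(url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST the payload to a locally running llama-server instance."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Parameters not set in the request fall back to whatever the server was launched with, so setting them on the command line (as above) is sufficient for most clients.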
### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-apex-iquality.gguf",
    n_ctx=132000,
    n_gpu_layers=99,
    n_threads=4,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
```
## Performance

Tested on Apple M1 Max (64GB unified memory), llama.cpp via the Metal + TurboQuant backend, 4 threads.

### Throughput (llama-bench, 3 runs, no other GPU load)
| Test | Tokens/sec |
|---|---|
| Prompt processing (pp128) | 360.60 ± 13.95 |
| Prompt processing (pp512) | 620.68 ± 2.74 |
| Prompt processing (pp1024) | 611.77 ± 6.17 |
| Token generation (tg128) | 42.24 ± 0.04 |
| Token generation (tg512) | 41.87 ± 0.22 |
- Model size: 21 GiB on disk, loaded to GPU via Metal
- Params: 34.66B total (~3B active per token)
- Quant: APEX iQuality (imatrix, Hermes traces)
- Backend: Metal (GPU) + TurboQuant, Apple M1 Max
- KV cache: q8_0 K / turbo4 V (10 attention layers only, 132K ctx)
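The throughput means translate into rough wall-clock latency as follows (a simple conversion using the figures in the table; it ignores warmup and prompt-length effects):

```python
# Convert measured llama-bench means into rough wall-clock latency.
tg_tok_per_s = 42.24          # tg128 mean from the table above
pp_tok_per_s = 620.68         # pp512 mean

ms_per_token = 1000 / tg_tok_per_s    # ~23.7 ms per generated token
gen_512_s = 512 / 41.87               # ~12.2 s to generate 512 tokens
prefill_4k_s = 4096 / pp_tok_per_s    # ~6.6 s to prefill a 4K-token prompt

print(f"{ms_per_token:.1f} ms/token, {gen_512_s:.1f} s per 512 tokens")
```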
## TurboQuant Build (Apple Silicon)

The `turbo4` KV cache type requires TheTom's TurboQuant fork of llama.cpp:

https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md
### Memory at 132K context (M1 Max 64GB)
| Component | Size |
|---|---|
| Model weights (APEX iQuality) | ~21 GB |
| KV cache (q8_0 K / turbo4 V, 10 attn layers) | ~2 GB |
| SSM recurrent state | ~0.5 GB |
| Metal buffers + overhead | ~3 GB |
| Total RSS | ~26.6 GB |
~37 GB headroom remaining on 64GB for OS + other processes.
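Summing the rounded components in the table roughly reproduces the reported RSS; a small check, with the slight gap to the measured ~26.6 GB attributable to rounding in the table:

```python
# Rough budget check: components from the memory table above (GB, rounded).
budget = {
    "model_weights": 21.0,   # APEX iQuality weights
    "kv_cache": 2.0,         # q8_0 K / turbo4 V, 10 attention layers
    "ssm_state": 0.5,        # recurrent state for the SSM layers
    "metal_overhead": 3.0,   # Metal buffers + misc overhead
}
total = sum(budget.values())   # 26.5 GB, close to the measured ~26.6 GB RSS
headroom = 64 - 26.6           # ~37.4 GB free on a 64GB machine
print(total, round(headroom, 1))
```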
## License

Apache 2.0 (inherited from the base model)
## Citation

```bibtex
@misc{qwen36-35b-a3b-apex-iquality-gguf,
  title  = {Qwen3.6-35B-A3B-APEX-IQuality-GGUF},
  author = {luffydenolan},
  year   = {2026},
  url    = {https://huggingface.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF}
}
```
## Related Models

- Qwen/Qwen3.6-35B-A3B: base model
- mudler/Qwen3.5-35B-A3B-APEX-GGUF: APEX quantization reference