Qwen3.6-35B-A3B-APEX-IQuality-GGUF

Qwen3.6-35B-A3B quantized with APEX imatrix built from real Hermes agent session traces.

Most GGUF quants calibrate on generic text. This one calibrates on actual agentic workloads: tool calls, multi-turn reasoning, code generation, and task completions from production Hermes agent sessions. If you run local agents, the importance weights that guided quantization reflect your actual inference distribution.

  • 21 GB on disk. Runs on M1 Max (64GB) with large context headroom.
  • ~42 tok/s generation, ~620 tok/s prompt processing on Apple Silicon (Metal).
  • Vision + video support included (mmproj).

Model Description

Built from Qwen/Qwen3.6-35B-A3B, a hybrid MoE model with 256 experts, 8 active per token, and 34.66B total parameters. It combines attention layers with Gated Delta Net SSM layers (full attention every 4th layer) and was trained to 262K context.

This GGUF applies APEX quantization with a custom imatrix built from Hermes agent session traces (real multi-turn conversations including tool calls, reasoning chains, code generation, and agentic task completions) layered on a general calibration base, rather than generic wikitext alone. Verified for use with Hermes Agent on Apple Silicon (M1 Max).

Credits & Attribution

  • Base Model: Qwen/Qwen3.6-35B-A3B

    • Original Qwen3.6 MoE release by Qwen team
  • Calibration Dataset: Combined imatrix calibration

    • bartowski's calibration dataset v3: high-quality general calibration base
    • Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, scientific queries)
    • Combined and extracted using extract_hermes_traces.py with Qwen3.6 chat template
  • APEX Quantization: mudler/apex-quant

  • TurboQuant backend: Custom Metal kernels for M-series Apple Silicon

    • 3.5× faster TQ4_1S kernel, MoE 256-expert kernel instantiations
  • This release: @luffydenolan

    • Built Hermes agent trace calibration dataset
    • Applied APEX iQuality quantization using Hermes imatrix
    • Local testing and verification on M1 Max (64GB)

Methodology

  1. Started from Qwen/Qwen3.6-35B-A3B base (34.66B params, ~65GB f16, ~34GB Q8_0)
  2. Built custom imatrix calibration combining:
    • bartowski's calibration dataset v3: general-purpose high-quality base
    • Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, task completions)
  3. Generated imatrix importance weights using llama-imatrix on Q8_0:
    • -c 512, --chunks 200, --threads 10, -ngl 99
  4. Applied APEX quantization guided by imatrix weights → iQuality output (21GB)
  5. Tested locally on M1 Max (64GB) with TurboQuant Metal backend
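The imatrix generation step can be sketched as the command line below, assembled in Python. The flags come from step 3 above; the model and calibration file paths are hypothetical placeholders, not files shipped with this release.

```python
# Sketch: assemble the llama-imatrix invocation from the flags listed in step 3.
# Both paths are placeholders -- substitute your own Q8_0 GGUF and calibration text.
model_path = "Qwen3.6-35B-A3B-Q8_0.gguf"            # hypothetical local path
calib_path = "hermes_traces_plus_bartowski_v3.txt"  # hypothetical combined calibration file

cmd = [
    "llama-imatrix",
    "-m", model_path,
    "-f", calib_path,
    "-o", "imatrix.dat",
    "-c", "512",        # calibration context length per chunk
    "--chunks", "200",  # number of calibration chunks
    "--threads", "10",
    "-ngl", "99",       # offload all layers to GPU
]
print(" ".join(cmd))
```

Running this with `subprocess.run(cmd)` (binary on PATH assumed) produces the `imatrix.dat` consumed by the APEX quantization step.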

Architecture Notes

Qwen3.6-35B-A3B is a hybrid MoE + SSM model:

  • 40 layers total, full attention every 4th layer (10 attention, 30 SSM/MoE)
  • 256 experts per MoE layer, 8 activated per token (~3B active params per forward pass)
  • SSM: Gated Delta Net, inner size 4096, state size 128, 16 groups
  • GQA: 16 attention heads, 2 KV heads (8× GQA), head dim 256
  • Context trained to 262K tokens (rope freq base 10M)
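The layer and head counts above can be sanity-checked with a few lines of arithmetic. The 1-based "every 4th layer" indexing is an assumption about the layout; the head/dim numbers are taken directly from the list.

```python
# Layer layout: 40 layers, full attention on every 4th (assuming 1-based counting),
# Gated Delta Net SSM layers elsewhere.
n_layers = 40
attn_layers = [i for i in range(1, n_layers + 1) if i % 4 == 0]
ssm_layers = [i for i in range(1, n_layers + 1) if i % 4 != 0]
assert len(attn_layers) == 10 and len(ssm_layers) == 30

# GQA: 16 query heads share 2 KV heads -> 8x grouping.
n_heads, n_kv_heads, head_dim = 16, 2, 256
gqa_ratio = n_heads // n_kv_heads  # 8

# Per-token KV footprint of ONE attention layer at f16 (2 bytes/element), K + V:
kv_bytes_per_token = 2 * n_kv_heads * head_dim * 2
print(gqa_ratio, kv_bytes_per_token)  # 8 2048
```

Only 10 of the 40 layers carry a KV cache at all, which is why the cache stays small even at 132K context (the SSM layers keep a fixed-size recurrent state instead).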

Files Included

  • Qwen3.6-35B-A3B-apex-iquality.gguf: main model weights (21 GB)
  • mmproj-F16-Qwen3.6-35B-A3B.gguf: multimodal projection layer, vision + video (858 MB)

Usage

llama.cpp server (text only)

llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

llama.cpp server (vision/video)

llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  --mmproj mmproj-F16-Qwen3.6-35B-A3B.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

Recommended sampling parameters

| Parameter        | Value  | Notes                                           |
|------------------|--------|-------------------------------------------------|
| temperature      | 0.7    |                                                 |
| top-k            | 20     | Qwen3.x default                                 |
| top-p            | 0.95   |                                                 |
| min-p            | 0.0    |                                                 |
| presence-penalty | 0.0    |                                                 |
| repeat-penalty   | 1.0    |                                                 |
| cache-type-k     | q8_0   | Best attention accuracy on M-series             |
| cache-type-v     | turbo4 | Good compression, less sensitive than K         |
| ctx-size         | 132000 | ~half of trained 262K; fits 64GB with headroom  |
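If you run the server commands above, the same sampling parameters can be passed per-request through the OpenAI-compatible endpoint. A minimal sketch, assuming llama-server's default bind of 127.0.0.1:8080 and that the server accepts llama.cpp's extra sampling fields (top_k, min_p, repeat_penalty) alongside the standard OpenAI ones:

```python
import json
from urllib import request

# Build a chat request carrying the recommended sampling parameters.
payload = {
    "messages": [{"role": "user", "content": "List three uses of an imatrix."}],
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.95,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,
}
req = request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",  # assumed default host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```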

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-apex-iquality.gguf",
    n_ctx=132000,
    n_gpu_layers=99,
    n_threads=4,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])

Performance

Tested on Apple M1 Max (64GB unified memory), llama.cpp via Metal + TurboQuant backend, 4 threads.

Throughput (llama-bench, 3 runs, solo; no other GPU load)

| Test                       | Tokens/sec     |
|----------------------------|----------------|
| Prompt processing (pp128)  | 360.60 ± 13.95 |
| Prompt processing (pp512)  | 620.68 ± 2.74  |
| Prompt processing (pp1024) | 611.77 ± 6.17  |
| Token generation (tg128)   | 42.24 ± 0.04   |
| Token generation (tg512)   | 41.87 ± 0.22   |

  • Model size: 21 GiB on disk, loaded to GPU via Metal
  • Params: 34.66B total (~3B active per token)
  • Quant: APEX iQuality (imatrix, Hermes traces)
  • Backend: Metal (GPU) + TurboQuant, Apple M1 Max
  • KV cache: q8_0 K / turbo4 V (10 attention layers only, 132K ctx)
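The measured rates translate into back-of-envelope latencies for a worst-case request. This assumes the pp512/tg128 rates hold at long context, which is optimistic (throughput typically degrades as the cache fills):

```python
# Estimate end-to-end latency from the benchmarked rates above.
pp_rate = 620.68  # tok/s prompt processing (pp512)
tg_rate = 42.24   # tok/s token generation (tg128)
prompt_tokens, reply_tokens = 132_000, 1_000

prefill_s = prompt_tokens / pp_rate  # time to ingest a full 132K prompt
decode_s = reply_tokens / tg_rate    # time to generate a 1K-token reply
print(f"prefill ~{prefill_s / 60:.1f} min, decode ~{decode_s:.0f} s")
```

So a completely full context costs roughly three and a half minutes of prefill, while typical agent turns (a few thousand prompt tokens) stay in the seconds range.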

TurboQuant Build (Apple Silicon)

turbo4 KV cache requires TheTom's TurboQuant fork of llama.cpp.

https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

Memory at 132K context (M1 Max 64GB)

| Component                                    | Size     |
|----------------------------------------------|----------|
| Model weights (APEX iQuality)                | ~21 GB   |
| KV cache (q8_0 K / turbo4 V, 10 attn layers) | ~2 GB    |
| SSM recurrent state                          | ~0.5 GB  |
| Metal buffers + overhead                     | ~3 GB    |
| Total RSS                                    | ~26.6 GB |

~37 GB headroom remaining on 64GB for OS + other processes.
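The budget above is a straightforward sum; a quick sketch using the table's rounded component sizes (which is why it lands at 26.5 GB rather than the measured ~26.6 GB RSS):

```python
# Reproduce the memory budget from the table (values in GB, rounded).
components = {
    "model_weights": 21.0,   # APEX iQuality GGUF
    "kv_cache": 2.0,         # q8_0 K / turbo4 V, 10 attention layers, 132K ctx
    "ssm_state": 0.5,        # fixed-size recurrent state for the SSM layers
    "metal_overhead": 3.0,   # Metal buffers + runtime overhead
}
total = sum(components.values())
headroom = 64 - total
print(total, headroom)  # 26.5 37.5
```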

License

Apache 2.0 (inherited from base model)

Citation

@misc{qwen36-35b-a3b-apex-iquality-gguf,
  title = {Qwen3.6-35B-A3B-APEX-IQuality-GGUF},
  author = {luffydenolan},
  year = {2026},
  url = {https://huggingface.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF}
}
