Qwen3.6-35B-A3B — UD-Q4_K_XL (mlx-node)
Mixed-precision quantization (4-bit base) of Qwen/Qwen3.6-35B-A3B for Apple Silicon, produced with mlx-node using the Unsloth Dynamic quantization strategy.
| | Original (BF16) | This Model |
|---|---|---|
| Size | ~66 GB | 22 GB |
| Format | SafeTensors (sharded) | SafeTensors (sharded) |
| Precision | BF16 uniform | Mixed 4/…/8-bit + BF16 |
All Variants
| Repo | GGUF Equivalent | Size | Decode (tok/s) | Speedup vs BF16 |
|---|---|---|---|---|
| Brooooooklyn/Qwen3.6-35B-A3B-UD-Q2_K_XL-mlx | UD-Q2_K_XL | 14 GB | 99.2 | 2.42x |
| Brooooooklyn/Qwen3.6-35B-A3B-UD-Q3_K_XL-mlx | UD-Q3_K_XL | 18 GB | 83.6 | 2.04x |
| Brooooooklyn/Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx | UD-Q4_K_XL | 22 GB | 80.9 | 1.97x |
| Brooooooklyn/Qwen3.6-35B-A3B-UD-Q5_K_XL-mlx | UD-Q5_K_XL | 26 GB | 73.8 | 1.80x |
| Brooooooklyn/Qwen3.6-35B-A3B-UD-Q6_K_XL-mlx | UD-Q6_K_XL | 31 GB | 73.9 | 1.80x |
| Brooooooklyn/Qwen3.6-35B-A3B-UD-Q8_K_XL-mlx | UD-Q8_K_XL | 36 GB | 73.0 | 1.78x |
Benchmarked on Apple M3 Max 128GB via examples/lm.ts (Turn 4 steady-state decode).
Performance
| Model | Size | Decode (tok/s) | Speedup |
|---|---|---|---|
| BF16 (unquantized) | 66 GB | 41.0 | baseline |
| This model (UD-Q4_K_XL) | 22 GB | 80.9 | 1.97x faster |
Decode is memory-bandwidth bound on Apple Silicon, so reading fewer bytes per token translates directly into higher throughput. The MoE architecture activates only 8 of 256 experts per token (~3B active parameters out of 35.9B total).
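The speedup figures in the tables above can be reproduced from the measured decode rates. The following is a minimal sketch with the numbers copied from this card; the `Variant` type and `speedupVsBf16` helper are illustrative names, not part of mlx-node.

```typescript
// Figures copied from the tables above; names are illustrative only.
const BF16_TOK_PER_SEC = 41.0;
const BF16_SIZE_GB = 66;

interface Variant {
  name: string;
  sizeGB: number;
  tokPerSec: number;
}

const variants: Variant[] = [
  { name: 'UD-Q2_K_XL', sizeGB: 14, tokPerSec: 99.2 },
  { name: 'UD-Q3_K_XL', sizeGB: 18, tokPerSec: 83.6 },
  { name: 'UD-Q4_K_XL', sizeGB: 22, tokPerSec: 80.9 },
  { name: 'UD-Q5_K_XL', sizeGB: 26, tokPerSec: 73.8 },
  { name: 'UD-Q6_K_XL', sizeGB: 31, tokPerSec: 73.9 },
  { name: 'UD-Q8_K_XL', sizeGB: 36, tokPerSec: 73.0 },
];

// Decode speedup relative to the BF16 baseline, rounded to two decimals.
function speedupVsBf16(tokPerSec: number): number {
  return Math.round((tokPerSec / BF16_TOK_PER_SEC) * 100) / 100;
}

for (const v of variants) {
  const sizeRatio = (BF16_SIZE_GB / v.sizeGB).toFixed(2);
  console.log(`${v.name}: ${speedupVsBf16(v.tokPerSec)}x decode, ${sizeRatio}x smaller`);
}
```

On these numbers the throughput gain is smaller than the size reduction (1.97x faster versus 3x smaller for this model), so checkpoint size alone is only a rough predictor of decode speed.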
Per-Tensor Bit Assignments (N=4)
| Weight | Bits | Rationale |
|---|---|---|
| embed_tokens | 6-bit | KLD ~0.15, very low sensitivity |
| lm_head | 8-bit | KLD ~0.05, safest tensor |
| self_attn.q/k/v_proj | 6-bit + AWQ | KLD ~1.5–2.9, AWQ via layernorm |
| linear_attn.in_proj_qkv/z | 6-bit + AWQ | KLD ~2.9, AWQ via layernorm |
| self_attn.o_proj | bf16 | NOT AWQ-correctable |
| linear_attn.out_proj | bf16 | KLD ~6.0, worst tensor |
| down_proj | 5-bit | "Slightly more sensitive" |
| gate_proj, up_proj | 4-bit | base bits |
| Router gates | 8-bit | MoE routing accuracy |
| GDN params (A_log, etc) | bf16 | State-space dynamics |
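To make the table easier to apply programmatically, here is a minimal sketch of a lookup that mirrors it for base bits N=4. The tensor-name patterns (for example `mlp.gate.` for router gates) and the `precisionFor` helper are assumptions for illustration; this is not the recipe code that mlx-node actually runs.

```typescript
// Illustrative only: mirrors the table above for base bits N=4.
// Tensor-name patterns are assumptions, not mlx-node's actual recipe code.
type Precision =
  | { kind: 'quantized'; bits: number; awq: boolean }
  | { kind: 'bf16' };

const q = (bits: number, awq = false): Precision => ({ kind: 'quantized', bits, awq });
const BF16: Precision = { kind: 'bf16' };

// Order matters: the first matching rule wins.
const RULES: Array<[RegExp, Precision]> = [
  [/embed_tokens/, q(6)],
  [/lm_head/, q(8)],
  [/self_attn\.[qkv]_proj/, q(6, true)],
  [/linear_attn\.in_proj_(qkv|z)/, q(6, true)],
  [/self_attn\.o_proj/, BF16],
  [/linear_attn\.out_proj/, BF16],
  [/A_log/, BF16], // one GDN parameter; the others are omitted here
  [/down_proj/, q(5)],
  [/(gate_proj|up_proj)/, q(4)],
  [/mlp\.gate\./, q(8)], // assumed naming for MoE router gates
];

// Returns the precision for a tensor name, or undefined if the table does not cover it.
function precisionFor(tensorName: string): Precision | undefined {
  for (const [pattern, precision] of RULES) {
    if (pattern.test(tensorName)) return precision;
  }
  return undefined;
}

console.log(precisionFor('model.layers.3.self_attn.q_proj.weight'));
console.log(precisionFor('model.layers.0.linear_attn.out_proj.weight'));
```

Keeping the rules in an ordered list makes the precedence explicit, which matters when names overlap (`gate_proj` versus the router `gate`).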
Quantization Strategy
Based on Unsloth Dynamic 2.0 per-tensor KLD analysis. Sensitive layers get higher bits with AWQ correction, while the bulk of FFN expert weights are aggressively quantized. imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead).
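AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 6-bit, with their AWQ scales fused into the preceding input_layernorm. Non-AWQ-correctable projections (o_proj, out_proj) are kept at bf16 because their inputs come from the attention/GDN computation rather than from a norm layer, so there is no preceding norm to absorb the inverse scales.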
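The fusion claim can be checked numerically. The sketch below assumes an RMSNorm followed by a linear projection, scales each input channel of the weight by a hypothetical AWQ scale, and divides the norm gain by the same scale. The values, shapes, and helper names are made up for illustration, and the quantization step that would sit between scaling and inference is left out.

```typescript
// RMSNorm with an elementwise gain: x * gain / sqrt(mean(x^2) + eps).
function rmsNorm(x: number[], gain: number[], eps = 1e-6): number[] {
  const meanSq = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * inv * gain[i]);
}

// y = W h, with W stored as an array of rows.
function matVec(w: number[][], h: number[]): number[] {
  return w.map((row) => row.reduce((acc, wij, j) => acc + wij * h[j], 0));
}

// Multiply input channel (column) j of W by s[j] and divide the norm gain by s[j].
function fuseAwqScales(w: number[][], gain: number[], s: number[]) {
  return {
    w: w.map((row) => row.map((wij, j) => wij * s[j])),
    gain: gain.map((g, j) => g / s[j]),
  };
}

const x = [0.5, -1.0, 2.0];
const gain = [1.0, 0.8, 1.2];
const w = [
  [0.1, -0.2, 0.3],
  [0.4, 0.5, -0.6],
];
const scales = [2.0, 0.5, 4.0]; // hypothetical per-channel AWQ scales

const before = matVec(w, rmsNorm(x, gain));
const fused = fuseAwqScales(w, gain, scales);
const after = matVec(fused.w, rmsNorm(x, fused.gain));

// The two outputs should agree up to floating-point error.
const maxDiff = Math.max(...before.map((v, i) => Math.abs(v - after[i])));
console.log({ before, after, maxDiff });
```

Because the scale and its inverse cancel inside the product, the unquantized output is unchanged and no extra operation is needed at inference time. The same trick is unavailable for o_proj and out_proj, whose inputs have no learnable per-channel gain immediately in front of them.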
Architecture
| Parameter | Value |
|---|---|
| Total parameters | 35.9B (3B active per token) |
| Hidden size | 2,048 |
| Layers | 40 (30 linear + 10 full attention) |
| Attention heads | 16 (2 KV heads, GQA 8:1) |
| Head dimension | 256 |
| Experts | 256 per MoE layer, top-8 routing |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
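As a worked example of what these parameters imply, the sketch below estimates the KV-cache footprint. It assumes that only the 10 full-attention layers keep a per-token KV cache (the linear-attention layers are assumed to hold a fixed-size state instead) and that the cache is stored in a 16-bit format; neither assumption is confirmed by this card.

```typescript
// Values from the architecture table; cache dtype and layout are assumptions.
const FULL_ATTENTION_LAYERS = 10;
const KV_HEADS = 2;
const HEAD_DIM = 256;
const BYTES_PER_ELEMENT = 2; // assumes a bf16/fp16 KV cache
const MAX_CONTEXT = 262144;

// K and V each store KV_HEADS * HEAD_DIM elements per token per full-attention layer.
function kvCacheBytes(tokens: number): number {
  return 2 * FULL_ATTENTION_LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT * tokens;
}

const GIB = 1024 ** 3;
console.log(`per token: ${kvCacheBytes(1)} bytes`);
console.log(`full context: ${kvCacheBytes(MAX_CONTEXT) / GIB} GiB`);
```

Under those assumptions the cache costs 20 KiB per token and 5 GiB at the full 262,144-token context, which is worth budgeting for on top of the 22 GB of weights.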
Usage
import { loadSession } from '@mlx-node/lm';

// Load the quantized model from a local directory.
const session = await loadSession('./Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx');

// Stream the reply and print each text chunk as it arrives.
for await (const event of session.sendStream('Explain the hybrid attention mechanism in Qwen3.6.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}
How It Was Made
mlx convert \
  -i Qwen3.6-35B-A3B \
  -o Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx \
  -q --q-bits 4 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
Acknowledgments
- Unsloth — Quantization strategy based on their per-layer KLD benchmarks and Dynamic 2.0 methodology
- Qwen Team — For the Qwen3.6 model family
- Apple MLX — For the Metal-accelerated ML framework
License
Apache 2.0 (inherited from base model).