Osaurus AI

Qwen 3.6 35B-A3B — JANGTQ2 (MLX)

TurboQuant codebook quantization of Alibaba's hybrid linear/full-attention agentic MoE: routed experts at 2-bit via Lloyd-Max codebooks plus Hadamard rotation; attention, embeddings, shared expert, and lm_head at 8-bit affine; vision tower preserved in fp16.

Website: osaurus.ai


Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Parameters (source) | 35 B total, ~3 B active per token |
| Architecture | `qwen3_5_moe` — 40 decoder layers: 30 Gated DeltaNet (linear attention) + 10 full attention; 256 routed experts + 1 always-on shared expert |
| Quantization format | `weight_format: mxtq` — routed experts via TurboQuant codebook (2-bit); everything else affine 8-bit or fp16 passthrough |
| Routed-expert storage | `.tq_packed` (uint32) + `.tq_norms` (fp16) + `.tq_bits` (uint8); codebook + Hadamard signs re-derived deterministically at load |
| Package size on disk | 11.63 GB across 12 shards |
| Shipped tensors | 1,930 total (1,597 language-model + 333 vision tower + 120 routed-expert TQ triples) |
| Vocab | 248,320 |
| Context (position embeddings) | 262,144 native; the upstream model card reports up to ~1 M with YaRN scaling |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), preserved in fp16 |
| Chat format | Qwen `im_start`/`im_end`, unified thinking toggle |
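The exact bit layout of `.tq_packed` is not documented here, but the arithmetic is straightforward: sixteen 2-bit codebook indices fit in one uint32 word. A hypothetical pack/unpack sketch (the shipped layout may differ):

```python
import numpy as np

def pack_2bit(indices: np.ndarray) -> np.ndarray:
    """Pack 2-bit codebook indices (values 0-3) into uint32 words, 16 per word."""
    assert indices.size % 16 == 0
    idx = indices.astype(np.uint32).reshape(-1, 16)
    shifts = np.arange(16, dtype=np.uint32) * 2   # bit offset of each index
    return (idx << shifts).sum(axis=1, dtype=np.uint32)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Recover the sixteen 2-bit indices stored in each uint32 word."""
    shifts = np.arange(16, dtype=np.uint32) * 2
    return ((packed[:, None] >> shifts) & 0x3).reshape(-1).astype(np.uint8)

rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=64, dtype=np.uint8)
assert np.array_equal(unpack_2bit(pack_2bit(idx)), idx)   # lossless round-trip
```

This is why the packed tensors ship as uint32: a 2-bit index stream compresses 16:1 against one word per index, and the per-row fp16 norms plus the uint8 bit-width tags travel alongside.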

Quantization details, per tensor category

| Category | Bits | Group / codebook | Notes |
|---|---|---|---|
| Routed-expert MLP (`mlp.experts.gate_up_proj`, `down_proj`) | 2 (JANGTQ) | 2² Lloyd-Max centroids + Hadamard rotation | `.tq_packed` + `.tq_norms` + `.tq_bits` triples |
| Embedding (`embed_tokens`), `lm_head` | 8 (affine) | group 64 | MLX-native `QuantizedLinear` |
| Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | 8 (affine) | group 64 | Gate-doubled `q_proj` for `attn_output_gate` |
| Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`) | 8 (affine) | group 64 | Gated DeltaNet |
| Shared-expert MLP (`gate_proj`, `up_proj`, `down_proj`) | 8 (affine) | group 64 | Always active per token |
| Router (`mlp.gate`) | fp16 | passthrough | Precision-critical |
| Shared-expert gate (`shared_expert_gate`) | fp16 | passthrough | Sigmoid scalar gate |
| Norms (`*_layernorm`, `*_norm`), `A_log`, `dt_bias`, `conv1d` | fp16 | passthrough | Unquantized |
| Vision tower (333 tensors) | fp16 | passthrough | `patch_embed.proj` axes pre-transposed to MLX layout |
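For the 8-bit affine rows above, a minimal numpy sketch of what group-64 affine quantization means conceptually (per-group scale and offset plus 8-bit codes; the actual MLX `QuantizedLinear` storage layout differs):

```python
import numpy as np

def affine_quantize(w: np.ndarray, group: int = 64, bits: int = 8):
    """Quantize each contiguous group of `group` weights to `bits`-bit codes
    with a per-group affine map: w ~= code * scale + offset."""
    g = w.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0                      # guard constant groups
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)
q, s, off = affine_quantize(w)
err = np.abs(affine_dequantize(q, s, off, w.shape) - w).max()
# with 256 levels per 64-weight group, the worst-case error is a small
# fraction of the group's dynamic range
```

The small group size is the point: each run of 64 weights gets its own scale/offset, so outliers in one group cannot inflate the quantization step for the rest of the row.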

JANGTQ ("TurboQuant") stores routed-expert weights as indices into a small Lloyd-Max codebook with a per-row norm, after a randomized Hadamard rotation that concentrates the weight distribution so quantization error is spread uniformly. At inference, the input is rotated once per layer (a cheap fused Metal kernel) and dot products are taken against the codebook centroids directly, so the weights are never dequantized back to an affine representation. At the same bit budget, this gives both better quality and faster decode than affine 2-bit on the routed-expert MLP path.


Usage

JANGTQ requires our custom loader — stock mlx_lm.load() can't parse .tq_packed tensors. You need jang-tools (free, public): https://github.com/jjang-ai/jangq.

```shell
pip install mlx mlx-lm mlx-vlm
git clone https://github.com/jjang-ai/jangq && pip install -e ./jangq/jang-tools
```

Text

```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("OsaurusAI/Qwen3.6-35B-A3B-JANGTQ2")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=64))
```

Image (VLM)

```python
from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Qwen3.6-35B-A3B-JANGTQ2"
model, processor = load_jangtq_vlm_model(path)
config = load_config(path)

prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
print(generate(model, processor, prompt, image="path/to/image.jpg", max_tokens=200))
```

Reasoning toggle

```python
msgs = [{"role": "user", "content": "What is 17 × 23?"}]
# Reasoning OFF — pre-closed <think></think> block
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                       enable_thinking=False)
# Reasoning ON — model fills the <think> block
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                       enable_thinking=True)
```

Pass `enable_thinking` as a direct kwarg; the `chat_template_kwargs={...}` form only propagates on some tokenizer versions.
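In Qwen3-family chat templates (assumed to carry over to this model), the two modes differ only in the tail of the rendered prompt: with thinking disabled, the template emits an already-closed think block so the model skips straight to the answer. A string-level sketch of that behavior:

```python
# Illustration only: approximate tail of the rendered prompt in each mode,
# modeled on the Qwen3 chat template (assumed unchanged for this model).
def render_tail(enable_thinking: bool) -> str:
    tail = "<|im_start|>assistant\n"
    if not enable_thinking:
        tail += "<think>\n\n</think>\n\n"   # pre-closed block: no reasoning emitted
    return tail

assert "<think>" not in render_tail(True)            # model opens the block itself
assert render_tail(False).endswith("</think>\n\n")   # block pre-closed for the model
```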

Video

The base model supports video via transformers, and the bundle preserves `video_preprocessor_config.json`. However, mlx-vlm 0.4.4's `prepare_inputs` has no video path for `qwen3_5_moe` yet; the Python `load_jangtq_vlm` path wraps video via a custom processor for our test harness. Mainline mlx-vlm users should stick to image input and use upstream transformers for video.


Hardware notes

~12 GB on disk; expect ~12–14 GB resident after load, plus KV cache.
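The ~12 GB figure is consistent with back-of-envelope arithmetic over the bit widths listed earlier. The parameter split below is an assumption for illustration, not read from the shipped shards:

```python
# Rough on-disk size estimate. The three bucket sizes are ASSUMED splits of the
# 35 B total for illustration; actual shard contents will differ somewhat.
GiB = 1024**3
routed_expert_params = 31.5e9   # assumed: routed experts dominate the total
other_8bit_params    = 3.0e9    # attention, embeddings, shared experts, lm_head
fp16_params          = 0.5e9    # vision tower, norms, routers, gates

bytes_total = (routed_expert_params * 2 / 8   # 2-bit codebook indices
               + other_8bit_params * 1        # 8-bit affine codes
               + fp16_params * 2)             # fp16 passthrough
print(f"~{bytes_total / GiB:.1f} GiB before per-group scales and per-row norms")
```

The per-group scales/offsets of the 8-bit tensors and the fp16 row norms of the TQ triples add the remaining fraction of a gigabyte.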

| Mac unified RAM | Works? | Notes |
|---|---|---|
| 24 GB | ✅ comfortable | Full 32 k context OK |
| 32 GB | | 32–100 k context depending on profile |
| 24 GB | | text-only, short context |

Benchmarks

Base-model reference (Qwen/Qwen3.6-35B-A3B, upstream, not this quant):

| MMLU-Pro | AIME 2026 | LiveCodeBench v6 | GPQA | SWE-bench Verified |
|---|---|---|---|---|
| 85.2 | 92.7 | 80.4 | 86.0 | 73.4 |

Independent JANGTQ-quant evaluation is tracked in the jang-tools repo and will land in future README revisions.


Citation

```bibtex
@misc{qwen2026qwen36,
  title  = {Qwen3.6-Plus: Towards Real World Agents},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.6}
}
```

License

Apache 2.0 — inherits from the base model.


Packaged on Apple Silicon with jang-tools (mlx-lm 0.31.2).
© 2026 Osaurus AI — osaurus.ai
