My RTX 3090 ran out of excuses: Qwen3.6-35B-A3B
I've been testing every model that's come out since GPT-2. No academic benchmarks, no MMLU, no HumanEval. My benchmark is the 24GB of VRAM on my RTX 3090 and the real tasks I need to solve day to day as a data scientist. Every new model that dropped, I'd download it, run it, throw my tasks at it, and always end up with the same feeling: cool, but not enough. Years of this.
Qwen3.6-35B-A3B with Unsloth's Q3 quant takes 23GB of VRAM, runs at 120 tok/s, and is the first one that saturated my benchmark. I have about ten different skills I throw at every model I test. Full Power BI dashboards using Microsoft's MCP server with a custom piece of mine for chart generation: nailed it. Causal inference tasks: nailed it. Interactive benchmarks where it has to iterate on what it sees on screen: nailed it. Multi-step web search with cross-constraints: nailed it. I've been running it for three days through OpenCode, and so far it hasn't let me down on anything I've thrown at it. Too early to call it a daily driver, but the first impression is stronger than anything I've tested locally before.
To be clear, it's not magic. On several tasks I had to reinforce prompts, adapt my skills to its level of comprehension, and build tools to cover gaps that models like Claude Opus solve one-shot without blinking. But that's exactly what's interesting: the distance between "needs adaptation" and "can't do it" is massive, and this model is firmly on the right side of that line. It's the first time with an open-source model that I feel the bottleneck is me writing better prompts rather than the model failing to understand what I'm asking. After years of testing everything that came out and enduring the frustration of models that promised a lot in papers and delivered little in the terminal, getting to this point running offline on my desk feels like a point of no return.
Kudos to the Qwen team and Unsloth for making this happen on consumer hardware. My llama.cpp config for anyone wanting to replicate:
llama-server
--model Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
-ngl 999 -fa on --no-mmap
-c 262144 -n 32768 --no-context-shift
--jinja --reasoning-format deepseek --reasoning-budget 4096
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
--presence-penalty 0.0
--cache-type-k bf16 --cache-type-v bf16
--port 8181
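Once it's up, llama-server speaks an OpenAI-compatible HTTP API, so a quick smoke test from another terminal looks like this (port 8181 matches the flags above; the prompt is just an example):

```shell
# Check the server is alive (llama-server exposes a plain /health endpoint)
curl -s http://localhost:8181/health

# Send a minimal chat completion request to the OpenAI-compatible endpoint
curl -s http://localhost:8181/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}'
```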
May I ask why you're not using a Q4 quant with the KV cache quantization set to q8_0? In my experience, q8 KV cache quantization doesn't bring any quality loss but saves quite a lot of memory.
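For a back-of-the-envelope sense of the saving, here's a rough estimate of KV cache size at f16 vs q8_0 over the full 262k context. The model dimensions below are illustrative placeholders, not the real Qwen3.6-35B-A3B numbers; q8_0 stores blocks of 32 int8 values plus one f16 scale, i.e. 34 bytes per 32 elements:

```shell
# Illustrative dims only -- substitute the real layer/head counts for your model
layers=48
kv_heads=4
head_dim=128
ctx=262144

# f16: 2 bytes per element; factor 2 for K and V caches
f16_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * 2 ))
# q8_0: 34 bytes per block of 32 elements
q8_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * 34 / 32 ))

echo "f16  KV cache: $(( f16_bytes / 1024 / 1024 )) MiB"
echo "q8_0 KV cache: $(( q8_bytes / 1024 / 1024 )) MiB"
```

With these placeholder dims the q8_0 cache comes out at roughly 53% of the f16 size, which is where the "quite a lot of memory" comes from at long contexts.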
My setup is q4_k_m and rotorquant with planar3/turbo3. 262k context at q4 + OpenCode. Speed is not the best there is, but it's there:
n_tokens = 128963
prompt eval time = 1694.57 ms / 1955 tokens ( 0.87 ms per token, 1153.68 tokens per second)
eval time = 229320.74 ms / 7501 tokens ( 30.57 ms per token, 32.71 tokens per second)
total time = 231015.32 ms / 9456 tokens
Another 3090 user here:
llama-server
--min-p 0.0
--jinja
--chat-template-file /opt/models/Qwen3.6-35B-A3B-heretic/chat_template.jinja
--cache-type-k turbo4
--cache-type-v turbo4
--threads 16
--flash-attn on
--model /opt/models/Qwen3.6-35B-A3B-heretic/Qwen3.6-35B-A3B-heretic.IQ4_NL.gguf
--ctx-size 262144
--n-gpu-layers 99
--temp 0.6
--top-p 0.95
--top-k 20
--repeat-penalty 1.0
--repeat-last-n 256
--perf
For normal "chatting", I like the big dense Gemma 4 better. But Qwen3.6 seems to work better for agentic use.
BTW: I had a lot of deadlocks with hermes-agent on Qwen3.6. I had to set config.memory.nudge_interval=0 and config.memory.flush_min_turns=0 to fix it.
https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks
Using UD-Q5_K_XL (on a 3090 too, 131K context, ~75 t/s @10K, ~65 t/s @120K), I feel exactly the same! A bit more user-prompt explicitness is required sometimes, but once it's on its rails, it goes to the final destination!
Just wanted to add that --chat-template-kwargs '{"preserve_thinking":true}' has been beneficial for me when it comes to autonomous agentic tasks (with --temp 0.6 and no presence penalty), give it a try!
Congrats to the team who designed the training, really great job.
Also, as a tip: when using llama.cpp with MoE models, you can go with much higher quants (Q5_XL is 26.6GB). The trick is to not set --n-gpu-layers (as you of course would for a dense model for max speed) and to let llama.cpp do its own offloading optimization instead. Just set your --ctx-size and that's all. You will get good speed even when the GGUF size exceeds the VRAM size.
PS: the llama-server log will still say "offloaded 41/41 layers to GPU", which can be confusing, I know, don't ask me.
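As a concrete sketch of that tip, the invocation reduces to something like this (the GGUF filename is just the Q5_XL one mentioned in this thread; substitute your own path):

```shell
# MoE tip in practice: no --n-gpu-layers flag, so llama.cpp
# decides the CPU/GPU split itself; only the context size is pinned.
llama-server \
  --model Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
  --ctx-size 262144 \
  --port 8181
```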
About the new KV cache performance: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150833469
As @ulymp mentioned, q8 has now greatly improved: AIME25 eval https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357