Mixed Precision GGUF layer quantization of gemma-4-26B-A4B-it by Google

Original model: https://huggingface.co/google/gemma-4-26B-A4B-it

This hybrid quant applies different quantization levels on a per-layer basis to achieve both high performance and small file size at the same time. All of the quants employed are K-quants, avoiding the slow CPU or older-GPU processing associated with IQ quants. For this file the layer quants are as follows:

Q4_K_L : Q4_K_M + attn_o = q6_k
Q5_K_L : attn_v = q8_0 attn_o = q6_k ffn_d = q6_k
Q6_K_S : Q6_K

   LAYER_TYPES='[
   [0 ,"Q5_K_M"],[1 ,"Q4_K_M"],[2 ,"Q4_K_S"],[3 ,"Q3_K_L"],[4 ,"Q4_K_S"],[5 ,"Q3_K_L"],[6 ,"Q4_K_S"],[7 ,"Q3_K_L"],
   [8 ,"Q4_K_S"],[9 ,"Q3_K_L"],[10,"Q4_K_S"],[11,"Q3_K_L"],[12,"Q4_K_S"],[13,"Q4_K_S"],[14,"Q4_K_S"],[15,"Q4_K_S"],
   [16,"Q4_K_M"],[17,"Q4_K_S"],[18,"Q4_K_M"],[19,"Q4_K_S"],[20,"Q4_K_M"],[21,"Q4_K_M"],[22,"Q4_K_M"],[23,"Q4_K_M"],
   [24,"Q4_K_M"],[25,"Q4_K_L"],[26,"Q5_K_S"],[27,"Q5_K_M"],[28,"Q5_K_L"],[29,"Q6_K_S"]
   ]'
   FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high"
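As a quick sanity check, the per-layer map above is valid JSON and can be parsed and summarized (a Python sketch; the string is the Q4_K_H LAYER_TYPES from above):

```python
import json
from collections import Counter

# Q4_K_H per-layer quant map (copied verbatim from the recipe above).
LAYER_TYPES = '''[
[0 ,"Q5_K_M"],[1 ,"Q4_K_M"],[2 ,"Q4_K_S"],[3 ,"Q3_K_L"],[4 ,"Q4_K_S"],[5 ,"Q3_K_L"],[6 ,"Q4_K_S"],[7 ,"Q3_K_L"],
[8 ,"Q4_K_S"],[9 ,"Q3_K_L"],[10,"Q4_K_S"],[11,"Q3_K_L"],[12,"Q4_K_S"],[13,"Q4_K_S"],[14,"Q4_K_S"],[15,"Q4_K_S"],
[16,"Q4_K_M"],[17,"Q4_K_S"],[18,"Q4_K_M"],[19,"Q4_K_S"],[20,"Q4_K_M"],[21,"Q4_K_M"],[22,"Q4_K_M"],[23,"Q4_K_M"],
[24,"Q4_K_M"],[25,"Q4_K_L"],[26,"Q5_K_S"],[27,"Q5_K_M"],[28,"Q5_K_L"],[29,"Q6_K_S"]
]'''

layers = json.loads(LAYER_TYPES)
assert [i for i, _ in layers] == list(range(30))  # layers 0..29, no gaps

# Count how many layers use each quant level.
counts = Counter(q for _, q in layers)
for quant, n in counts.most_common():
    print(f"{quant}: {n} layers")
```

The summary makes the shape of the recipe visible: the bulk of the layers sit at Q4_K_S/Q4_K_M, with Q3_K_L confined to a few early layers and the quant level rising toward the final layers.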

The quant was tested and showed very strong performance over a small set of curated reasoning prompts. It is sized slightly smaller than Q4_K_M, with a minimum per-layer quant of Q3_K_L. It solved almost the entire eval set correctly without using a thinking block, performing noticeably better than the 31B dense model on some problems. It did miss one IQ test prompt without thinking but solved it with the think block turned on.

A slightly larger Q6_K_H quant is also available:

Q6_K_M : attn_v = q8_0 ffn_d = q8_0
Q6_K_L : attn_v = q8_0 attn_o = q8_0 ffn_d = q8_0

   LAYER_TYPES='[
   [0 ,"Q6_K_S"],[1 ,"Q5_K_M"],[2 ,"Q4_K_M"],[3 ,"Q4_K_S"],[4 ,"Q4_K_M"],[5 ,"Q4_K_S"],[6 ,"Q4_K_M"],[7 ,"Q4_K_S"],
   [8 ,"Q4_K_M"],[9 ,"Q4_K_S"],[10,"Q4_K_M"],[11,"Q4_K_S"],[12,"Q4_K_M"],[13,"Q4_K_S"],[14,"Q4_K_M"],[15,"Q4_K_S"],
   [16,"Q4_K_M"],[17,"Q4_K_M"],[18,"Q4_K_M"],[19,"Q4_K_M"],[20,"Q4_K_M"],[21,"Q4_K_L"],[22,"Q4_K_M"],[23,"Q4_K_L"],
   [24,"Q5_K_S"],[25,"Q5_K_M"],[26,"Q5_K_L"],[27,"Q6_K_S"],[28,"Q6_K_M"],[29,"Q6_K_L"]
   ]'
   FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"

This quant uses a minimum of Q4_K_S across layers, Q6_K embeddings, and a strong Q6_K_L final layer. It efficiently aced the entire eval prompt set with the think block turned off (except for one trick test question that most models miss, which it gets with think enabled). It also does a better job with code, and at roughly 5 GB smaller than Q6_K it leaves room for the largest possible context when fully offloaded into 24 GB of VRAM.

Comparison:

Quant    Size       PPL    Comment
Q4_K_M   16.8e9 B   15.1   modified PPL, see discussion below
Q4_K_H   16.1e9 B   14.8   modified PPL, 0.7 GB smaller than Q4_K_M
Q6_K     22.6e9 B   13.7   modified PPL
Q6_K_H   17.5e9 B   14.2   modified PPL, 5.1 GB smaller than Q6_K
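The size deltas in the comparison can be reproduced directly from the listed sizes (a quick check in Python; sizes are in units of 1e9 bytes as in the table):

```python
# File sizes in units of 1e9 bytes, as listed in the comparison table above.
sizes = {"Q4_K_M": 16.8, "Q4_K_H": 16.1, "Q6_K": 22.6, "Q6_K_H": 17.5}

# Savings of each hybrid quant relative to its standard counterpart.
q4_saving = round(sizes["Q4_K_M"] - sizes["Q4_K_H"], 1)
q6_saving = round(sizes["Q6_K"] - sizes["Q6_K_H"], 1)
q6_pct = round(100 * q6_saving / sizes["Q6_K"], 1)

print(f"Q4_K_H saves {q4_saving}e9 B vs Q4_K_M")
print(f"Q6_K_H saves {q6_saving}e9 B vs Q6_K ({q6_pct}% smaller)")
```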

Usage:

gemma-4-26B-A4B-it is a vision-capable MoE RL model. Used together with its multimedia projector layers, it can process image and text inputs and generate text outputs. The mmproj file is available in this repository.

Thinking:

By default the model will not create an RL reasoning block and just outputs

<|channel>thought
<channel|>

at the start of generation. To get it to fill in the think block, use a system prompt with:

<|think|>

as the first token. This is a special token in the model vocab and must be tokenized as such to work. No text other than the think token is needed in the system prompt to get the model to fill in the RL block, though other text can be added if desired.
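As a sketch, a chat request whose system prompt is just the think token might look like this (Python; the model name shown and whether your server parses `<|think|>` out of message text as a special token are assumptions about the serving setup):

```python
import json

# Sketch: OpenAI-style chat payload with the think token as the entire system
# prompt. The serving stack must tokenize <|think|> as a special token for
# this to trigger the think block.
payload = {
    "model": "gemma-4-26B-A4B-it",
    "messages": [
        {"role": "system", "content": "<|think|>"},   # think token first
        {"role": "user", "content": "What is 17 * 23?"},
    ],
}
body = json.dumps(payload)
print(body)
```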

The model was found to be highly capable on reasoning tasks even when skipping the think block. However, on hard or trick questions it can simply improvise a bogus response in non-think mode; turning on the RL block normally fixed this on the test eval prompts used.

Running:

Use of speculation with this model is not recommended. As an MoE with 4B activated parameters it can be run efficiently on CPU, but a CPU does not have enough parallel hardware to process many tokens in a batch at once, so the generation rate will go down if speculation is used with experts processed on CPU. If the model is run fully on GPU, speculation can be used. A recommended low-overhead speculator is gemma-3-270m-it-256k; to use it, the inference platform must support dynamic vocab translation between draft and target. Only very low performance gains were found with speculation, possibly due to some interaction with the MoE experts that the dense draft cannot predict reliably.

The model can be run fully offloaded into 24 GB of VRAM, or with CPU expert-layer offload via the config OT="-ot exps=CPU -ngl 99" (recommended). Because the model is a 4B-active MoE, CPU expert offload still gives a good generation rate with very large context available; however, prompt processing will be very slow for large contexts with experts offloaded to CPU.
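A minimal launch sketch for the recommended CPU expert offload (the model path and context size here are illustrative; `-ot`/`--override-tensor`, `-ngl`, and `-c` are standard llama.cpp flags):

```shell
# Keep expert tensors on CPU, offload everything else to GPU.
llama-server -m gemma-4-26B-A4B-it.Q4_K_H.gguf \
    -ngl 99 \
    -ot exps=CPU \
    -c 65536
```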

On a 9900k/4070 or 2x 4070 setup (1 RPC) approx performance for the Q4_K_H quant is:

CONFIG (no vision tower)           QKV    NKV     gen tps   pp tps (batch 128)
4070 + 9900k CPU expert offload    F16    256k    23        ~100
2x4070 (RPC)                       F16    128k+   77        ~1000
2x4070 (RPC) + spec, ND=2          F16    128k+   84
2x4070 (RPC)                       Q8_0   256k    76
2x4070 (RPC) + spec, ND=2          Q8_0   256k    79

The Q4_K_H model passed two long-context tests with Q8_0 QKV running on 2x4070, showing usable prompt-processing speed for 100k+ token prompts. It impressively handles a 106k-token prompt (https://thireus.com/REDDIT/Qwen3_Runescape_Massive_Prompt.txt) very efficiently:

lm Qwen3_Runescape_Massive_Prompt.txt 
<|channel>thought
<channel|>Based on the **Skills/Experience Table** provided in the text, the maximum experience level is **99**, which requires **13,034,431** experience.

Half of that experience is **6,517,215.5**.

Looking at the progression in the table, the experience required to reach level 99 is significantly higher than the experience required for any level below it. For example, level 98 requires 11,805,606 experience. Since 6,517,215.5 is much higher than the experience required for level 90 (5,346,332) but lower than the experience required for level 91 (5,902,831) and level 92 (6,517,253), you reach the halfway point of total experience at **level 92**.

This prompt is old enough that it is most likely in the training set for Gemma 4, but the model still generates a very efficient solution.
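The model's arithmetic can be cross-checked against the widely documented RuneScape experience formula (using that formula is an assumption here; the table in the prompt is the actual ground truth):

```python
import math

def xp_for_level(level: int) -> int:
    # Total XP required to reach `level`: floor of one quarter of the sum
    # of floor(n + 300 * 2**(n / 7)) over n = 1 .. level - 1.
    total = sum(math.floor(n + 300 * 2 ** (n / 7)) for n in range(1, level))
    return total // 4

print(xp_for_level(99))        # total XP at the level cap
print(xp_for_level(99) / 2)    # the halfway point
print(xp_for_level(92))        # XP required for level 92, just above halfway
```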

Vision:

The model was tested in vision mode on a couple of fairly tough bird-ID images and found to exhibit poor performance in both think and non-think mode, though it did at least consider the correct bird on one of the test images. For comparison, gemma3 27B went 1 for 2, and Qwen3 27B completely aces these tough ID tests (quite blurry images of a small bird). The model did a great job on some text-based image prompts, though.

Code:

The Q4_K_H model was tested across a small set of code-gen prompts and found to be OK at generating working code, though it failed about half of the test cases. The Q6_K_H model does a better job with code and performs about the same as Q6_K, which still does not do a good job on one of the test prompts.

Llama.cpp inference/issues:

The minimum llama.cpp version to run this model should be b8648 or above, due to the correction of the Gemma 4 tokenizer.

Valid perplexity cannot be computed for this model because the instruct tune forces it to generate

<|channel>thought

as the assistant generation, independent of the previous prompt contents. To work around this problem, a modified perplexity is computed by overwriting the beginning of each perplexity chunk with the forced assistant generation, as follows:

      # chunk is a string of text to eval perplexity on
      # $'...' quoting makes the \n escapes real newlines in bash
      injects=$'model\n<|channel>thought\n<channel|>'
      chunk="${injects}${chunk:${#injects}}"

Logprobs over the beginning part of each perplexity chunk are skipped, using a modified downstream llama.cpp server to compute perplexity. Discussion at: https://github.com/ggml-org/llama.cpp/issues/21388
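The same overwrite expressed in Python (a sketch; the marker string is the forced assistant prefix used above):

```python
# Forced assistant prefix injected at the start of each perplexity chunk.
INJECT = "model\n<|channel>thought\n<channel|>"

def patch_chunk(chunk: str) -> str:
    # Overwrite the beginning of the chunk, preserving its overall length
    # so the evaluated token budget stays roughly the same.
    return INJECT + chunk[len(INJECT):]

patched = patch_chunk("The quick brown fox " * 10)
print(patched[:40])
```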

Benchmarks:

A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

Download the file from below:

Link                             Type     Size/e9 B   Notes
gemma-4-26B-A4B-it.Q4_K_H.gguf   Q4_K_H   16.1e9 B    0.7 GB smaller than Q4_K_M
gemma-4-26B-A4B-it.Q6_K_H.gguf   Q6_K_H   17.5e9 B    5.1 GB smaller than Q6_K
gemma-4-26B-A4B-it.mmproj.gguf   F16      1.2e9 B     multimedia projector

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
