SuperGemma4 26B Abliterated Multimodal — NVFP4
NVFP4-quantized version of Jiunsong/supergemma4-26b-abliterated-multimodal — an abliterated (uncensored) Gemma 4 26B Mixture-of-Experts multimodal model with thinking/reasoning capabilities.
Quantized using NVIDIA ModelOpt 0.43 (main) with NVFP4_DEFAULT_CFG on a native Blackwell GPU. Vision encoder preserved in full BF16. Peak aggregate throughput: 1,890 tok/s @ 256 concurrent on DGX Spark (GB10).
Verified end-to-end: calibrated → exported → served on Spark → benchmarked 1-256 concurrency.
⚠️ IMPORTANT REQUIREMENTS — READ THIS FIRST
This model has non-obvious serving requirements because its per-expert-decomposed NVFP4 scale format needs specific plugin handling. Deviating from these will produce garbage output or crashes. Details below — each requirement is backed by hours of debugging.
🔴 MUST-DO requirements
| # | Requirement | Why |
|---|---|---|
| 1 | Use the `-awq` container image: `ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest` | Has baked-in `modelopt.py` scale handling (PRs #1264, #1265) plus a patched `expert_params_mapping`. The non-`-awq` variant (or any stock vLLM) will crash or produce corrupted output. |
| 2 | Force Marlin MoE: set `VLLM_TEST_FORCE_FP8_MARLIN=1` and `VLLM_MARLIN_USE_ATOMIC_ADD=1` | FlashInfer NVFP4 MoE backends reject the 704-per-expert intermediate dim. Do NOT set `VLLM_NVFP4_GEMM_BACKEND=marlin` — that would also force Marlin on the linear path, where native FLASHINFER_CUTLASS is faster. Let linear auto-select; only MoE needs Marlin. |
| 3 | Mount both patches: `gemma4_patched.py` and `serving_chat_patched.py` (2 files, NOT 3) | `modelopt_patched.py` is baked into the image — mounting a stock version on top will corrupt the scale convention. Only mount the two listed here. |
| 4 | Use `--quantization modelopt` (not `compressed-tensors`) | This checkpoint uses the modelopt NVFP4 format. `compressed-tensors` looks for different key names and will fail to load. |
| 5 | Native Blackwell GPU required (SM 10.0+) | Ampere / Ada GPUs have no native FP4 compute path. Verified on GB10 (SM 12.0, DGX Spark) and RTX PRO 6000 Blackwell (SM 12.1); should work on B200/GB200 (SM 10.0). |
| 6 | Use vLLM 0.19.1rc1.dev110 or later with Blackwell-compiled FP4 kernels | Stock vLLM wheels don't compile FP4 kernels for SM 10/12. Use the pre-built container, or build from source with `TORCH_CUDA_ARCH_LIST="10.0;12.0;12.1"` and transformers 5.5+. |
✅ Verified-working config (use this verbatim)
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest  # NOT the non-awq variant
    environment:
      - VLLM_TEST_FORCE_FP8_MARLIN=1  # required for MoE
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
    volumes:
      # model + 2 patches ONLY — do not mount modelopt_patched.py (it's baked in)
      - ./model:/models/supergemma4
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
    command: >
      vllm serve /models/supergemma4
      --quantization modelopt
      --kv-cache-dtype fp8_e4m3
      --tensor-parallel-size 1
      --max-model-len 65536
      --max-num-seqs 4
      --gpu-memory-utilization 0.85
      --trust-remote-code
      --host 0.0.0.0 --port 8000
      --enable-chunked-prefill
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser gemma4
      --reasoning-parser gemma4
```
❌ Known FAILURE modes (things that DON'T work)
| What you might try | What happens | Fix |
|---|---|---|
| Use `vllm-spark-gemma4-nvfp4:latest` (non-awq image) | Model loads but produces "-Hello-Hello-" or empty content | Switch to the `-awq` variant |
| Drop `VLLM_TEST_FORCE_FP8_MARLIN=1` — let native FP4 MoE auto-select | `KeyError` during weight load OR corrupted output | Re-add the env var |
| Mount your own `modelopt_patched.py` on top | Scale values get double-inverted → garbage output | Remove the mount; use the image's baked version |
| Use `--quantization compressed-tensors` | `KeyError: 'weight_packed'` during load | Use `modelopt` |
| Use `--kv-cache-dtype fp8` instead of `fp8_e4m3` | Works, but slight accuracy drift on long context | Use `fp8_e4m3` as specified |
| Mount `eagle_patched.py` for spec decode | `AttributeError: image_token_index` | Gemma4 EAGLE3 not yet supported upstream; omit spec decode for now |
🐛 If you see gibberish output after following all of the above
- Verify the image: `docker inspect <container> | grep Image` should show `vllm-spark-gemma4-nvfp4-awq`
- Verify mounts: `docker inspect <container> --format '{{json .Mounts}}'` should show exactly 3 mounts (model + 2 patches)
- Verify backend selection in logs: `Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM` ✅ and `Using 'MARLIN' NvFp4 MoE backend` ✅
- Test with raw chat: `{"messages": [{"role": "user", "content": "Capital of France? One sentence."}]}` should return "The capital of France is Paris." If not, check the container logs for crashes or `UNEXPECTED` tensor warnings at load time.
Performance Benchmarks
NVIDIA DGX Spark (GB10, SM 12.0, 128 GB unified memory) — vLLM 0.19.1rc1.dev110+gb55d830ec, FP8 E4M3 KV cache, native FlashInfer CUTLASS linear + Marlin MoE backend, --gpu-memory-utilization 0.85.
1. Single-Stream Performance (README-spec config)
--max-num-seqs 4, --max-model-len 65536, --gpu-memory-utilization 0.85. Best for interactive chat, agentic UX, single-user serving. All measurements greedy sampling (temp=0) unless noted.
Decode rate (10 trials, 200 tokens output)
| Statistic | tok/s |
|---|---|
| Median | 51.1 |
| P95 | 51.5 |
| Min | 50.9 |
| Max | 51.5 |
Extremely stable — ±0.5 tok/s variance across 10 trials.
TTFT by prompt length
Time from request to first token, across 5 trials each:
| Prompt class | Prompt tokens | TTFT median | TTFT p95 | TTFT min | Effective prefill |
|---|---|---|---|---|---|
| Tiny | 14 | 56 ms | 58 ms | 55 ms | 250 tok/s |
| Short | 19 | 49 ms | 61 ms | 48 ms | 386 tok/s |
| Medium | 49 | 45 ms | 46 ms | 44 ms | 1,093 tok/s |
| Long | 465 | 47 ms | 47 ms | 45 ms | 9,996 tok/s |
Even 465-token prompts give sub-50ms TTFT — fixed kernel-launch overhead dominates over prefill for anything < ~500 tokens.
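This fixed-overhead behavior can be sanity-checked with a back-of-envelope model (an illustration only; the constants below are rough reads of the tables in this README, not measurements):

```python
# Back-of-envelope TTFT model: a fixed launch/scheduling overhead plus
# prompt_tokens / prefill_rate. Both constants are assumptions taken
# loosely from the benchmark tables above.
FIXED_OVERHEAD_S = 0.045      # ~45 ms kernel-launch + scheduling floor
PREFILL_RATE = 194_000        # tok/s, peak prefill observed at 32K prompts

def estimated_ttft(prompt_tokens: int) -> float:
    """Estimated time-to-first-token in seconds."""
    return FIXED_OVERHEAD_S + prompt_tokens / PREFILL_RATE

for n in (14, 465):
    print(f"{n:>4} prompt tokens -> ~{estimated_ttft(n) * 1000:.0f} ms")
```

With these constants, a 465-token prompt adds only ~2.4 ms of prefill on top of the ~45 ms floor, which is why TTFT barely moves across the prompt classes above.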
Decode rate by output length
Longer outputs are slightly slower due to growing KV cache:
| Max tokens | Actual tokens | TTFT | Decode rate | Total latency |
|---|---|---|---|---|
| 50 | 50 | 49 ms | 51.9 tok/s | 1.01 s |
| 200 | 200 | 50 ms | 50.9 tok/s | 3.98 s |
| 500 | 500 | 49 ms | 50.7 tok/s | 9.90 s |
| 1000 | 558* | 61 ms | 50.6 tok/s | 11.10 s |
*Short because model hit EOS naturally before 1000 tokens.
Sampling: Greedy vs Stochastic
Temperature has negligible performance impact:
| Mode | Decode median | Decode p95 | TTFT median |
|---|---|---|---|
| Greedy (temp=0) | 51.9 tok/s | 52.1 tok/s | 48 ms |
| Stochastic (temp=0.7) | 51.0 tok/s | 51.2 tok/s | 51 ms |
Long-prompt prefill (RAG / document workloads)
Prefill throughput scales impressively with length — MoE's sparse compute is the perfect shape for prefill:
| Target prompt tokens | Actual | TTFT | Prefill rate | Decode rate (after prefill) |
|---|---|---|---|---|
| 1K | 809 | 0.06 s | 14,450 tok/s | 52.1 tok/s |
| 4K | 3,172 | 0.05 s | 66,851 tok/s | 51.9 tok/s |
| 16K | 12,625 | 0.09 s | 139,648 tok/s | 50.6 tok/s |
| 32K | 25,228 | 0.13 s | 194,082 tok/s | 48.3 tok/s |
Decode only drops 7% at 32K context — excellent KV-cache bandwidth behavior. Prefill peaks around 194K tok/s at 32K prompt length.
Summary
| Metric | Value |
|---|---|
| Single-stream decode (200-tok output) | 51.1 tok/s median |
| Short-prompt TTFT | 44-56 ms |
| 16K-prompt TTFT | 90 ms |
| 32K-prompt TTFT | 130 ms |
| Peak prefill throughput | 194K tok/s @ 32K prompt |
| Decode rate with 32K context | 48.3 tok/s (7% drop vs short context) |
This matches and exceeds the original v6 validation (52.6 tok/s / 54 ms TTFT).
2. Concurrent-Session Performance (max-throughput config)
--max-num-seqs 256, --max-model-len 2048, --max-num-batched-tokens 16384, --gpu-memory-utilization 0.85. Best for agent fleets, multi-user serving, batch inference. 3 trials per level with median reported. Mixed prompts (code, math, QA, creative), 200 token output, temp=0.7, SSE streaming.
Throughput scaling (N concurrent clients, 200-tok output)
| Concurrent | Err | Agg tok/s (median of 3) | Per-Req decode p50 | Per-Req decode min | TTFT p50 | TTFT p95 | TTFT max |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 35.9 | 50.8 | 50.8 | 64ms | 64ms | 64ms |
| 2 | 0 | 51.8 | 48.3 | 45.9 | 59ms | 59ms | 59ms |
| 4 | 0 | 88.6 | 39.0 | 36.4 | 70ms | 71ms | 71ms |
| 8 | 0 | 149.1 | 31.7 | 28.8 | 135ms | 135ms | 135ms |
| 16 | 0 | 269.6 | 24.9 | 23.1 | 149ms | 150ms | 150ms |
| 32 | 0 | 422.3 | 20.0 | 18.5 | 194ms | 195ms | 195ms |
| 64 | 0 | 711.5 | 16.6 | 15.6 | 284ms | 285ms | 286ms |
| 128 | 0 | 1,154.0 | 13.8 | 13.2 | 449ms | 545ms | 548ms |
| 256 | 0 | 1,775.9 | 10.7 | 6.5 | 851ms | 863ms | 864ms |
Zero errors across 1,200+ requests in the full test. Aggregate throughput scales nearly linearly up to 128 concurrent, with diminishing returns at 256 as scheduling and KV-cache contention dominate.
Note: single-stream here is 35.9 tok/s (vs 51.1 in README config) because max-num-seqs=256 forces allocation of 50+ CUDA graph sizes and different scheduling heuristics that optimize for batched throughput over single-stream latency. Use README config for chat; use this config for fleets.
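The scaling numbers can be recomputed directly from the aggregate medians in the table above (a sketch; ratios are rounded, so other sections of this card may quote slightly different figures):

```python
# Recompute throughput scaling from the aggregate tok/s medians above.
aggregate = {1: 35.9, 2: 51.8, 4: 88.6, 8: 149.1, 16: 269.6,
             32: 422.3, 64: 711.5, 128: 1154.0, 256: 1775.9}

base = aggregate[1]
for n, agg in aggregate.items():
    gain = agg / base        # speedup vs single stream
    efficiency = gain / n    # fraction of ideal linear scaling
    print(f"{n:>3} clients: {gain:5.1f}x gain, {efficiency:5.1%} of ideal")
```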
TTFT-only scaling (prefill + first token, 1-token output)
Measures how much queue contention affects time-to-first-token — critical for agent UX:
| Concurrent | TTFT p50 | TTFT p95 | TTFT max | TTFT min |
|---|---|---|---|---|
| 1 | 47ms | 48ms | 48ms | 46ms |
| 4 | 48ms | 88ms | 88ms | 46ms |
| 16 | 100ms | 102ms | 102ms | 55ms |
| 64 | 171ms | 174ms | 175ms | 91ms |
| 256 | 522ms | 527ms | 530ms | 167ms |
TTFT stays sub-200ms up through 64 concurrent — smooth UX for small agent fleets. Above 128 concurrent TTFT doubles per level as requests queue for scheduler capacity.
Concurrent with 1K-token prompts (RAG-style workload)
50-token output with 1,024-token prompts — simulates agents doing document QA or retrieval-augmented responses:
| Concurrent | Err | Agg tok/s | TTFT p50 | TTFT p95 | Decode p50 |
|---|---|---|---|---|---|
| 1 | 0 | 42.8 | 55ms | 55ms | 49.3 |
| 4 | 0 | 109.2 | 82ms | 103ms | 38.7 |
| 16 | 0 | 272.2 | 147ms | 147ms | 27.0 |
| 64 | 0 | 711.9 | 261ms | 293ms | 16.9 |
Long-prompt concurrent workloads scale as well as short-prompt ones (prefill is very fast on MoE with 194K tok/s peak throughput).
Summary
| Metric | Value |
|---|---|
| Peak aggregate throughput | 1,776 tok/s @ 256 concurrent (median of 3 trials) |
| Scaling from 1 → 256 | 49.5× throughput (ideal would be 256×) |
| Per-request decode @ 256 | 10.7 tok/s median, 6.5 min |
| Peak server-reported generation | 2,022 tok/s (vLLM engine stats) |
| Peak combined (prompt + gen) | 2,627 tok/s |
| TTFT @ 64 concurrent | 284 ms median (usable) |
| TTFT @ 256 concurrent | 851 ms median (acceptable for batch) |
| Error rate across full test | 0.0% (1,200+ requests) |
| Best concurrency for chat UX | 4-8 (per-request 30-40 tok/s, TTFT <150ms) |
| Best concurrency for throughput | 128-256 (maxes aggregate, TTFT trade-off) |
Key Performance Metrics
| Metric | Value |
|---|---|
| Single-stream decode (README config) | 52.2 tok/s |
| Short-prompt TTFT (README config) | 44 ms |
| Peak aggregate throughput (bench config) | 1,890 tok/s @ 256 concurrent |
| Peak server-reported generation | 2,022 tok/s (vLLM engine stats) |
| Peak combined (prompt + gen) | 2,627 tok/s |
| Model load time | ~4-5 min (weight load + torch.compile + CUDA graphs + FP4 autotune) |
| Model memory footprint | 16.4 GB |
| KV cache capacity | ~700K tokens @ fp8_e4m3 |
| GEMM backend (linear) | FLASHINFER_CUTLASS (native Blackwell FP4 tensor cores) |
| MoE backend | MARLIN (required — FlashInfer MoE variants reject 704-per-expert intermediate) |
| Attention backend | TRITON_ATTN (heterogeneous head dims require Triton) |
| Prefix cache hit rate | ~70-80% (sustained, mixed workload) |
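The ~700K-token KV budget is roughly consistent with a back-of-envelope cache-size calculation (a sketch using the head counts from Model Details; it uses the sliding-window head dim for all layers and a hypothetical ~86 GB cache budget, so treat it as an estimate):

```python
# Rough KV-cache cost per token: 2 (K and V) * kv_heads * head_dim
# * dtype_bytes * num_layers. Ignores the larger global_head_dim on the
# 5 full-attention layers, so it slightly underestimates.
LAYERS, KV_HEADS, HEAD_DIM = 30, 8, 256

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return 2 * KV_HEADS * HEAD_DIM * dtype_bytes * LAYERS

fp8 = kv_bytes_per_token(1)    # fp8_e4m3: 1 byte/element
bf16 = kv_bytes_per_token(2)   # bf16: 2 bytes/element

print(f"fp8  : {fp8 / 1024:.0f} KiB/token")
print(f"bf16 : {bf16 / 1024:.0f} KiB/token")
# Assuming roughly 86 GB left for cache after weights and activations:
print(f"{86e9 / fp8 / 1e3:.0f}K tokens")
```

The factor-of-two between the fp8 and bf16 figures is exactly the "doubles token budget" claim for `--kv-cache-dtype fp8_e4m3`.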
Scaling Efficiency
| Concurrency | Throughput Gain vs 1-req |
|---|---|
| 1 | 1.0x |
| 4 | 2.7x |
| 16 | 7.7x |
| 64 | 19.8x |
| 128 | 32.8x |
| 256 | 50.0x |
Aggregate throughput scales 50x from 1 to 256 concurrent requests — excellent batching efficiency from the MoE architecture. Per-request throughput degrades gracefully from 37.8 tok/s (1-req) to 9.3 tok/s (256-req), still usable for agent workloads with many short-lived subagents.
Why MoE is Fast on DGX Spark
GB10's 273 GB/s memory bandwidth is the bottleneck for LLM decode. MoE dramatically reduces per-token bandwidth demand:
| Model | Params Read/Token | BW Required @ 50 tok/s | Fits GB10? |
|---|---|---|---|
| Dense 27B (BF16) | ~54 GB | 2,700 GB/s | No |
| Dense 27B (NVFP4) | ~13.5 GB | 675 GB/s | No |
| MoE 26B top-8/128 (NVFP4) | ~2.8 GB | 140 GB/s | Yes (51% BW) |
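The bandwidth column follows directly from bytes-read-per-token times decode rate, using the table's own per-token figures and the 273 GB/s GB10 bandwidth from the paragraph above:

```python
# Required memory bandwidth = bytes read per decoded token * tokens/sec.
GB10_BW = 273e9          # GB10 unified-memory bandwidth, bytes/s
TARGET_TOKS = 50         # decode rate used in the table

models = {
    "Dense 27B (BF16)":  54e9,    # bytes read per token
    "Dense 27B (NVFP4)": 13.5e9,
    "MoE 26B (NVFP4)":   2.8e9,
}

for name, bytes_per_tok in models.items():
    need = bytes_per_tok * TARGET_TOKS
    fits = need <= GB10_BW
    print(f"{name}: {need / 1e9:,.0f} GB/s needed, "
          f"fits={fits} ({need / GB10_BW:.0%} of BW)")
```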
Key Specs
| | Original (BF16) | NVFP4 (this model) |
|---|---|---|
| Size on disk | ~49 GB | ~16.4 GB |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B / token | 3.8B / token |
| Architecture | MoE: 128 experts, 8 active / token | same |
| Context window | 262K tokens | 262K tokens |
| Modalities | Text, Image, Video | Text, Image, Video |
| Quantization | — | NVFP4 (W4A4, block size 16) |
| Vision encoder | BF16 | BF16 (preserved, not quantized) |
Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 MoE (26B total, 3.8B active / token) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Experts | 128 total, top-8 active per token |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 2816 |
| MoE Intermediate | 704 per expert |
| Attention Heads | 16 (8 KV heads), head_dim=256, global_head_dim=512 |
| Vision Encoder | 27-layer ViT (1152 hidden, 16 heads, patch_size=16) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 (ModelOpt 0.43 main + 2 pending PRs) |
Pre-Built Container Image
A pre-built vLLM container compiled for NVIDIA DGX Spark (GB10, SM 12.1) is available with all required patches pre-applied:
```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```
Image contents:
- vLLM 0.19.1rc1 compiled for SM 12.1 (Blackwell GB10)
- PyTorch 2.12.0 + CUDA 13.0
- transformers 5.5.0 + FlashInfer 0.6.7
- Patched `gemma4.py` — extends `expert_params_mapping` to the modelopt suffix set (`weight`, `weight_scale`, `weight_scale_2`, `input_scale`)
- Patched `serving.py` — fixes the non-streaming reasoning parser for Gemma 4
- Patched `modelopt.py` — handles the per-expert-decomposed NVFP4 scale format
- Built from eugr/spark-vllm-docker with the `--tf5` flag

Critical: Use the `-awq` variant of the image. The non-`-awq` image does not include the baked-in modelopt scale-handling patches required for this model's per-expert NVFP4 format.
Quick Start
1. Pull the container
```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```
2. Download the model
```shell
pip install -U huggingface-hub hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 \
  hf download AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
  --local-dir ~/models/supergemma4-26b
```
3. Get the patches
Only two patch files need to be mounted — modelopt.py is baked into the -awq image:
```shell
for f in gemma4_patched.py serving_chat_patched.py; do
  curl -LO https://raw.githubusercontent.com/AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4/main/$f
done
```
4. Launch with Docker Compose
Save as docker-compose.yml:
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-supergemma4-26b
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ~/models/supergemma4-26b:/models/supergemma4
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
    environment:
      # Force Marlin MoE path — native FlashInfer MoE variants reject 704-per-expert intermediate
      - VLLM_TEST_FORCE_FP8_MARLIN=1
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/supergemma4 \
          --served-model-name supergemma4-26b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype fp8_e4m3 \
          --tensor-parallel-size 1 \
          --max-model-len 65536 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --host 0.0.0.0 --port 8000 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Then:
```shell
docker compose up -d
```
Startup takes ~4-5 minutes (weight load + torch.compile + CUDA graph capture + FP4 GEMM autotuning).
Workload-tuned configs
| Workload | max-model-len | max-num-seqs | Best for |
|---|---|---|---|
| Long-context (RAG, docs) | 65536 | 4 | Few long conversations |
| Balanced | 8192 | 32 | Mixed chat + agents |
| Max throughput | 2048 | 256 | Many short agents (1,890 tok/s) |
| Max context | 262144 | 1 | Single-stream, max window |
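As an example, the max-throughput row corresponds to swapping these flags into the serve command above (a fragment, not a full invocation; `--max-num-batched-tokens 16384` is the value used in the benchmark section):

```shell
# Max-throughput variant of the serve command: replace the corresponding
# flags in the compose file above; everything else stays the same.
vllm serve /models/supergemma4 \
  --quantization modelopt \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 2048 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85
```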
5. Test
```shell
# Text
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "supergemma4-26b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 300
  }'
```

```shell
# Vision
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "supergemma4-26b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},
        {"type": "text", "text": "Describe what you see."}
      ]
    }],
    "max_tokens": 300
  }'
```
The API is fully OpenAI-compatible — use it with any OpenAI SDK, LangChain, LiteLLM, or Open WebUI at `http://<your-ip>:8000/v1`.
Key Deployment Flags
| Flag | Purpose |
|---|---|
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required — forces the Marlin MoE path (FlashInfer NVFP4 MoE backends reject the 704 intermediate dim) |
| `--quantization modelopt` | Required — tells vLLM to use the NVIDIA ModelOpt NVFP4 format |
| `--kv-cache-dtype fp8_e4m3` | FP8 KV cache — doubles the token budget vs BF16 |
| `--max-model-len 65536` | 64K context. The model supports 262K; trade context for concurrency |
| `--max-num-seqs 4` | README default. Bump to 256 for max-throughput workloads |
| `--gpu-memory-utilization 0.85` | Leaves 15% headroom; tune for your hardware |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks into `reasoning_content` in the API response |
| `--tool-call-parser gemma4` | Native Gemma 4 function/tool calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks |
| `--enable-prefix-caching` | Caches common system-prompt prefixes |
Quantization Details
| Parameter | Value |
|---|---|
| Tool | NVIDIA ModelOpt 0.43.0rc2.dev (from upstream main) |
| Config | NVFP4_DEFAULT_CFG (plain NVFP4, no AWQ) |
| Weight dtype | NVFP4 (FP4 E2M1, block size 16) |
| Calibration samples | 512 (CNN/DailyMail train split) |
| Calibration seq_len | 4096 tokens |
| Batch size | 3 (VRAM-probed) |
| Calibration hardware | NVIDIA RTX PRO 6000 Blackwell (97 GB VRAM) |
| Calibration wall-clock | 12.75 min (via modelopt-fast-moe adaptive batching) |
| Excluded from quantization | vision_tower, embed_vision, multi_modal_projector, routers (BF16) |
| Exported size | 16.42 GB |
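The exclusion row above amounts to a simple name filter over modules — a hypothetical sketch of the selection logic (not ModelOpt's actual API; module names are illustrative):

```python
# Hypothetical module-name filter mirroring the exclusion list above:
# the vision stack and expert routers stay in BF16, everything else
# is a candidate for NVFP4 quantization.
EXCLUDE_SUBSTRINGS = ("vision_tower", "embed_vision",
                      "multi_modal_projector", "router")

def should_quantize(module_name: str) -> bool:
    """True if this module should be NVFP4-quantized."""
    return not any(s in module_name for s in EXCLUDE_SUBSTRINGS)

# Illustrative (made-up) module names:
assert should_quantize("model.layers.3.mlp.experts.17.gate_proj")
assert not should_quantize("model.layers.3.mlp.router")
assert not should_quantize("vision_tower.encoder.layers.0.attn.qkv")
```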
Why plain NVFP4 instead of NVFP4_AWQ?
Earlier experiments used NVFP4_AWQ_FULL_CFG (AWQ with exhaustive alpha grid search) but ran into a deployment-stack limitation: vLLM's ModelOptNvFp4FusedMoE does not support per-expert pre_quant_scale. On MoE models, AWQ calibration computes a per-expert scaling factor that can't be consumed by the MoE kernel path — any AWQ work on experts is wasted at serve time.
Switching to plain NVFP4 (algorithm=max):
- Cuts calibration time from ~2.5h to ~12 min (no alpha search phase)
- Produces a checkpoint vLLM's FusedMoE loads natively without tensor surgery
- Quality hit is negligible since the AWQ benefit on MoE experts was already unavailable at inference time
Attention and dense shared MLP layers still benefit from NVFP4's per-block scaling. Router weights stay in BF16: routing quality is critical for MoE accuracy, and routers are tiny, so they're cheap to leave unquantized.
Applied modelopt patches
Two upstream PR fixes applied locally (pending review as of this writing):
- PR #1264 — `preprocess_linear_fusion` non-scalar amax fix
- PR #1265 — `get_activation_scaling_factor` zero-amax handling
Both are blockers for anyone quantizing per-expert-decomposed MoEs in NVFP4 with modelopt 0.42 or 0.43. The -awq container image includes these patches in its baked modelopt.py — do not override with a stock version.
fast-moe adaptive batched calibration
Calibration uses modelopt-fast-moe — adaptive VRAM-probed batching that fixes the Python-dispatch bottleneck when quantizing MoE models (the naive `for ids in calib_data: model(ids)` loop leaves GPUs at 25-30% utilization).
End-to-end calibration wall-clock:
| Configuration | Wall-clock |
|---|---|
| Naive bs=1 loop (modelopt default) | ~50h projected (killed at 18h) |
| fast-moe + NVFP4_AWQ_FULL (earlier v3 attempt) | 2h 24min |
| fast-moe + NVFP4_DEFAULT (this v6) | 12 min |
NVFP4 Weight Format
Each quantized layer stores:
- `weight` (uint8) — packed FP4 E2M1 pairs (16-element blocks)
- `weight_scale` (float8_e4m3fn) — per-block scale (1 per 16 elements)
- `weight_scale_2` (float32) — per-tensor global scale (stored in modelopt's reciprocal convention)
- `input_scale` (float32) — static activation scale from calibration
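These tensors put the effective weight footprint at about 4.5 bits per parameter (a sketch; the two per-tensor float32 scalars are negligible):

```python
# Effective storage per NVFP4-quantized parameter:
# 4 bits of packed FP4 plus one FP8 scale shared by each 16-element block.
BLOCK = 16

def bits_per_param() -> float:
    fp4_bits = 4                   # two FP4 values packed per uint8
    scale_bits = 8 / BLOCK         # one float8_e4m3fn scale per block
    return fp4_bits + scale_bits   # per-tensor scalars ignored (O(1))

def tensor_bytes(n_params: int) -> int:
    """Approximate on-disk bytes for one quantized weight tensor."""
    # packed weight + block scales + weight_scale_2 + input_scale
    return n_params // 2 + n_params // BLOCK + 4 + 4

print(bits_per_param())          # 4.5
print(tensor_bytes(1 << 20))     # 589832 bytes (~0.56 MB) for 1M params
```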
Quality Validation
Greedy-sampled responses (temperature=0.0):
| Prompt | Response |
|---|---|
| "What is the capital of France?" | "The capital of France is Paris." |
| "What is 17 * 23?" | "391" ✓ |
| "Write a haiku about the ocean." | "Blue waves kiss the shore, / Endless water, salt and spray, / Deep blue mysteries." |
| "Name three cities in Japan." | "1. Tokyo 2. Osaka 3. Kyoto" |
| "One-line Python prime function" | Valid implementation with correct base cases |
| "Explain photosynthesis in one sentence." | Correct, coherent summary |
Speculative Decoding (DFlash — Coming Soon)
A DFlash block-diffusion drafter paired with this model is in training. DFlash can provide 2-3× additional throughput over the numbers above by predicting multi-token blocks in a single draft forward pass. Will be published as a separate drafter repo once training completes.
Dense (31B) vs MoE (26B) Comparison
| Metric | 31B DECKARD Dense | This Model (26B MoE) |
|---|---|---|
| Active params / token | 31.3B | ~3.8B |
| NVFP4 model size | 20.5 GB | 16.4 GB |
| Single-stream tok/s (Spark) | ~11-14 | 37.8 |
| Peak aggregate (Spark) | — | 1,890 tok/s @ 256 |
| Context window | 262K | 262K |
| Vision | Yes | Yes |
| Best for | Quality-critical tasks | Speed, concurrency, efficiency |
Hardware Requirements
| Tier | GPU | Notes |
|---|---|---|
| Target | NVIDIA DGX Spark (128 GB unified) | Full 262K context, up to 6 concurrent seqs |
| Compatible | RTX 5090 (32 GB) | Reduced context, 1-2 seqs |
| Compatible | B200 / GB200 | Full context, high concurrency |
| Compatible | RTX PRO 6000 Blackwell (97 GB) | Calibration + serving |
| Minimum | Any Blackwell GPU (SM 10.0+) | Required for native FP4 |
Native FP4 hardware (Blackwell architecture) is required — will not run on Ampere or Ada GPUs.
Related Projects
Models
| Model | Type | Size | Link |
|---|---|---|---|
| SuperGemma4 26B NVFP4 (this) | MoE NVFP4 | 16.4 GB | GitHub |
| Gemma-4-26B-A4B-it-Uncensored NVFP4 | MoE NVFP4 (compressed-tensors) | 15.3 GB | HuggingFace |
| Gemma 4 31B DECKARD | Dense NVFP4 AWQ | 20.5 GB | HuggingFace |
| gemma-4-31B-it-speculator.eagle3 NVFP4 | EAGLE3 drafter NVFP4 | 3.5 GB | HuggingFace |
Infrastructure
| Resource | Description | Link |
|---|---|---|
| vLLM AWQ Container | Pre-built for DGX Spark (SM 12.1) with all patches | GHCR |
| Build System | spark-vllm-docker | GitHub |
| modelopt-fast-moe | Adaptive batched calibration | GitHub |
| Base Model | SuperGemma4 26B Abliterated Multimodal (BF16) | HuggingFace |
Disclaimer
THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, you expressly acknowledge that you assume full and sole responsibility for all outputs generated, all actions taken based on outputs, and compliance with applicable laws. The authors are not responsible for any harmful, illegal, or objectionable content produced by the model. These tools serve legitimate purposes including security research, red-teaming, content analysis, and creative work. Implement safeguards appropriate to your use case and jurisdiction.
License
This model inherits the Gemma license from Google.
Credits
Quantized by AEON-7 on NVIDIA Blackwell hardware. Built and validated with AI-engineering assistance from Anthropic.
Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build, NVIDIA for TensorRT-Model-Optimizer, and the z-lab / ModelOpt teams for DFlash.