SuperGemma4 26B Abliterated Multimodal — NVFP4

NVFP4-quantized version of Jiunsong/supergemma4-26b-abliterated-multimodal — an abliterated (uncensored) Gemma 4 26B Mixture-of-Experts multimodal model with thinking/reasoning capabilities.

Quantized using NVIDIA ModelOpt 0.43 (main) with NVFP4_DEFAULT_CFG on a native Blackwell GPU. Vision encoder preserved in full BF16. Peak aggregate throughput: 1,890 tok/s @ 256 concurrent on DGX Spark (GB10).

Verified end-to-end: calibrated → exported → served on Spark → benchmarked 1-256 concurrency.


⚠️ IMPORTANT REQUIREMENTS — READ THIS FIRST

This model has non-obvious serving requirements because its per-expert-decomposed NVFP4 scale format needs specific plugin handling. Deviating from these will produce garbage output or crashes. Details below — each requirement is backed by hours of debugging.

🔴 MUST-DO requirements

| # | Requirement | Why |
|---|-------------|-----|
| 1 | Use the `-awq` container image: `ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest` | Has baked-in `modelopt.py` scale handling (PRs #1264, #1265) plus a patched `expert_params_mapping`. The non-`-awq` variant (or any stock vLLM) will crash or produce corrupted output. |
| 2 | Force Marlin MoE: set env `VLLM_TEST_FORCE_FP8_MARLIN=1` and `VLLM_MARLIN_USE_ATOMIC_ADD=1` | FlashInfer NVFP4 MoE backends reject the 704-per-expert intermediate dim. Do NOT set `VLLM_NVFP4_GEMM_BACKEND=marlin`; that would also force Marlin on the LINEAR path, where native FLASHINFER_CUTLASS is faster. Let linear auto-select; only MoE needs Marlin. |
| 3 | Mount both patches: `gemma4_patched.py` and `serving_chat_patched.py` (2 files, NOT 3) | `modelopt_patched.py` is baked into the image; mounting a stock version on top will corrupt the scale convention. Only mount the two listed here. |
| 4 | Use `--quantization modelopt` (not `compressed-tensors`) | This checkpoint uses the modelopt NVFP4 format. `compressed-tensors` looks for different key names and will fail to load. |
| 5 | Native Blackwell GPU required (SM 10.0+) | Ampere / Ada GPUs have no native FP4 compute path. Verified on: GB10 (SM 12.0, DGX Spark), RTX PRO 6000 Blackwell (SM 12.1); should work on B200/GB200 (SM 10.0). |
| 6 | Use vLLM 0.19.1rc1.dev110 or later with Blackwell-compiled FP4 kernels | Stock vLLM wheels don't compile FP4 kernels for SM 10/12. Use the pre-built container, or build from source with `TORCH_CUDA_ARCH_LIST="10.0;12.0;12.1"` and transformers 5.5+. |

✅ Verified-working config (use this verbatim)

```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest   # NOT the non-awq variant
    environment:
      - VLLM_TEST_FORCE_FP8_MARLIN=1     # required for MoE
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
    volumes:
      # model + 2 patches ONLY — do not mount modelopt_patched.py (it's baked in)
      - ./model:/models/supergemma4
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
    command: >
      vllm serve /models/supergemma4
      --quantization modelopt
      --kv-cache-dtype fp8_e4m3
      --tensor-parallel-size 1
      --max-model-len 65536
      --max-num-seqs 4
      --gpu-memory-utilization 0.85
      --trust-remote-code
      --host 0.0.0.0 --port 8000
      --enable-chunked-prefill
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser gemma4
      --reasoning-parser gemma4
```

❌ Known FAILURE modes (things that DON'T work)

| What you might try | What happens | Fix |
|---|---|---|
| Use `vllm-spark-gemma4-nvfp4:latest` (non-`-awq` image) | Model loads but produces "-Hello-Hello-" or empty content | Switch to the `-awq` variant |
| Drop `VLLM_TEST_FORCE_FP8_MARLIN=1` and let native FP4 MoE auto-select | KeyError during weight load OR corrupted output | Re-add the env var |
| Mount your own `modelopt_patched.py` on top | Scale values get double-inverted, producing garbage output | Remove the mount; use the image's baked version |
| Use `--quantization compressed-tensors` | `KeyError: 'weight_packed'` during load | Use `modelopt` |
| Use `--kv-cache-dtype fp8` instead of `fp8_e4m3` | Works, but slight accuracy drift on long context | Use `fp8_e4m3` as specified |
| Mount `eagle_patched.py` for spec decode | `AttributeError: image_token_index` | Gemma4 EAGLE3 not yet supported upstream; omit spec decode for now |

🐛 If you see gibberish output after following all of the above

  1. Verify the image: docker inspect <container> | grep Image should show vllm-spark-gemma4-nvfp4-awq
  2. Verify mounts: docker inspect <container> --format '{{json .Mounts}}' should show exactly 3 mounts (model + 2 patches)
  3. Verify backend selection in logs:
    • Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
    • Using 'MARLIN' NvFp4 MoE backend
  4. Test with raw chat: {"messages": [{"role": "user", "content": "Capital of France? One sentence."}]} — should return "The capital of France is Paris.". If not, check the container logs for crashes or UNEXPECTED tensor warnings at load time.
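Step 2 of this checklist can be done mechanically. A small sketch, assuming the JSON shape emitted by `docker inspect --format '{{json .Mounts}}'` (the sample below is illustrative; the paths mirror the compose file above):

```python
import json

# Sample output of: docker inspect <container> --format '{{json .Mounts}}'
# (host-side Source paths are placeholders)
mounts_json = """[
  {"Source": "/home/user/models/supergemma4-26b", "Destination": "/models/supergemma4"},
  {"Source": "/home/user/gemma4_patched.py",
   "Destination": "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py"},
  {"Source": "/home/user/serving_chat_patched.py",
   "Destination": "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py"}
]"""

def check_mounts(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the mount set looks right."""
    mounts = json.loads(raw)
    dests = {m["Destination"] for m in mounts}
    problems = []
    if len(mounts) != 3:
        problems.append(f"expected exactly 3 mounts (model + 2 patches), found {len(mounts)}")
    if "/models/supergemma4" not in dests:
        problems.append("model directory not mounted at /models/supergemma4")
    # modelopt.py must NOT be mounted -- the image's baked version is required
    if any(d.endswith("quantization/modelopt.py") for d in dests):
        problems.append("modelopt.py is mounted; remove it and use the baked-in version")
    return problems

print(check_mounts(mounts_json))  # -> []
```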

Performance Benchmarks

NVIDIA DGX Spark (GB10, SM 12.0, 128 GB unified memory) — vLLM 0.19.1rc1.dev110+gb55d830ec, FP8 E4M3 KV cache, native FlashInfer CUTLASS linear + Marlin MoE backend, --gpu-memory-utilization 0.85.

1. Single-Stream Performance (README-spec config)

--max-num-seqs 4, --max-model-len 65536, --gpu-memory-utilization 0.85. Best for interactive chat, agentic UX, single-user serving. All measurements greedy sampling (temp=0) unless noted.

Decode rate (10 trials, 200 tokens output)

| Statistic | tok/s |
|---|---|
| Median | 51.1 |
| P95 | 51.5 |
| Min | 50.9 |
| Max | 51.5 |

Extremely stable — ±0.5 tok/s variance across 10 trials.

TTFT by prompt length

Time from request to first token, across 5 trials each:

| Prompt class | Prompt tokens | TTFT median | TTFT p95 | TTFT min | Effective prefill |
|---|---|---|---|---|---|
| Tiny | 14 | 56 ms | 58 ms | 55 ms | 250 tok/s |
| Short | 19 | 49 ms | 61 ms | 48 ms | 386 tok/s |
| Medium | 49 | 45 ms | 46 ms | 44 ms | 1,093 tok/s |
| Long | 465 | 47 ms | 47 ms | 45 ms | 9,996 tok/s |

Even 465-token prompts give sub-50ms TTFT — fixed kernel-launch overhead dominates over prefill for anything < ~500 tokens.
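The "Effective prefill" column is just prompt tokens divided by TTFT. A quick sketch of the arithmetic (the ~45 ms fixed floor is an assumption read off the Medium row, not a measured constant):

```python
def effective_prefill(prompt_tokens: int, ttft_s: float) -> float:
    """Effective prefill rate = prompt tokens / time-to-first-token."""
    return prompt_tokens / ttft_s

# Long row from the table: 465 tokens at 47 ms median TTFT
print(round(effective_prefill(465, 0.047)))   # ~9894 tok/s, same ballpark as the table

# Why TTFT looks flat: with an assumed ~45 ms launch/scheduling floor,
# actual prefill compute is a tiny fraction of TTFT at these lengths.
overhead_s = 0.045
prefill_share = (0.047 - overhead_s) / 0.047
print(f"prefill is only {prefill_share:.0%} of TTFT at 465 tokens")
```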

Decode rate by output length

Longer outputs are slightly slower due to growing KV cache:

| Max tokens | Actual tokens | TTFT | Decode rate | Total latency |
|---|---|---|---|---|
| 50 | 50 | 49 ms | 51.9 tok/s | 1.01 s |
| 200 | 200 | 50 ms | 50.9 tok/s | 3.98 s |
| 500 | 500 | 49 ms | 50.7 tok/s | 9.90 s |
| 1000 | 558* | 61 ms | 50.6 tok/s | 11.10 s |

*Generation stopped early: the model hit EOS naturally before reaching 1,000 tokens.

Sampling: Greedy vs Stochastic

Temperature has negligible performance impact:

| Mode | Decode median | Decode p95 | TTFT median |
|---|---|---|---|
| Greedy (temp=0) | 51.9 tok/s | 52.1 tok/s | 48 ms |
| Stochastic (temp=0.7) | 51.0 tok/s | 51.2 tok/s | 51 ms |

Long-prompt prefill (RAG / document workloads)

Prefill throughput scales impressively with length — MoE's sparse compute is the perfect shape for prefill:

| Target prompt tokens | Actual | TTFT | Prefill rate | Decode rate (after prefill) |
|---|---|---|---|---|
| 1K | 809 | 0.06 s | 14,450 tok/s | 52.1 tok/s |
| 4K | 3,172 | 0.05 s | 66,851 tok/s | 51.9 tok/s |
| 16K | 12,625 | 0.09 s | 139,648 tok/s | 50.6 tok/s |
| 32K | 25,228 | 0.13 s | 194,082 tok/s | 48.3 tok/s |

Decode only drops 7% at 32K context — excellent KV-cache bandwidth behavior. Prefill peaks around 194K tok/s at 32K prompt length.

Summary

| Metric | Value |
|---|---|
| Single-stream decode (200-tok output) | 51.1 tok/s median |
| Short-prompt TTFT | 44-56 ms |
| 16K-prompt TTFT | 90 ms |
| 32K-prompt TTFT | 130 ms |
| Peak prefill throughput | 194K tok/s @ 32K prompt |
| Decode rate with 32K context | 48.3 tok/s (7% drop vs short context) |

This is in line with the original v6 validation (52.6 tok/s / 54 ms TTFT): decode is within ~3% and TTFT is slightly better.

2. Concurrent-Session Performance (max-throughput config)

--max-num-seqs 256, --max-model-len 2048, --max-num-batched-tokens 16384, --gpu-memory-utilization 0.85. Best for agent fleets, multi-user serving, batch inference. 3 trials per level with median reported. Mixed prompts (code, math, QA, creative), 200 token output, temp=0.7, SSE streaming.

Throughput scaling (N concurrent clients, 200-tok output)

| Concurrent | Err | Agg tok/s (median of 3) | Per-req decode p50 | Per-req decode min | TTFT p50 | TTFT p95 | TTFT max |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 35.9 | 50.8 | 50.8 | 64 ms | 64 ms | 64 ms |
| 2 | 0 | 51.8 | 48.3 | 45.9 | 59 ms | 59 ms | 59 ms |
| 4 | 0 | 88.6 | 39.0 | 36.4 | 70 ms | 71 ms | 71 ms |
| 8 | 0 | 149.1 | 31.7 | 28.8 | 135 ms | 135 ms | 135 ms |
| 16 | 0 | 269.6 | 24.9 | 23.1 | 149 ms | 150 ms | 150 ms |
| 32 | 0 | 422.3 | 20.0 | 18.5 | 194 ms | 195 ms | 195 ms |
| 64 | 0 | 711.5 | 16.6 | 15.6 | 284 ms | 285 ms | 286 ms |
| 128 | 0 | 1,154.0 | 13.8 | 13.2 | 449 ms | 545 ms | 548 ms |
| 256 | 0 | 1,775.9 | 10.7 | 6.5 | 851 ms | 863 ms | 864 ms |

Zero errors across 1,200+ requests in the full test. Aggregate throughput scales nearly linearly up to 128 concurrent, with diminishing returns at 256 as scheduling and KV-cache contention dominate.

Note: single-stream here is 35.9 tok/s (vs 51.1 in README config) because max-num-seqs=256 forces allocation of 50+ CUDA graph sizes and different scheduling heuristics that optimize for batched throughput over single-stream latency. Use README config for chat; use this config for fleets.
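The scaling figures can be recomputed directly from the table above; a quick sketch:

```python
# Aggregate tok/s at each concurrency level, from the table above.
agg = {1: 35.9, 2: 51.8, 4: 88.6, 8: 149.1, 16: 269.6,
       32: 422.3, 64: 711.5, 128: 1154.0, 256: 1775.9}
base = agg[1]
gain = {n: v / base for n, v in agg.items()}        # speedup vs single stream
eff = {n: v / (base * n) for n, v in agg.items()}   # fraction of ideal linear scaling

print(f"{gain[256]:.1f}x at 256 concurrent")         # ~49.5x
print(f"efficiency: {eff[8]:.0%} at 8, {eff[256]:.0%} at 256")
```

The efficiency numbers make the diminishing-returns story concrete: each request still gets half its solo throughput at 8 concurrent, but under a fifth at 256.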

TTFT-only scaling (prefill + first token, 1-token output)

Measures how much queue contention affects time-to-first-token — critical for agent UX:

| Concurrent | TTFT p50 | TTFT p95 | TTFT max | TTFT min |
|---|---|---|---|---|
| 1 | 47 ms | 48 ms | 48 ms | 46 ms |
| 4 | 48 ms | 88 ms | 88 ms | 46 ms |
| 16 | 100 ms | 102 ms | 102 ms | 55 ms |
| 64 | 171 ms | 174 ms | 175 ms | 91 ms |
| 256 | 522 ms | 527 ms | 530 ms | 167 ms |

TTFT stays under 200 ms up through 64 concurrent, a smooth UX for small agent fleets. Beyond that it roughly triples per 4× concurrency step (171 ms at 64 → 522 ms at 256) as requests queue for scheduler capacity.

Concurrent with 1K-token prompts (RAG-style workload)

50-token output with 1,024-token prompts — simulates agents doing document QA or retrieval-augmented responses:

| Concurrent | Err | Agg tok/s | TTFT p50 | TTFT p95 | Decode p50 |
|---|---|---|---|---|---|
| 1 | 0 | 42.8 | 55 ms | 55 ms | 49.3 |
| 4 | 0 | 109.2 | 82 ms | 103 ms | 38.7 |
| 16 | 0 | 272.2 | 147 ms | 147 ms | 27.0 |
| 64 | 0 | 711.9 | 261 ms | 293 ms | 16.9 |

Long-prompt concurrent workloads scale as well as short-prompt ones (prefill is very fast on MoE with 194K tok/s peak throughput).

Summary

| Metric | Value |
|---|---|
| Peak aggregate throughput | 1,776 tok/s @ 256 concurrent (median of 3 trials) |
| Scaling from 1 → 256 | 49.5× throughput (ideal would be 256×) |
| Per-request decode @ 256 | 10.7 tok/s median, 6.5 min |
| Peak server-reported generation | 2,022 tok/s (vLLM engine stats) |
| Peak combined (prompt + gen) | 2,627 tok/s |
| TTFT @ 64 concurrent | 284 ms median (usable) |
| TTFT @ 256 concurrent | 851 ms median (acceptable for batch) |
| Error rate across full test | 0.0% (1,200+ requests) |
| Best concurrency for chat UX | 4-8 (per-request 30-40 tok/s, TTFT <150 ms) |
| Best concurrency for throughput | 128-256 (maxes aggregate, TTFT trade-off) |

Key Performance Metrics

| Metric | Value |
|---|---|
| Single-stream decode (README config) | 52.2 tok/s |
| Short-prompt TTFT (README config) | 44 ms |
| Peak aggregate throughput (bench config) | 1,890 tok/s @ 256 concurrent |
| Peak server-reported generation | 2,022 tok/s (vLLM engine stats) |
| Peak combined (prompt + gen) | 2,627 tok/s |
| Model load time | ~4-5 min (weight load + torch.compile + CUDA graphs + FP4 autotune) |
| Model memory footprint | 16.4 GB |
| KV cache capacity | ~700K tokens @ fp8_e4m3 |
| GEMM backend (linear) | FLASHINFER_CUTLASS (native Blackwell FP4 tensor cores) |
| MoE backend | MARLIN (required; FlashInfer MoE variants reject the 704-per-expert intermediate) |
| Attention backend | TRITON_ATTN (heterogeneous head dims require Triton) |
| Prefix cache hit rate | ~70-80% (sustained, mixed workload) |
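The ~700K-token KV-cache figure is consistent with a back-of-envelope calculation from the model shape. A sketch, under simplifying assumptions (uniform head_dim=256 across all 30 layers, ignoring both the global layers' larger 512 head_dim and the sliding-window layers' 1024-token cap, which pull in opposite directions):

```python
# KV bytes per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem
layers, kv_heads, head_dim = 30, 8, 256
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1   # fp8_e4m3 = 1 byte
print(kv_bytes_per_token)                                    # 122880 B = 120 KiB/token

# 128 GB unified memory at 0.85 utilization, minus the 16.4 GB of weights
# (activations/workspace ignored, so this is an upper bound)
budget_gb = 128 * 0.85 - 16.4
tokens = budget_gb * 1024**3 / kv_bytes_per_token
print(f"~{tokens / 1e3:.0f}K tokens")   # engine overhead brings this down toward ~700K
```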

Scaling Efficiency

| Concurrency | Throughput gain vs 1-req |
|---|---|
| 1 | 1.0x |
| 4 | 2.7x |
| 16 | 7.7x |
| 64 | 19.8x |
| 128 | 32.8x |
| 256 | 50.0x |

Aggregate throughput scales 50x from 1 to 256 concurrent requests — excellent batching efficiency from the MoE architecture. Per-request throughput degrades gracefully from 37.8 tok/s (1-req) to 9.3 tok/s (256-req), still usable for agent workloads with many short-lived subagents.

Why MoE is Fast on DGX Spark

GB10's 273 GB/s memory bandwidth is the bottleneck for LLM decode. MoE dramatically reduces per-token bandwidth demand:

| Model | Params Read/Token | BW Required @ 50 tok/s | Fits GB10? |
|---|---|---|---|
| Dense 27B (BF16) | ~54 GB | 2,700 GB/s | No |
| Dense 27B (NVFP4) | ~13.5 GB | 675 GB/s | No |
| MoE 26B top-8/128 (NVFP4) | ~2.8 GB | 140 GB/s | Yes (51% BW) |
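The table's arithmetic is easy to reproduce. Note the ~2.8 GB/token MoE figure is larger than raw active-expert weights alone (3.8B params at ~0.5 byte is ~1.9 GB); attributing the remainder to block scales, shared/attention weights, and KV reads is an assumption here:

```python
# Bandwidth needed = bytes read per token x target tokens/s.
GB10_BW = 273   # GB/s, GB10 unified-memory bandwidth
TOK_S = 50      # target decode rate

read_gb = {                        # "Params Read/Token" column above
    "Dense 27B (BF16)": 54.0,
    "Dense 27B (NVFP4)": 13.5,
    "MoE top-8 (NVFP4)": 2.8,
}
for name, gb in read_gb.items():
    need = gb * TOK_S
    print(f"{name}: {need:,.0f} GB/s -> {'fits' if need <= GB10_BW else 'does not fit'}")

print(f"MoE uses {read_gb['MoE top-8 (NVFP4)'] * TOK_S / GB10_BW:.0%} of GB10 bandwidth")
```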

Key Specs

| | Original (BF16) | NVFP4 (this model) |
|---|---|---|
| Size on disk | ~49 GB | ~16.4 GB |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B / token | 3.8B / token |
| Architecture | MoE: 128 experts, 8 active / token | same |
| Context window | 262K tokens | 262K tokens |
| Modalities | Text, Image, Video | Text, Image, Video |
| Quantization | none | NVFP4 (W4A4, block size 16) |
| Vision encoder | BF16 | BF16 (preserved, not quantized) |

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 MoE (26B total, 3.8B active / token) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Experts | 128 total, top-8 active per token |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 2816 |
| MoE Intermediate | 704 per expert |
| Attention Heads | 16 (8 KV heads), head_dim=256, global_head_dim=512 |
| Vision Encoder | 27-layer ViT (1152 hidden, 16 heads, patch_size=16) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 (ModelOpt 0.43 main + 2 pending PRs) |

Pre-Built Container Image

A pre-built vLLM container compiled for NVIDIA DGX Spark (GB10, SM 12.1) is available with all required patches pre-applied:

```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```

Image contents:

  • vLLM 0.19.1rc1 compiled for SM 12.1 (Blackwell GB10)
  • PyTorch 2.12.0 + CUDA 13.0
  • transformers 5.5.0 + FlashInfer 0.6.7
  • Patched gemma4.py — extends expert_params_mapping to the modelopt suffix set (weight, weight_scale, weight_scale_2, input_scale)
  • Patched serving.py — fixes non-streaming reasoning parser for Gemma 4
  • Patched modelopt.py — handles the per-expert-decomposed NVFP4 scale format
  • Built from eugr/spark-vllm-docker with --tf5 flag

Critical: Use the `-awq` variant of the image. The non-`-awq` image does not include the baked-in modelopt scale-handling patches required for this model's per-expert NVFP4 format.



Quick Start

1. Pull the container

```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```

2. Download the model

```shell
pip install -U huggingface-hub hf_transfer

HF_HUB_ENABLE_HF_TRANSFER=1 \
  hf download AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
  --local-dir ~/models/supergemma4-26b
```

3. Get the patches

Only two patch files need to be mounted — modelopt.py is baked into the -awq image:

```shell
for f in gemma4_patched.py serving_chat_patched.py; do
  curl -LO https://raw.githubusercontent.com/AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4/main/$f
done
```

4. Launch with Docker Compose

Save as docker-compose.yml:

```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-supergemma4-26b
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/supergemma4-26b:/models/supergemma4
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
    environment:
      # Force Marlin MoE path — native FlashInfer MoE variants reject 704-per-expert intermediate
      - VLLM_TEST_FORCE_FP8_MARLIN=1
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/supergemma4 \
          --served-model-name supergemma4-26b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype fp8_e4m3 \
          --tensor-parallel-size 1 \
          --max-model-len 65536 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --host 0.0.0.0 --port 8000 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Then:

```shell
docker compose up -d
```

Startup takes ~4-5 minutes (weight load + torch.compile + CUDA graph capture + FP4 GEMM autotuning).

Workload-tuned configs

| Workload | max-model-len | max-num-seqs | Best for |
|---|---|---|---|
| Long-context (RAG, docs) | 65536 | 4 | Few long conversations |
| Balanced | 8192 | 32 | Mixed chat + agents |
| Max throughput | 2048 | 256 | Many short agents (1,890 tok/s) |
| Max context | 262144 | 1 | Single-stream, max window |

5. Test

```shell
# Text
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "supergemma4-26b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 300
  }'

# Vision
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "supergemma4-26b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},
        {"type": "text", "text": "Describe what you see."}
      ]
    }],
    "max_tokens": 300
  }'
```

The API is fully OpenAI-compatible — use with any OpenAI SDK, LangChain, LiteLLM, Open WebUI at http://<your-ip>:8000/v1.
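Since the endpoint is OpenAI-compatible, a request is just a JSON payload like the curl examples above. A small sketch that builds the vision payload in Python (the image URL here is a placeholder; to actually send it, point any OpenAI SDK at http://<your-ip>:8000/v1):

```python
# Build an OpenAI-style multimodal chat request for this server.
def vision_message(image_url: str, question: str) -> dict:
    """One user turn mixing an image part and a text part."""
    return {"role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ]}

payload = {
    "model": "supergemma4-26b",   # matches --served-model-name above
    "messages": [vision_message("https://example.com/photo.png",
                                "Describe what you see.")],
    "max_tokens": 300,
}
print([part["type"] for part in payload["messages"][0]["content"]])  # ['image_url', 'text']
```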


Key Deployment Flags

| Flag | Purpose |
|---|---|
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required; forces the Marlin MoE path (FlashInfer NVFP4 MoE backends reject the 704 intermediate) |
| `--quantization modelopt` | Required; tells vLLM to use the NVIDIA ModelOpt NVFP4 format |
| `--kv-cache-dtype fp8_e4m3` | FP8 KV cache; doubles the token budget vs BF16 |
| `--max-model-len 65536` | 64K context. Model supports 262K; trade for concurrency |
| `--max-num-seqs 4` | README default. Bump to 256 for max-throughput workloads |
| `--gpu-memory-utilization 0.85` | Leaves 15% headroom; tune for your hardware |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks into `reasoning_content` in the API response |
| `--tool-call-parser gemma4` | Native Gemma 4 function/tool calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks |
| `--enable-prefix-caching` | Caches common system-prompt prefixes |
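To make the reasoning-parser behavior concrete, here is a toy sketch (illustrative only, not vLLM's actual gemma4 parser) of splitting `<think>` blocks out of model output into `reasoning_content`:

```python
import re

def split_reasoning(text: str) -> dict:
    """Toy reasoning parser: move <think>...</think> spans into
    reasoning_content and return the rest as content."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    content = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return {"reasoning_content": "\n".join(t.strip() for t in thoughts) or None,
            "content": content}

msg = "<think>17*23 = 17*20 + 17*3 = 391</think>The answer is 391."
print(split_reasoning(msg))
# {'reasoning_content': '17*23 = 17*20 + 17*3 = 391', 'content': 'The answer is 391.'}
```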

Quantization Details

| Parameter | Value |
|---|---|
| Tool | NVIDIA ModelOpt 0.43.0rc2.dev (from upstream main) |
| Config | NVFP4_DEFAULT_CFG (plain NVFP4, no AWQ) |
| Weight dtype | NVFP4 (FP4 E2M1, block size 16) |
| Calibration samples | 512 (CNN/DailyMail train split) |
| Calibration seq_len | 4096 tokens |
| Batch size | 3 (VRAM-probed) |
| Calibration hardware | NVIDIA RTX PRO 6000 Blackwell (97 GB VRAM) |
| Calibration wall-clock | 12.75 min (via modelopt-fast-moe adaptive batching) |
| Excluded from quantization | vision_tower, embed_vision, multi_modal_projector, routers (BF16) |
| Exported size | 16.42 GB |
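For reference, FP4 E2M1 (the weight element type) is small enough to decode by hand: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit. A sketch:

```python
def fp4_e2m1_decode(nibble: int) -> float:
    """Decode one 4-bit FP4 E2M1 value.
    Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    exp = (nibble >> 1) & 0b11
    man = nibble & 0b1
    if exp == 0:                            # subnormal: mantissa * 0.5
        mag = man * 0.5
    else:                                   # normal: (1 + m/2) * 2^(exp-1)
        mag = (1 + man / 2) * 2 ** (exp - 1)
    return sign * mag

print([fp4_e2m1_decode(n) for n in range(8)])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only 8 magnitudes available, the per-16-element FP8 block scales do the heavy lifting of matching each block's dynamic range.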

Why plain NVFP4 instead of NVFP4_AWQ?

Earlier experiments used NVFP4_AWQ_FULL_CFG (AWQ with exhaustive alpha grid search) but ran into a deployment-stack limitation: vLLM's ModelOptNvFp4FusedMoE does not support per-expert pre_quant_scale. On MoE models, AWQ calibration computes a per-expert scaling factor that can't be consumed by the MoE kernel path — any AWQ work on experts is wasted at serve time.

Switching to plain NVFP4 (algorithm=max):

  • Cuts calibration time from ~2.5h to ~12 min (no alpha search phase)
  • Produces a checkpoint vLLM's FusedMoE loads natively without tensor surgery
  • Quality hit is negligible since the AWQ benefit on MoE experts was already unavailable at inference time

Attention and dense shared MLP layers still benefit from NVFP4's per-block scaling. Router weights stay in BF16: routing quality is critical for MoE accuracy, and router weights are tiny, so leaving them unquantized costs almost nothing.

Applied modelopt patches

Two upstream PR fixes applied locally (pending review as of this writing):

  • PR #1264: `preprocess_linear_fusion` non-scalar amax fix
  • PR #1265: `get_activation_scaling_factor` zero-amax handling

Both are blockers for anyone quantizing per-expert-decomposed MoEs in NVFP4 with modelopt 0.42 or 0.43. The -awq container image includes these patches in its baked modelopt.py — do not override with a stock version.

fast-moe adaptive batched calibration

Calibration uses modelopt-fast-moe — adaptive VRAM-probed batching that fixes the Python-dispatch bottleneck when quantizing MoE models (the naive for ids in calib_data: model(ids) loop leaves GPUs at 25-30% utilization).

End-to-end calibration wall-clock:

| Configuration | Wall-clock |
|---|---|
| Naive bs=1 loop (modelopt default) | ~50 h projected (killed at 18 h) |
| fast-moe + NVFP4_AWQ_FULL (earlier v3 attempt) | 2 h 24 min |
| fast-moe + NVFP4_DEFAULT (this v6) | 12 min |
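The core idea behind VRAM-probed batching can be sketched in a few lines (illustrative only, not the actual modelopt-fast-moe code; `fits` stands in for a real try-a-batch-and-catch-OOM probe):

```python
def probe_batch_size(fits, limit=64):
    """Find the largest calibration batch size that fits in VRAM:
    double coarsely, then step by one to refine."""
    bs = 1
    while bs * 2 <= limit and fits(bs * 2):   # coarse phase: double
        bs *= 2
    while bs + 1 <= limit and fits(bs + 1):   # fine phase: +1 steps
        bs += 1
    return bs

# Toy stand-in for a VRAM check: pretend at most 3 sequences fit,
# mirroring the VRAM-probed batch size of 3 used for this model.
fits = lambda bs: bs <= 3
print(probe_batch_size(fits))   # 3
```

Calibrating at the probed batch size keeps the GPU saturated instead of the 25-30% utilization of the per-sample loop.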

NVFP4 Weight Format

Each quantized layer stores:

  • weight (uint8) — packed FP4 E2M1 pairs (16-element blocks)
  • weight_scale (float8_e4m3fn) — per-block scale (1 per 16 elements)
  • weight_scale_2 (float32) — per-tensor global scale (stored as modelopt reciprocal convention)
  • input_scale (float32) — static activation scale from calibration
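This format also roughly explains the on-disk size. A back-of-envelope accounting (the ~0.8 GB for the vision tower, routers, and norms is an estimate, and treating all 25.2B parameters as quantized slightly double-counts the BF16 pieces):

```python
params = 25.2e9
packed_fp4 = params * 0.5            # 4 bits per param -> ~12.6 GB
block_scales = params / 16 * 1       # one fp8 scale per 16-element block -> ~1.6 GB
embed_bf16 = 262_144 * 2816 * 2      # vocab x hidden kept in BF16 -> ~1.5 GB
other_bf16 = 0.8                     # vision tower + routers + norms (rough estimate)

total_gb = (packed_fp4 + block_scales + embed_bf16) / 1e9 + other_bf16
print(f"~{total_gb:.1f} GB")         # close to the 16.42 GB exported size
```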

Quality Validation

Greedy-sampled responses (temperature=0.0):

| Prompt | Response |
|---|---|
| "What is the capital of France?" | "The capital of France is Paris." |
| "What is 17 * 23?" | "391" ✓ |
| "Write a haiku about the ocean." | "Blue waves kiss the shore, / Endless water, salt and spray, / Deep blue mysteries." |
| "Name three cities in Japan." | "1. Tokyo 2. Osaka 3. Kyoto" |
| "One-line Python prime function" | Valid implementation with correct base cases |
| "Explain photosynthesis in one sentence." | Correct, coherent summary |

Speculative Decoding (DFlash — Coming Soon)

A DFlash block-diffusion drafter paired with this model is in training. DFlash can provide 2-3× additional throughput over the numbers above by predicting multi-token blocks in a single draft forward pass. Will be published as a separate drafter repo once training completes.


Dense (31B) vs MoE (26B) Comparison

| Metric | 31B DECKARD Dense | This Model (26B MoE) |
|---|---|---|
| Active params / token | 31.3B | ~3.8B |
| NVFP4 model size | 20.5 GB | 16.4 GB |
| Single-stream tok/s (Spark) | ~11-14 | 37.8 |
| Peak aggregate (Spark) | — | 1,890 tok/s @ 256 |
| Context window | 262K | 262K |
| Vision | Yes | Yes |
| Best for | Quality-critical tasks | Speed, concurrency, efficiency |

Hardware Requirements

| Tier | GPU | Notes |
|---|---|---|
| Target | NVIDIA DGX Spark (128 GB unified) | Full 262K context, up to 6 concurrent seqs |
| Compatible | RTX 5090 (32 GB) | Reduced context, 1-2 seqs |
| Compatible | B200 / GB200 | Full context, high concurrency |
| Compatible | RTX PRO 6000 Blackwell (97 GB) | Calibration + serving |
| Minimum | Any Blackwell GPU (SM 10.0+) | Required for native FP4 |

Native FP4 hardware (Blackwell architecture) is required — will not run on Ampere or Ada GPUs.


Related Projects

Models

| Model | Type | Size | Link |
|---|---|---|---|
| SuperGemma4 26B NVFP4 (this) | MoE NVFP4 | 16.4 GB | GitHub |
| Gemma-4-26B-A4B-it-Uncensored NVFP4 | MoE NVFP4 (compressed-tensors) | 15.3 GB | HuggingFace |
| Gemma 4 31B DECKARD | Dense NVFP4 AWQ | 20.5 GB | HuggingFace |
| gemma-4-31B-it-speculator.eagle3 NVFP4 | EAGLE3 drafter NVFP4 | 3.5 GB | HuggingFace |

Infrastructure

| Resource | Description | Link |
|---|---|---|
| vLLM AWQ Container | Pre-built for DGX Spark (SM 12.1) with all patches | GHCR |
| Build System | spark-vllm-docker | GitHub |
| modelopt-fast-moe | Adaptive batched calibration | GitHub |
| Base Model | SuperGemma4 26B Abliterated Multimodal (BF16) | HuggingFace |

Disclaimer

THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, you expressly acknowledge that you assume full and sole responsibility for all outputs generated, all actions taken based on outputs, and compliance with applicable laws. The authors are not responsible for any harmful, illegal, or objectionable content produced by the model. These tools serve legitimate purposes including security research, red-teaming, content analysis, and creative work. Implement safeguards appropriate to your use case and jurisdiction.


License

This model inherits the Gemma license from Google.

Credits

Quantized by AEON-7 on NVIDIA Blackwell hardware. Built and validated with AI-engineering assistance from Anthropic.

Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build, NVIDIA for TensorRT-Model-Optimizer, and the z-lab / ModelOpt teams for DFlash.
