SuperGemma4 26B Abliterated Multimodal — NVFP4
NVFP4-quantized version of Jiunsong/supergemma4-26b-abliterated-multimodal — an abliterated (uncensored) Gemma 4 26B Mixture-of-Experts multimodal model with thinking/reasoning capabilities.
Quantized using NVIDIA ModelOpt 0.43 (main) with NVFP4_DEFAULT_CFG on a native Blackwell GPU. Vision encoder preserved in full BF16. Peak aggregate throughput: 1,890 tok/s @ 256 concurrent on DGX Spark (GB10).
Verified end-to-end: calibrated → exported → served on Spark → benchmarked 1-256 concurrency.
⚠️ IMPORTANT REQUIREMENTS — READ THIS FIRST
This model has non-obvious serving requirements because its per-expert-decomposed NVFP4 scale format needs specific plugin handling. Deviating from these will produce garbage output or crashes. Details below — each requirement is backed by hours of debugging.
🔴 MUST-DO requirements
| # | Requirement | Why |
|---|---|---|
| 1 | Use the `-awq` container image: `ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest` | Has baked-in `modelopt.py` scale handling (PRs #1264, #1265) plus a patched `expert_params_mapping`. The non-`-awq` variant (or any stock vLLM) will crash or produce corrupted output. |
| 2 | Force Marlin MoE: set `VLLM_TEST_FORCE_FP8_MARLIN=1` and `VLLM_MARLIN_USE_ATOMIC_ADD=1` | FlashInfer NVFP4 MoE backends reject the 704-per-expert intermediate dim. Do NOT set `VLLM_NVFP4_GEMM_BACKEND=marlin` — that would also force Marlin on the linear path, where native FLASHINFER_CUTLASS is faster. Let linear auto-select; only MoE needs Marlin. |
| 3 | Mount both patches: `gemma4_patched.py` and `serving_chat_patched.py` (2 files, NOT 3) | `modelopt_patched.py` is baked into the image — mounting a stock version on top will corrupt the scale convention. Only mount the two listed here. |
| 4 | Use `--quantization modelopt` (not `compressed-tensors`) | This checkpoint uses the modelopt NVFP4 format. `compressed-tensors` looks for different key names and will fail to load. |
| 5 | Native Blackwell GPU required (SM 10.0+) | Ampere / Ada GPUs have no native FP4 compute path. Verified on GB10 (SM 12.0, DGX Spark) and RTX PRO 6000 Blackwell (SM 12.1); should work on B200/GB200 (SM 10.0). |
| 6 | Use vLLM 0.19.1rc1.dev110 or later with Blackwell-compiled FP4 kernels | Stock vLLM wheels don't compile FP4 kernels for SM 10/12. Use the pre-built container, or build from source with `TORCH_CUDA_ARCH_LIST="10.0;12.0;12.1"` and transformers 5.5+. |
✅ Verified-working config (use this verbatim)
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest  # NOT the non-awq variant
    environment:
      - VLLM_TEST_FORCE_FP8_MARLIN=1  # required for MoE
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
    volumes:
      # model + 2 patches ONLY — do not mount modelopt_patched.py (it's baked in)
      - ./model:/models/supergemma4
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
    command: >
      vllm serve /models/supergemma4
      --quantization modelopt
      --kv-cache-dtype fp8_e4m3
      --tensor-parallel-size 1
      --max-model-len 65536
      --max-num-seqs 4
      --gpu-memory-utilization 0.85
      --trust-remote-code
      --host 0.0.0.0 --port 8000
      --enable-chunked-prefill
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser gemma4
      --reasoning-parser gemma4
```
❌ Known FAILURE modes (things that DON'T work)
| What you might try | What happens | Fix |
|---|---|---|
| Use `vllm-spark-gemma4-nvfp4:latest` (non-awq image) | Model loads but produces "-Hello-Hello-" or empty content | Switch to the `-awq` variant |
| Drop `VLLM_TEST_FORCE_FP8_MARLIN=1` — let native FP4 MoE auto-select | `KeyError` during weight load OR corrupted output | Re-add the env var |
| Mount your own `modelopt_patched.py` on top | Scale values get double-inverted → garbage output | Remove the mount; use the image's baked version |
| Use `--quantization compressed-tensors` | `KeyError: 'weight_packed'` during load | Use `modelopt` |
| Use `--kv-cache-dtype fp8` instead of `fp8_e4m3` | Works, but slight accuracy drift on long context | Use `fp8_e4m3` as specified |
| Mount `eagle_patched.py` for spec decode | `AttributeError: image_token_index` | Gemma4 EAGLE3 not yet supported upstream; omit spec decode for now |
🐛 If you see gibberish output after following all of the above
- Verify the image: `docker inspect <container> | grep Image` should show `vllm-spark-gemma4-nvfp4-awq`
- Verify mounts: `docker inspect <container> --format '{{json .Mounts}}'` should show exactly 3 mounts (model + 2 patches)
- Verify backend selection in logs: `Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM` ✅ and `Using 'MARLIN' NvFp4 MoE backend` ✅
- Test with raw chat: `{"messages": [{"role": "user", "content": "Capital of France? One sentence."}]}` should return "The capital of France is Paris." If not, check the container logs for crashes or `UNEXPECTED` tensor warnings at load time.
Performance Benchmarks
NVIDIA DGX Spark (GB10, SM 12.0, 128 GB unified memory) — vLLM 0.19.1rc1.dev110+gb55d830ec, FP8 E4M3 KV cache, native FlashInfer CUTLASS linear + Marlin MoE backend, --gpu-memory-utilization 0.85.
1. Single-Stream Performance (README-spec config)
--max-num-seqs 4, --max-model-len 65536, --gpu-memory-utilization 0.85. Best for interactive chat, agentic UX, single-user serving. All measurements greedy sampling (temp=0) unless noted.
Decode rate (10 trials, 200 tokens output)
| Statistic | tok/s |
|---|---|
| Median | 51.1 |
| P95 | 51.5 |
| Min | 50.9 |
| Max | 51.5 |
Extremely stable — ±0.5 tok/s variance across 10 trials.
TTFT by prompt length
Time from request to first token, across 5 trials each:
| Prompt class | Prompt tokens | TTFT median | TTFT p95 | TTFT min | Effective prefill |
|---|---|---|---|---|---|
| Tiny | 14 | 56 ms | 58 ms | 55 ms | 250 tok/s |
| Short | 19 | 49 ms | 61 ms | 48 ms | 386 tok/s |
| Medium | 49 | 45 ms | 46 ms | 44 ms | 1,093 tok/s |
| Long | 465 | 47 ms | 47 ms | 45 ms | 9,996 tok/s |
Even 465-token prompts give sub-50ms TTFT — fixed kernel-launch overhead dominates over prefill for anything < ~500 tokens.
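This fixed-overhead behavior can be sanity-checked with a back-of-envelope model (an illustration only; the constants below are rough reads of the tables in this README, not measurements):

```python
# Back-of-envelope TTFT model: a fixed launch/scheduling overhead plus
# prompt_tokens / prefill_rate. Both constants are assumptions taken
# loosely from the benchmark tables above.
FIXED_OVERHEAD_S = 0.045      # ~45 ms kernel-launch + scheduling floor
PREFILL_RATE = 194_000        # tok/s, peak prefill observed at 32K prompts

def estimated_ttft(prompt_tokens: int) -> float:
    """Estimated time-to-first-token in seconds."""
    return FIXED_OVERHEAD_S + prompt_tokens / PREFILL_RATE

for n in (14, 465):
    print(f"{n:>4} prompt tokens -> ~{estimated_ttft(n) * 1000:.0f} ms")
```

With these constants, a 465-token prompt adds only ~2.4 ms of prefill on top of the ~45 ms floor, which is why TTFT barely moves across the prompt classes above.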
Decode rate by output length
Longer outputs are slightly slower due to growing KV cache:
| Max tokens | Actual tokens | TTFT | Decode rate | Total latency |
|---|---|---|---|---|
| 50 | 50 | 49 ms | 51.9 tok/s | 1.01 s |
| 200 | 200 | 50 ms | 50.9 tok/s | 3.98 s |
| 500 | 500 | 49 ms | 50.7 tok/s | 9.90 s |
| 1000 | 558* | 61 ms | 50.6 tok/s | 11.10 s |
*Short because model hit EOS naturally before 1000 tokens.
Sampling: Greedy vs Stochastic
Temperature has negligible performance impact:
| Mode | Decode median | Decode p95 | TTFT median |
|---|---|---|---|
| Greedy (temp=0) | 51.9 tok/s | 52.1 tok/s | 48 ms |
| Stochastic (temp=0.7) | 51.0 tok/s | 51.2 tok/s | 51 ms |
Long-prompt prefill (RAG / document workloads)
Prefill throughput scales impressively with length — MoE's sparse compute is the perfect shape for prefill:
| Target prompt tokens | Actual | TTFT | Prefill rate | Decode rate (after prefill) |
|---|---|---|---|---|
| 1K | 809 | 0.06 s | 14,450 tok/s | 52.1 tok/s |
| 4K | 3,172 | 0.05 s | 66,851 tok/s | 51.9 tok/s |
| 16K | 12,625 | 0.09 s | 139,648 tok/s | 50.6 tok/s |
| 32K | 25,228 | 0.13 s | 194,082 tok/s | 48.3 tok/s |
Decode only drops 7% at 32K context — excellent KV-cache bandwidth behavior. Prefill peaks around 194K tok/s at 32K prompt length.
Summary
| Metric | Value |
|---|---|
| Single-stream decode (200-tok output) | 51.1 tok/s median |
| Short-prompt TTFT | 44-56 ms |
| 16K-prompt TTFT | 90 ms |
| 32K-prompt TTFT | 130 ms |
| Peak prefill throughput | 194K tok/s @ 32K prompt |
| Decode rate with 32K context | 48.3 tok/s (7% drop vs short context) |
This matches and exceeds the original v6 validation (52.6 tok/s / 54 ms TTFT).
2. Concurrent-Session Performance (max-throughput config)
--max-num-seqs 256, --max-model-len 2048, --max-num-batched-tokens 16384, --gpu-memory-utilization 0.85. Best for agent fleets, multi-user serving, batch inference. 3 trials per level with median reported. Mixed prompts (code, math, QA, creative), 200 token output, temp=0.7, SSE streaming.
Throughput scaling (N concurrent clients, 200-tok output)
| Concurrent | Err | Agg tok/s (median of 3) | Per-Req decode p50 | Per-Req decode min | TTFT p50 | TTFT p95 | TTFT max |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 35.9 | 50.8 | 50.8 | 64ms | 64ms | 64ms |
| 2 | 0 | 51.8 | 48.3 | 45.9 | 59ms | 59ms | 59ms |
| 4 | 0 | 88.6 | 39.0 | 36.4 | 70ms | 71ms | 71ms |
| 8 | 0 | 149.1 | 31.7 | 28.8 | 135ms | 135ms | 135ms |
| 16 | 0 | 269.6 | 24.9 | 23.1 | 149ms | 150ms | 150ms |
| 32 | 0 | 422.3 | 20.0 | 18.5 | 194ms | 195ms | 195ms |
| 64 | 0 | 711.5 | 16.6 | 15.6 | 284ms | 285ms | 286ms |
| 128 | 0 | 1,154.0 | 13.8 | 13.2 | 449ms | 545ms | 548ms |
| 256 | 0 | 1,775.9 | 10.7 | 6.5 | 851ms | 863ms | 864ms |
Zero errors across 1,200+ requests in the full test. Aggregate throughput scales nearly linearly up to 128 concurrent, with diminishing returns at 256 as scheduling and KV-cache contention dominate.
Note: single-stream here is 35.9 tok/s (vs 51.1 in README config) because max-num-seqs=256 forces allocation of 50+ CUDA graph sizes and different scheduling heuristics that optimize for batched throughput over single-stream latency. Use README config for chat; use this config for fleets.
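The scaling numbers can be recomputed directly from the aggregate medians in the table above (a sketch; ratios are rounded, so other sections of this card may quote slightly different figures):

```python
# Recompute throughput scaling from the aggregate tok/s medians above.
aggregate = {1: 35.9, 2: 51.8, 4: 88.6, 8: 149.1, 16: 269.6,
             32: 422.3, 64: 711.5, 128: 1154.0, 256: 1775.9}

base = aggregate[1]
for n, agg in aggregate.items():
    gain = agg / base        # speedup vs single stream
    efficiency = gain / n    # fraction of ideal linear scaling
    print(f"{n:>3} clients: {gain:5.1f}x gain, {efficiency:5.1%} of ideal")
```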
TTFT-only scaling (prefill + first token, 1-token output)
Measures how much queue contention affects time-to-first-token — critical for agent UX:
| Concurrent | TTFT p50 | TTFT p95 | TTFT max | TTFT min |
|---|---|---|---|---|
| 1 | 47ms | 48ms | 48ms | 46ms |
| 4 | 48ms | 88ms | 88ms | 46ms |
| 16 | 100ms | 102ms | 102ms | 55ms |
| 64 | 171ms | 174ms | 175ms | 91ms |
| 256 | 522ms | 527ms | 530ms | 167ms |
TTFT stays sub-200ms up through 64 concurrent — smooth UX for small agent fleets. Above 128 concurrent TTFT doubles per level as requests queue for scheduler capacity.
Concurrent with 1K-token prompts (RAG-style workload)
50-token output with 1,024-token prompts — simulates agents doing document QA or retrieval-augmented responses:
| Concurrent | Err | Agg tok/s | TTFT p50 | TTFT p95 | Decode p50 |
|---|---|---|---|---|---|
| 1 | 0 | 42.8 | 55ms | 55ms | 49.3 |
| 4 | 0 | 109.2 | 82ms | 103ms | 38.7 |
| 16 | 0 | 272.2 | 147ms | 147ms | 27.0 |
| 64 | 0 | 711.9 | 261ms | 293ms | 16.9 |
Long-prompt concurrent workloads scale as well as short-prompt ones (prefill is very fast on MoE with 194K tok/s peak throughput).
Summary
| Metric | Value |
|---|---|
| Peak aggregate throughput | 1,776 tok/s @ 256 concurrent (median of 3 trials) |
| Scaling from 1 → 256 | 49.5× throughput (ideal would be 256×) |
| Per-request decode @ 256 | 10.7 tok/s median, 6.5 min |
| Peak server-reported generation | 2,022 tok/s (vLLM engine stats) |
| Peak combined (prompt + gen) | 2,627 tok/s |
| TTFT @ 64 concurrent | 284 ms median (usable) |
| TTFT @ 256 concurrent | 851 ms median (acceptable for batch) |
| Error rate across full test | 0.0% (1,200+ requests) |
| Best concurrency for chat UX | 4-8 (per-request 30-40 tok/s, TTFT <150ms) |
| Best concurrency for throughput | 128-256 (maxes aggregate, TTFT trade-off) |
Key Performance Metrics
| Metric | Value |
|---|---|
| Single-stream decode (README config) | 52.2 tok/s |
| Short-prompt TTFT (README config) | 44 ms |
| Peak aggregate throughput (bench config) | 1,890 tok/s @ 256 concurrent |
| Peak server-reported generation | 2,022 tok/s (vLLM engine stats) |
| Peak combined (prompt + gen) | 2,627 tok/s |
| Model load time | ~4-5 min (weight load + torch.compile + CUDA graphs + FP4 autotune) |
| Model memory footprint | 16.4 GB |
| KV cache capacity | ~700K tokens @ fp8_e4m3 |
| GEMM backend (linear) | FLASHINFER_CUTLASS (native Blackwell FP4 tensor cores) |
| MoE backend | MARLIN (required — FlashInfer MoE variants reject 704-per-expert intermediate) |
| Attention backend | TRITON_ATTN (heterogeneous head dims require Triton) |
| Prefix cache hit rate | ~70-80% (sustained, mixed workload) |
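The ~700K-token KV budget is roughly consistent with a back-of-envelope cache-size calculation (a sketch using the head counts from Model Details; it uses the sliding-window head dim for all layers and a hypothetical ~86 GB cache budget, so treat it as an estimate):

```python
# Rough KV-cache cost per token: 2 (K and V) * kv_heads * head_dim
# * dtype_bytes * num_layers. Ignores the larger global_head_dim on the
# 5 full-attention layers, so it slightly underestimates.
LAYERS, KV_HEADS, HEAD_DIM = 30, 8, 256

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return 2 * KV_HEADS * HEAD_DIM * dtype_bytes * LAYERS

fp8 = kv_bytes_per_token(1)    # fp8_e4m3: 1 byte/element
bf16 = kv_bytes_per_token(2)   # bf16: 2 bytes/element

print(f"fp8  : {fp8 / 1024:.0f} KiB/token")
print(f"bf16 : {bf16 / 1024:.0f} KiB/token")
# Assuming roughly 86 GB left for cache after weights and activations:
print(f"{86e9 / fp8 / 1e3:.0f}K tokens")
```

The factor-of-two between the fp8 and bf16 figures is exactly the "doubles token budget" claim for `--kv-cache-dtype fp8_e4m3`.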
Scaling Efficiency
| Concurrency | Throughput Gain vs 1-req |
|---|---|
| 1 | 1.0x |
| 4 | 2.7x |
| 16 | 7.7x |
| 64 | 19.8x |
| 128 | 32.8x |
| 256 | 50.0x |
Aggregate throughput scales 50x from 1 to 256 concurrent requests — excellent batching efficiency from the MoE architecture. Per-request throughput degrades gracefully from 37.8 tok/s (1-req) to 9.3 tok/s (256-req), still usable for agent workloads with many short-lived subagents.
Why MoE is Fast on DGX Spark
GB10's 273 GB/s memory bandwidth is the bottleneck for LLM decode. MoE dramatically reduces per-token bandwidth demand:
| Model | Params Read/Token | BW Required @ 50 tok/s | Fits GB10? |
|---|---|---|---|
| Dense 27B (BF16) | ~54 GB | 2,700 GB/s | No |
| Dense 27B (NVFP4) | ~13.5 GB | 675 GB/s | No |
| MoE 26B top-8/128 (NVFP4) | ~2.8 GB | 140 GB/s | Yes (51% BW) |
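The bandwidth column follows directly from bytes-read-per-token times decode rate, using the table's own per-token figures and the 273 GB/s GB10 bandwidth from the paragraph above:

```python
# Required memory bandwidth = bytes read per decoded token * tokens/sec.
GB10_BW = 273e9          # GB10 unified-memory bandwidth, bytes/s
TARGET_TOKS = 50         # decode rate used in the table

models = {
    "Dense 27B (BF16)":  54e9,    # bytes read per token
    "Dense 27B (NVFP4)": 13.5e9,
    "MoE 26B (NVFP4)":   2.8e9,
}

for name, bytes_per_tok in models.items():
    need = bytes_per_tok * TARGET_TOKS
    fits = need <= GB10_BW
    print(f"{name}: {need / 1e9:,.0f} GB/s needed, "
          f"fits={fits} ({need / GB10_BW:.0%} of BW)")
```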
Key Specs
| | Original (BF16) | NVFP4 (this model) |
|---|---|---|
| Size on disk | ~49 GB | ~16.4 GB |
| Total parameters | 25.2B | 25.2B |
| Active parameters | 3.8B / token | 3.8B / token |
| Architecture | MoE: 128 experts, 8 active / token | same |
| Context window | 262K tokens | 262K tokens |
| Modalities | Text, Image, Video | Text, Image, Video |
| Quantization | — | NVFP4 (W4A4, block size 16) |
| Vision encoder | BF16 | BF16 (preserved, not quantized) |
Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 MoE (26B total, 3.8B active / token) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Experts | 128 total, top-8 active per token |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 2816 |
| MoE Intermediate | 704 per expert |
| Attention Heads | 16 (8 KV heads), head_dim=256, global_head_dim=512 |
| Vision Encoder | 27-layer ViT (1152 hidden, 16 heads, patch_size=16) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 (ModelOpt 0.43 main + 2 pending PRs) |
Pre-Built Container Image
A pre-built vLLM container compiled for NVIDIA DGX Spark (GB10, SM 12.1) is available with all required patches pre-applied:
```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```
Image contents:
- vLLM 0.19.1rc1 compiled for SM 12.1 (Blackwell GB10)
- PyTorch 2.12.0 + CUDA 13.0
- transformers 5.5.0 + FlashInfer 0.6.7
- Patched `gemma4.py` — extends `expert_params_mapping` to the modelopt suffix set (`weight`, `weight_scale`, `weight_scale_2`, `input_scale`)
- Patched `serving.py` — fixes the non-streaming reasoning parser for Gemma 4
- Patched `modelopt.py` — handles the per-expert-decomposed NVFP4 scale format
- Built from eugr/spark-vllm-docker with the `--tf5` flag

Critical: Use the `-awq` variant of the image. The non-`-awq` image does not include the baked-in modelopt scale-handling patches required for this model's per-expert NVFP4 format.
Quick Start
1. Pull the container
```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```
2. Download the model
```shell
pip install -U huggingface-hub hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 \
  hf download AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
  --local-dir ~/models/supergemma4-26b
```
3. Get the patches
Only two patch files need to be mounted — modelopt.py is baked into the -awq image:
```shell
for f in gemma4_patched.py serving_chat_patched.py; do
  curl -LO https://raw.githubusercontent.com/AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4/main/$f
done
```
4. Launch with Docker Compose
Save as docker-compose.yml:
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-supergemma4-26b
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ~/models/supergemma4-26b:/models/supergemma4
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
    environment:
      # Force Marlin MoE path — native FlashInfer MoE variants reject 704-per-expert intermediate
      - VLLM_TEST_FORCE_FP8_MARLIN=1
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/supergemma4 \
          --served-model-name supergemma4-26b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype fp8_e4m3 \
          --tensor-parallel-size 1 \
          --max-model-len 65536 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --host 0.0.0.0 --port 8000 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Then:
```shell
docker compose up -d
```
Startup takes ~4-5 minutes (weight load + torch.compile + CUDA graph capture + FP4 GEMM autotuning).
Workload-tuned configs
| Workload | max-model-len | max-num-seqs | Best for |
|---|---|---|---|
| Long-context (RAG, docs) | 65536 | 4 | Few long conversations |
| Balanced | 8192 | 32 | Mixed chat + agents |
| Max throughput | 2048 | 256 | Many short agents (1,890 tok/s) |
| Max context | 262144 | 1 | Single-stream, max window |
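As an example, the max-throughput row corresponds to swapping these flags into the serve command above (a fragment, not a full invocation; `--max-num-batched-tokens 16384` is the value used in the benchmark section):

```shell
# Max-throughput variant of the serve command: replace the corresponding
# flags in the compose file above; everything else stays the same.
vllm serve /models/supergemma4 \
  --quantization modelopt \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 2048 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85
```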
5. Test
```shell
# Text
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "supergemma4-26b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 300
  }'
```

```shell
# Vision
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "supergemma4-26b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},
        {"type": "text", "text": "Describe what you see."}
      ]
    }],
    "max_tokens": 300
  }'
```
The API is fully OpenAI-compatible — use it with any OpenAI SDK, LangChain, LiteLLM, or Open WebUI at `http://<your-ip>:8000/v1`.
Key Deployment Flags
| Flag | Purpose |
|---|---|
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required — forces the Marlin MoE path (FlashInfer NVFP4 MoE backends reject the 704 intermediate dim) |
| `--quantization modelopt` | Required — tells vLLM to use the NVIDIA ModelOpt NVFP4 format |
| `--kv-cache-dtype fp8_e4m3` | FP8 KV cache — doubles the token budget vs BF16 |
| `--max-model-len 65536` | 64K context. The model supports 262K; trade context for concurrency |
| `--max-num-seqs 4` | README default. Bump to 256 for max-throughput workloads |
| `--gpu-memory-utilization 0.85` | Leaves 15% headroom; tune for your hardware |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks into `reasoning_content` in the API response |
| `--tool-call-parser gemma4` | Native Gemma 4 function/tool calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks |
| `--enable-prefix-caching` | Caches common system-prompt prefixes |
Quantization Details
| Parameter | Value |
|---|---|
| Tool | NVIDIA ModelOpt 0.43.0rc2.dev (from upstream main) |
| Config | NVFP4_DEFAULT_CFG (plain NVFP4, no AWQ) |
| Weight dtype | NVFP4 (FP4 E2M1, block size 16) |
| Calibration samples | 512 (CNN/DailyMail train split) |
| Calibration seq_len | 4096 tokens |
| Batch size | 3 (VRAM-probed) |
| Calibration hardware | NVIDIA RTX PRO 6000 Blackwell (97 GB VRAM) |
| Calibration wall-clock | 12.75 min (via modelopt-fast-moe adaptive batching) |
| Excluded from quantization | vision_tower, embed_vision, multi_modal_projector, routers (BF16) |
| Exported size | 16.42 GB |
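The exclusion row above amounts to a simple name filter over modules — a hypothetical sketch of the selection logic (not ModelOpt's actual API; module names are illustrative):

```python
# Hypothetical module-name filter mirroring the exclusion list above:
# the vision stack and expert routers stay in BF16, everything else
# is a candidate for NVFP4 quantization.
EXCLUDE_SUBSTRINGS = ("vision_tower", "embed_vision",
                      "multi_modal_projector", "router")

def should_quantize(module_name: str) -> bool:
    """True if this module should be NVFP4-quantized."""
    return not any(s in module_name for s in EXCLUDE_SUBSTRINGS)

# Illustrative (made-up) module names:
assert should_quantize("model.layers.3.mlp.experts.17.gate_proj")
assert not should_quantize("model.layers.3.mlp.router")
assert not should_quantize("vision_tower.encoder.layers.0.attn.qkv")
```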
Why plain NVFP4 instead of NVFP4_AWQ?
Earlier experiments used NVFP4_AWQ_FULL_CFG (AWQ with exhaustive alpha grid search) but ran into a deployment-stack limitation: vLLM's ModelOptNvFp4FusedMoE does not support per-expert pre_quant_scale. On MoE models, AWQ calibration computes a per-expert scaling factor that can't be consumed by the MoE kernel path — any AWQ work on experts is wasted at serve time.
Switching to plain NVFP4 (algorithm=max):
- Cuts calibration time from ~2.5h to ~12 min (no alpha search phase)
- Produces a checkpoint vLLM's FusedMoE loads natively without tensor surgery
- Quality hit is negligible since the AWQ benefit on MoE experts was already unavailable at inference time
Attention and dense shared MLP layers still benefit from NVFP4's per-block scaling. Router weights stay in BF16: routing quality is critical for MoE accuracy, and routers are tiny, so they're cheap to leave unquantized.
Applied modelopt patches
Two upstream PR fixes applied locally (pending review as of this writing):
- PR #1264 — `preprocess_linear_fusion` non-scalar amax fix
- PR #1265 — `get_activation_scaling_factor` zero-amax handling
Both are blockers for anyone quantizing per-expert-decomposed MoEs in NVFP4 with modelopt 0.42 or 0.43. The -awq container image includes these patches in its baked modelopt.py — do not override with a stock version.
fast-moe adaptive batched calibration
Calibration uses modelopt-fast-moe — adaptive VRAM-probed batching that fixes the Python-dispatch bottleneck when quantizing MoE models (the naive `for ids in calib_data: model(ids)` loop leaves GPUs at 25-30% utilization).
End-to-end calibration wall-clock:
| Configuration | Wall-clock |
|---|---|
| Naive bs=1 loop (modelopt default) | ~50h projected (killed at 18h) |
| fast-moe + NVFP4_AWQ_FULL (earlier v3 attempt) | 2h 24min |
| fast-moe + NVFP4_DEFAULT (this v6) | 12 min |
NVFP4 Weight Format
Each quantized layer stores:
- `weight` (uint8) — packed FP4 E2M1 pairs (16-element blocks)
- `weight_scale` (float8_e4m3fn) — per-block scale (1 per 16 elements)
- `weight_scale_2` (float32) — per-tensor global scale (stored in modelopt's reciprocal convention)
- `input_scale` (float32) — static activation scale from calibration
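These tensors put the effective weight footprint at about 4.5 bits per parameter (a sketch; the two per-tensor float32 scalars are negligible):

```python
# Effective storage per NVFP4-quantized parameter:
# 4 bits of packed FP4 plus one FP8 scale shared by each 16-element block.
BLOCK = 16

def bits_per_param() -> float:
    fp4_bits = 4                   # two FP4 values packed per uint8
    scale_bits = 8 / BLOCK         # one float8_e4m3fn scale per block
    return fp4_bits + scale_bits   # per-tensor scalars ignored (O(1))

def tensor_bytes(n_params: int) -> int:
    """Approximate on-disk bytes for one quantized weight tensor."""
    # packed weight + block scales + weight_scale_2 + input_scale
    return n_params // 2 + n_params // BLOCK + 4 + 4

print(bits_per_param())          # 4.5
print(tensor_bytes(1 << 20))     # 589832 bytes (~0.56 MB) for 1M params
```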
Quality Validation
Greedy-sampled responses (temperature=0.0):
| Prompt | Response |
|---|---|
| "What is the capital of France?" | "The capital of France is Paris." |
| "What is 17 * 23?" | "391" ✓ |
| "Write a haiku about the ocean." | "Blue waves kiss the shore, / Endless water, salt and spray, / Deep blue mysteries." |
| "Name three cities in Japan." | "1. Tokyo 2. Osaka 3. Kyoto" |
| "One-line Python prime function" | Valid implementation with correct base cases |
| "Explain photosynthesis in one sentence." | Correct, coherent summary |
Speculative Decoding (DFlash — Coming Soon)
A DFlash block-diffusion drafter paired with this model is in training. DFlash can provide 2-3× additional throughput over the numbers above by predicting multi-token blocks in a single draft forward pass. Will be published as a separate drafter repo once training completes.
Dense (31B) vs MoE (26B) Comparison
| Metric | 31B DECKARD Dense | This Model (26B MoE) |
|---|---|---|
| Active params / token | 31.3B | ~3.8B |
| NVFP4 model size | 20.5 GB | 16.4 GB |
| Single-stream tok/s (Spark) | ~11-14 | 37.8 |
| Peak aggregate (Spark) | — | 1,890 tok/s @ 256 |
| Context window | 262K | 262K |
| Vision | Yes | Yes |
| Best for | Quality-critical tasks | Speed, concurrency, efficiency |
Hardware Requirements
| Tier | GPU | Notes |
|---|---|---|
| Target | NVIDIA DGX Spark (128 GB unified) | Full 262K context, up to 6 concurrent seqs |
| Compatible | RTX 5090 (32 GB) | Reduced context, 1-2 seqs |
| Compatible | B200 / GB200 | Full context, high concurrency |
| Compatible | RTX PRO 6000 Blackwell (97 GB) | Calibration + serving |
| Minimum | Any Blackwell GPU (SM 10.0+) | Required for native FP4 |
Native FP4 hardware (Blackwell architecture) is required — will not run on Ampere or Ada GPUs.
Related Projects
Models
| Model | Type | Size | Link |
|---|---|---|---|
| SuperGemma4 26B NVFP4 (this) | MoE NVFP4 | 16.4 GB | GitHub |
| Gemma-4-26B-A4B-it-Uncensored NVFP4 | MoE NVFP4 (compressed-tensors) | 15.3 GB | HuggingFace |
| Gemma 4 31B DECKARD | Dense NVFP4 AWQ | 20.5 GB | HuggingFace |
| gemma-4-31B-it-speculator.eagle3 NVFP4 | EAGLE3 drafter NVFP4 | 3.5 GB | HuggingFace |
Infrastructure
| Resource | Description | Link |
|---|---|---|
| vLLM AWQ Container | Pre-built for DGX Spark (SM 12.1) with all patches | GHCR |
| Build System | spark-vllm-docker | GitHub |
| modelopt-fast-moe | Adaptive batched calibration | GitHub |
| Base Model | SuperGemma4 26B Abliterated Multimodal (BF16) | HuggingFace |
Disclaimer
THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, you expressly acknowledge that you assume full and sole responsibility for all outputs generated, all actions taken based on outputs, and compliance with applicable laws. The authors are not responsible for any harmful, illegal, or objectionable content produced by the model. These tools serve legitimate purposes including security research, red-teaming, content analysis, and creative work. Implement safeguards appropriate to your use case and jurisdiction.
License
This model inherits the Gemma license from Google.
Credits
Quantized by AEON-7 on NVIDIA Blackwell hardware. Built and validated with AI-engineering assistance from Anthropic.
Shout-out to eugr/spark-vllm-docker for the DGX Spark-optimized vLLM build, NVIDIA for TensorRT-Model-Optimizer, and the z-lab / ModelOpt teams for DFlash.