# MiniMax-M2.7 REAP-172B-A10B AutoRound W4A16
> **⚠ Experimental proof-of-concept — quality caveat**
>
> This checkpoint was produced as a test of the end-to-end REAP → AutoRound → vLLM pipeline on a single NVIDIA GB10 / DGX Spark. Because the calibration machine had limited memory and wall-clock budget, the REAP expert-saliency pass was run with only 64 calibration sequences from `theblackcat102/evol-codealpaca-v1` at sequence length 512. Downstream quality is still being evaluated.
A REAP-pruned and AutoRound W4A16-quantized variant of MiniMax-M2.7. The original 256-expert-per-layer MoE has been reduced to 192 experts per layer (25 % compression) using REAP expert saliency pruning, then quantized to 4-bit weights / 16-bit activations with Intel AutoRound. The result is an ~86 GiB checkpoint that runs comfortably on a single NVIDIA GB10 / DGX Spark (128 GiB unified memory), as well as on any CUDA GPU with vLLM's Marlin / GPTQ-Marlin W4A16 kernel.
- Base model: `MiniMaxAI/MiniMax-M2` family (MiniMaxM2 architecture)
- Pruning: REAP (Cerebras) — 25 % compression → 192/256 experts per layer
- Calibration: `theblackcat102/evol-codealpaca-v1`, 64 samples, seed 42
- Quantization: AutoRound 0.12.2, W4A16, group size 128, symmetric, GPTQ packing format. MoE router gates, embeddings, layer norms, and `lm_head` kept at bf16/fp16.
- Architecture: 62 transformer layers, 192 experts/layer, top-8 routing, hidden size 3072, 48 attention heads, 8 KV heads
- Total parameters: ~172 B (A10B — ~10 B activated per token)
- Disk size: ~86 GiB (23 safetensors shards)
## Quick start
### vLLM (recommended)
vLLM ships its own built-in `minimax_m2` model implementation with FlashInfer attention and the GPTQ-Marlin W4A16 kernel, so this checkpoint runs out of the box. Tested on vLLM 0.17.1.dev0 (container: `scitrera/dgx-spark-vllm:0.17.0-t5`, vLLM 0.17.1.dev0 + transformers 5.3.0 on Blackwell arm64).
```bash
vllm serve MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.80 \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
Example request:
```bash
curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16",
    "messages": [
      {"role": "user", "content": "Write a Python function that returns the two integers in a list whose sum is closest to zero."}
    ],
    "temperature": 0,
    "max_tokens": 512
  }' | jq '.choices[0].message.content'
```
**DGX Spark / GB10 note:** if you serve on a single 128 GiB unified-memory node, leave at least ~20 GiB for the host OS, display, and other services before starting vLLM; otherwise the OOM killer may reclaim the engine process during CUDA graph capture. The flags above were validated in that configuration.
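A back-of-envelope check that the launch flags above fit the single-node budget (the 20 GiB host reserve is the rule of thumb from the note, not a measured requirement):

```python
# Memory budget for a single GB10 / DGX Spark node, using the numbers
# stated in this card. Illustrative arithmetic only.
total_gib = 128            # unified memory on GB10 / DGX Spark
host_reserve_gib = 20      # OS, display, other services (rule of thumb)
gpu_mem_util = 0.80        # --gpu-memory-utilization
weights_gib = 86           # checkpoint size (~= resident weight footprint)

vllm_budget_gib = total_gib * gpu_mem_util           # what vLLM will claim
kv_and_overhead_gib = vllm_budget_gib - weights_gib  # KV cache, CUDA graphs
host_left_gib = total_gib - vllm_budget_gib          # left for the host

print(vllm_budget_gib, round(kv_and_overhead_gib, 1), host_left_gib)
# 102.4 16.4 25.6
```

So vLLM claims ~102 GiB, leaving ~16 GiB for KV cache and graph capture and ~26 GiB for the host, which satisfies the ~20 GiB reserve.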
### HuggingFace Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map={"": "cuda:0"},
)
model.config.use_cache = False  # pure-HF path needs this; vLLM is unaffected

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
`trust_remote_code=True` is required because the checkpoint ships its own `modeling_minimax_m2.py` that uses the per-expert `w1`/`w2`/`w3` layout. The bundled modeling file includes a small compat shim so it works on both transformers 4.55+ and transformers 5.x.
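For intuition, the per-expert `w1`/`w2`/`w3` layout mentioned above corresponds to a gated (SwiGLU-style) feed-forward per expert. The sketch below is illustrative only — shapes and function names are assumptions, not the checkpoint's actual module names:

```python
import torch
import torch.nn.functional as F

def expert_ffn(x, w1, w2, w3):
    """One expert's feed-forward pass in the w1/w2/w3 layout:
    w1 = gate projection, w3 = up projection, w2 = down projection."""
    return (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T

# Toy shapes (hidden=8, intermediate=16); the real model uses hidden size 3072.
hidden, inter = 8, 16
x = torch.randn(2, hidden)
w1 = torch.randn(inter, hidden)   # gate
w3 = torch.randn(inter, hidden)   # up
w2 = torch.randn(hidden, inter)   # down
out = expert_ffn(x, w1, w2, w3)
print(out.shape)  # torch.Size([2, 8])
```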
## Benchmarks (llama-benchy, vLLM 0.17.1 on GB10)
Measured with `uvx llama-benchy --latency-mode generation --skip-coherence` against the vLLM server running with the launch flags above (`--kv-cache-dtype fp8`, `--max-model-len 32768`). Single request, batch size 1, prefill length 2048 tokens at each reported depth, 32 decode tokens. Values are mean ± stddev over 3 runs.
| depth | prefill tok/s | decode tok/s | TTFT (ms) |
|---|---|---|---|
| 0 | 2469.3 ± 13.3 | 29.28 ± 0.05 | 864.5 |
| 4096 | 2089.9 ± 12.5 | 27.73 ± 0.05 | 2784.8 |
| 8192 | 1890.3 ± 5.2 | 26.28 ± 0.05 | 5062.3 |
| 16384 | 1601.1 ± 6.5 | 23.88 ± 0.05 | 10647.7 |
Decode throughput degrades by only ~18 % across a 16 K-token prefix, suggesting the pruned MoE routing remains stable at longer context lengths under this quantization format.
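The ~18 % figure follows directly from the table's decode column:

```python
# Relative decode-throughput drop from depth 0 to depth 16384,
# using the mean values from the table above.
decode_at_0 = 29.28      # tok/s at depth 0
decode_at_16k = 23.88    # tok/s at depth 16384
drop = (decode_at_0 - decode_at_16k) / decode_at_0
print(f"{drop:.1%}")  # 18.4%
```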
## Pruning methodology (REAP)
Starting from the full 256-experts-per-layer MoE, we ran REAP over all 62 layers using 64 calibration sequences from `theblackcat102/evol-codealpaca-v1` (seed 42, max sequence length 512) to collect per-expert activation saliency, then dropped the lowest-saliency 25 % of experts in each layer (64 per layer × 62 layers = 3,968 experts removed), leaving 192 experts per layer. Router gates were re-projected to the reduced expert index space. No further fine-tuning was applied.
REAP reference: see `cerebras/MiniMax-M2-REAP-172B-A10B` and the REAP paper from Cerebras Research.
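The selection and router re-projection steps can be sketched as follows. This is an illustrative stand-in, not Cerebras' REAP implementation — the saliency statistic here is random, and all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n_experts, keep_ratio, hidden = 256, 0.75, 64   # toy hidden size

# Stand-in for per-expert saliency accumulated over calibration tokens.
saliency = rng.random(n_experts)

# Keep the highest-saliency 75 % of experts (192 of 256), preserving order.
n_keep = int(n_experts * keep_ratio)
keep_idx = np.sort(np.argsort(saliency)[-n_keep:])

# Router re-projection: keep only the gate rows of surviving experts, so
# router logits index the reduced expert space directly.
router_weight = rng.standard_normal((n_experts, hidden))  # toy gate matrix
pruned_router = router_weight[keep_idx]

print(n_keep, pruned_router.shape)  # 192 (192, 64)
```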
## Quantization methodology (AutoRound)
The pruned bf16 model was quantized to W4A16 with Intel AutoRound 0.12.2 in OPT-RTN mode (`iters=0`, no AdaRound search), `group_size=128`, symmetric, GPTQ packing format. The MoE router gates for all 62 layers, embeddings, norms, and `lm_head` were kept at fp16/bf16 via the `extra_config` exemption list, so only the FFN expert projections and attention q/k/v/o projections are int4. Under vLLM this dispatches to the `GPTQMarlinLinearMethod` + `MarlinLinearKernel` path automatically.
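A minimal round-to-nearest sketch of the W4 symmetric group-quantization scheme described above (`group_size=128`, per-group scale). AutoRound's actual scale storage and GPTQ packing differ; this only shows the numerics:

```python
import numpy as np

def quantize_w4_sym(w, group_size=128):
    """RTN-quantize a (rows, cols) weight to symmetric int4 per group.
    Returns integer codes and per-group scales."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0   # symmetric int4 range
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_w4_sym(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.shape, s.shape)  # (4, 2, 128) (4, 2, 1)
```

Per-element reconstruction error is bounded by half a scale step, which is the baseline AutoRound's AdaRound search would improve on when `iters > 0`.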
## Files in this repo
```text
config.json                   # model config (backend=auto for vLLM)
generation_config.json
quantization_config.json      # AutoRound W4A16 sidecar
configuration_minimax_m2.py   # custom config class
modeling_minimax_m2.py        # modeling file with tf5 ROPE compat shim
tokenizer.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
vocab.json
merges.txt
chat_template.jinja
model.safetensors.index.json
model-0000{1..23}-of-00023.safetensors
```
## License
This derivative inherits the MiniMax-M2.7 Non-Commercial License from
the upstream MiniMaxAI/MiniMax-M2.7.
See `LICENSE` for the full text. Key points:
- Non-commercial use is free for personal, research, academic, and non-profit purposes, including self-hosted deployment, experimentation, and educational use.
- Commercial use requires prior written authorization from MiniMax — contact `api@minimax.io` with the subject line "M2.7 licensing".
- Derivative works (including REAP-pruned and quantized variants such as this one) are covered by the same non-commercial terms.
- If you obtain commercial authorization and deploy this model, you must prominently display "Built with MiniMax M2.7" on the related website, user interface, blog post, about page, or product documentation.
- Prohibited uses include: illegal content, military applications, harming minors, generating harmful misinformation, and promoting discrimination or hate speech. See the "Appendix: Prohibited Uses" section of `LICENSE` for the full list.
## Credits
- MiniMaxAI for the MiniMax-M2 base model and architecture
- Cerebras Research for the REAP expert pruning methodology
- Intel Neural Compressor team for AutoRound
- scitrera/dgx-spark-vllm for the prebuilt vLLM arm64 container that made serving this model on GB10 trivially reproducible
- vLLM project for the Marlin W4A16 and FlashInfer attention kernels
## Citation
If you use this model, please cite the base MiniMax-M2 release, the REAP paper, and AutoRound.