# MiniMax-M2.7 REAP-172B-A10B AutoRound W4A16
> **⚠ Experimental proof-of-concept — quality caveat**
>
> This checkpoint was produced as a test of the end-to-end REAP → AutoRound → vLLM pipeline on a single NVIDIA GB10 / DGX Spark. Because the calibration machine had limited memory and wall-clock budget, the REAP expert-saliency pass was run with only 64 calibration sequences from `theblackcat102/evol-codealpaca-v1` at sequence length 512. Downstream quality is still being evaluated.
A REAP-pruned and AutoRound W4A16-quantized variant of MiniMax-M2.7. The original 256-expert-per-layer MoE has been reduced to 192 experts per layer (25 % compression) using REAP expert saliency pruning, then quantized to 4-bit weights / 16-bit activations with Intel AutoRound. The result is an ~86 GiB checkpoint that runs comfortably on a single NVIDIA GB10 / DGX Spark (128 GiB unified memory), as well as on any CUDA GPU with vLLM's Marlin / GPTQ-Marlin W4A16 kernel.
- Base model: `MiniMaxAI/MiniMax-M2` family (MiniMaxM2 architecture)
- Pruning: REAP (Cerebras) — 25 % compression → 192/256 experts per layer
- Calibration: `theblackcat102/evol-codealpaca-v1`, 64 samples, seed 42
- Quantization: AutoRound 0.12.2, W4A16, group size 128, symmetric, GPTQ packing format. MoE router gates, embeddings, layer norms, and `lm_head` kept at bf16/fp16.
- Architecture: 62 transformer layers, 192 experts/layer, top-8 routing, hidden size 3072, 48 attention heads, 8 KV heads
- Total parameters: ~172 B (A10B — ~10 B activated per token)
- Disk size: ~86 GiB (23 safetensors shards)
## Quick start
### vLLM (recommended)
vLLM ships its own built-in `minimax_m2` model implementation with FlashInfer attention and the GPTQ-Marlin W4A16 kernel, so this checkpoint runs out of the box. Tested on vLLM 0.17.1.dev0 (container: `scitrera/dgx-spark-vllm:0.17.0-t5`, vLLM 0.17.1.dev0 + transformers 5.3.0 on Blackwell arm64).
```bash
vllm serve MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.80 \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
Example request:
```bash
curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16",
    "messages": [
      {"role": "user", "content": "Write a Python function that returns the two integers in a list whose sum is closest to zero."}
    ],
    "temperature": 0,
    "max_tokens": 512
  }' | jq '.choices[0].message.content'
```
**DGX Spark / GB10 note:** if you serve on a single 128 GiB unified-memory node, leave at least ~20 GiB for the host OS, display, and other services before starting vLLM; otherwise the OOM killer may reclaim the engine process during CUDA graph capture. The flags above were validated in that configuration.
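A back-of-envelope check that the launch flags above fit the single-node budget (the 20 GiB host reserve is the rule of thumb from the note, not a measured requirement):

```python
# Memory budget for a single GB10 / DGX Spark node, using the numbers
# stated in this card. Illustrative arithmetic only.
total_gib = 128            # unified memory on GB10 / DGX Spark
host_reserve_gib = 20      # OS, display, other services (rule of thumb)
gpu_mem_util = 0.80        # --gpu-memory-utilization
weights_gib = 86           # checkpoint size (~= resident weight footprint)

vllm_budget_gib = total_gib * gpu_mem_util           # what vLLM will claim
kv_and_overhead_gib = vllm_budget_gib - weights_gib  # KV cache, CUDA graphs
host_left_gib = total_gib - vllm_budget_gib          # left for the host

print(vllm_budget_gib, round(kv_and_overhead_gib, 1), host_left_gib)
# 102.4 16.4 25.6
```

So vLLM claims ~102 GiB, leaving ~16 GiB for KV cache and graph capture and ~26 GiB for the host, which satisfies the ~20 GiB reserve.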
### HuggingFace Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map={"": "cuda:0"},
)
model.config.use_cache = False  # pure-HF path needs this; vLLM is unaffected

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
`trust_remote_code=True` is required because the checkpoint ships its own `modeling_minimax_m2.py` that uses the per-expert `w1`/`w2`/`w3` layout. The bundled modeling file includes a small compat shim so it works on both transformers 4.55+ and transformers 5.x.
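For intuition, the per-expert `w1`/`w2`/`w3` layout mentioned above corresponds to a gated (SwiGLU-style) feed-forward per expert. The sketch below is illustrative only — shapes and function names are assumptions, not the checkpoint's actual module names:

```python
import torch
import torch.nn.functional as F

def expert_ffn(x, w1, w2, w3):
    """One expert's feed-forward pass in the w1/w2/w3 layout:
    w1 = gate projection, w3 = up projection, w2 = down projection."""
    return (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T

# Toy shapes (hidden=8, intermediate=16); the real model uses hidden size 3072.
hidden, inter = 8, 16
x = torch.randn(2, hidden)
w1 = torch.randn(inter, hidden)   # gate
w3 = torch.randn(inter, hidden)   # up
w2 = torch.randn(hidden, inter)   # down
out = expert_ffn(x, w1, w2, w3)
print(out.shape)  # torch.Size([2, 8])
```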
## Benchmarks (llama-benchy, vLLM 0.17.1 on GB10)
Measured with `uvx llama-benchy --latency-mode generation --skip-coherence` against the vLLM server running with the launch flags above (`--kv-cache-dtype fp8`, `--max-model-len 32768`). Single request, batch size 1, prefill length 2048 tokens at each reported depth, 32 decode tokens. Values are mean ± stddev over 3 runs.
| depth | prefill tok/s | decode tok/s | TTFT (ms) |
|---|---|---|---|
| 0 | 2469.3 ± 13.3 | 29.28 ± 0.05 | 864.5 |
| 4096 | 2089.9 ± 12.5 | 27.73 ± 0.05 | 2784.8 |
| 8192 | 1890.3 ± 5.2 | 26.28 ± 0.05 | 5062.3 |
| 16384 | 1601.1 ± 6.5 | 23.88 ± 0.05 | 10647.7 |
Decode throughput degrades by only ~18 % across a 16 K-token prefix, suggesting the pruned MoE routing remains stable at longer context lengths under this quantization format.
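The ~18 % figure follows directly from the table's decode column:

```python
# Relative decode-throughput drop from depth 0 to depth 16384,
# using the mean values from the table above.
decode_at_0 = 29.28      # tok/s at depth 0
decode_at_16k = 23.88    # tok/s at depth 16384
drop = (decode_at_0 - decode_at_16k) / decode_at_0
print(f"{drop:.1%}")  # 18.4%
```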
## Pruning methodology (REAP)
Starting from the full 256-experts-per-layer MoE, we ran REAP over all 62 layers using 64 calibration sequences from `theblackcat102/evol-codealpaca-v1` (seed 42, max sequence length 512) to collect per-expert activation saliency, then dropped the lowest-saliency 25 % of experts in each layer (64 per layer × 62 layers = 3,968 experts removed), leaving 192 experts per layer. Router gates were re-projected to the reduced expert index space. No further fine-tuning was applied.
REAP reference: see `cerebras/MiniMax-M2-REAP-172B-A10B` and the REAP paper from Cerebras Research.
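The selection and router re-projection steps can be sketched as follows. This is an illustrative stand-in, not Cerebras' REAP implementation — the saliency statistic here is random, and all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n_experts, keep_ratio, hidden = 256, 0.75, 64   # toy hidden size

# Stand-in for per-expert saliency accumulated over calibration tokens.
saliency = rng.random(n_experts)

# Keep the highest-saliency 75 % of experts (192 of 256), preserving order.
n_keep = int(n_experts * keep_ratio)
keep_idx = np.sort(np.argsort(saliency)[-n_keep:])

# Router re-projection: keep only the gate rows of surviving experts, so
# router logits index the reduced expert space directly.
router_weight = rng.standard_normal((n_experts, hidden))  # toy gate matrix
pruned_router = router_weight[keep_idx]

print(n_keep, pruned_router.shape)  # 192 (192, 64)
```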
## Quantization methodology (AutoRound)
The pruned bf16 model was quantized to W4A16 with Intel AutoRound 0.12.2 in OPT-RTN mode (`iters=0`, no AdaRound search), `group_size=128`, symmetric, GPTQ packing format. The MoE router gates for all 62 layers, embeddings, norms, and `lm_head` were kept at fp16/bf16 via the `extra_config` exemption list, so only the FFN expert projections and attention q/k/v/o projections are int4. Under vLLM this dispatches to the `GPTQMarlinLinearMethod` + `MarlinLinearKernel` path automatically.
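A minimal round-to-nearest sketch of the W4 symmetric group-quantization scheme described above (`group_size=128`, per-group scale). AutoRound's actual scale storage and GPTQ packing differ; this only shows the numerics:

```python
import numpy as np

def quantize_w4_sym(w, group_size=128):
    """RTN-quantize a (rows, cols) weight to symmetric int4 per group.
    Returns integer codes and per-group scales."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0   # symmetric int4 range
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_w4_sym(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.shape, s.shape)  # (4, 2, 128) (4, 2, 1)
```

Per-element reconstruction error is bounded by half a scale step, which is the baseline AutoRound's AdaRound search would improve on when `iters > 0`.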
## Files in this repo
```text
config.json                   # model config (backend=auto for vLLM)
generation_config.json
quantization_config.json      # AutoRound W4A16 sidecar
configuration_minimax_m2.py   # custom config class
modeling_minimax_m2.py        # modeling file with tf5 ROPE compat shim
tokenizer.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
vocab.json
merges.txt
chat_template.jinja
model.safetensors.index.json
model-0000{1..23}-of-00023.safetensors
```
## License
This derivative inherits the MiniMax-M2.7 Non-Commercial License from
the upstream MiniMaxAI/MiniMax-M2.7.
See `LICENSE` for the full text. Key points:
- Non-commercial use is free for personal, research, academic, and non-profit purposes, including self-hosted deployment, experimentation, and educational use.
- Commercial use requires prior written authorization from MiniMax — contact `api@minimax.io` with the subject line "M2.7 licensing".
- Derivative works (including REAP-pruned and quantized variants such as this one) are covered by the same non-commercial terms.
- If you obtain commercial authorization and deploy this model, you must prominently display "Built with MiniMax M2.7" on the related website, user interface, blog post, about page, or product documentation.
- Prohibited uses include: illegal content, military applications, harming minors, generating harmful misinformation, and promoting discrimination or hate speech. See the "Appendix: Prohibited Uses" section of `LICENSE` for the full list.
## Credits
- MiniMaxAI for the MiniMax-M2 base model and architecture
- Cerebras Research for the REAP expert pruning methodology
- Intel Neural Compressor team for AutoRound
- scitrera/dgx-spark-vllm for the prebuilt vLLM arm64 container that made serving this model on GB10 trivially reproducible
- vLLM project for the Marlin W4A16 and FlashInfer attention kernels
## Citation
If you use this model, please cite the base MiniMax-M2 release, the REAP paper, and AutoRound.