Instructions for using MiniMaxAI/MiniMax-M3 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use MiniMaxAI/MiniMax-M3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="MiniMaxAI/MiniMax-M3", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-M3", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("MiniMaxAI/MiniMax-M3", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use MiniMaxAI/MiniMax-M3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MiniMaxAI/MiniMax-M3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
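The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the default base URL assumes the vLLM server above is listening on port 8000 (the SGLang examples use port 30000 instead).

```python
import json
import urllib.request


def build_chat_request(prompt, image_url, model="MiniMaxAI/MiniMax-M3"):
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the OpenAI-compatible chat completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
# payload = build_chat_request(
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```

Keeping the payload builder separate from the network call makes it possible to inspect or log the request body without a server running.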

Use Docker

docker model run hf.co/MiniMaxAI/MiniMax-M3

SGLang

How to use MiniMaxAI/MiniMax-M3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MiniMaxAI/MiniMax-M3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MiniMaxAI/MiniMax-M3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use MiniMaxAI/MiniMax-M3 with Docker Model Runner:
```
docker model run hf.co/MiniMaxAI/MiniMax-M3
```

427B params? This is not intelligence, it's brute force.

by Nerdsking - opened 4 days ago

Discussion

Nerdsking

4 days ago • edited 4 days ago

Why not 1T already? 2T? It is DOUBLE the size of the previous model. People from Stepfun delivered a REAL successor with Step 3.7. Same size, better model.
Gitgud boys.

zxcepsycho

4 days ago

I like how he literally swears because a FREE model delivered to him for FREE does not satisfy his 10-year-old low-VRAM hardware.

Nerdsking

3 days ago • edited 3 days ago

427 BILLION parameters? And that's "MY" fault? Well, incompetence always finds justification (and the typical brainless user simps to support it...). Especially when Qwen 3.6, with a mere 27B, is able to do almost as well or better in many aspects... What is clear to me is that this is a deliberate move to make the "local" model as non-local as possible, to force users to BUY the online version. The bad news is that there is COMPETITION. So no, I will not be spending more and more to accommodate the incompetence of others, or be forced by a marketing strategy. I will simply shift to a better model, able to do the same or better while costing less. And I already did: now I am using Step3.7, a model that REALLY evolved from the previous 3.5 version.

zxcepsycho

3 days ago • edited 3 days ago

> 427 BILLION parameters? And that's "MY" fault? Well, incompetence always finds justification (and the typical brainless user simps to support it...). Especially when Qwen 3.6, with a mere 27B, is able to do almost as well or better in many aspects...

ig for coding it might perform well, but for other stuff bruh, no, even Gemma is much smarter in terms of knowledge..

> What is clear to me is that this is a deliberate move to make the "local" model as non-local as possible, to force users to BUY the online version.

Why not a local server?

> The bad news is that there is COMPETITION. So no, I will not be spending more and more to accommodate the incompetence of others, or be forced by a marketing strategy. I will simply shift to a better model, able to do the same or better while costing less. And I already did: now I am using Step3.7, a model that REALLY evolved from the previous 3.5 version.

wdym, are you a local service customer or something? I don't exactly understand. You expect a model you can indeed run locally on consumer hardware to perform well, when that small a param count can barely fit this much data outside of benchmarks....

Ont

3 days ago

A variant of MiniMax M3 closer to the 229 billion parameter count of MiniMax M2.7 and M2.5 is something I'd very much like to see.
Step-3.7-Flash remains at 201 billion parameters, but isn't as capable as MiniMax M3.

Yet compare MiniMax M3 with 428 billion parameters to GLM-5.1 with 754 billion parameters.

Thank you MiniMax M3 developers for this release!

Jahaz

3 days ago

lol, can someone give me an example of how MiniMax M3 beats Qwen 3.6 27B?

ianncity

2 days ago

"your opensource model you spent tens of thousands training isn't to my specifications"

paragon-of-brah

1 day ago

> 427 BILLION parameters? And that's "MY" fault? Well, incompetence always finds justification (and the typical brainless user simps to support it...). Especially when Qwen 3.6, with a mere 27B, is able to do almost as well or better in many aspects... What is clear to me is that this is a deliberate move to make the "local" model as non-local as possible, to force users to BUY the online version. The bad news is that there is COMPETITION. So no, I will not be spending more and more to accommodate the incompetence of others, or be forced by a marketing strategy. I will simply shift to a better model, able to do the same or better while costing less. And I already did: now I am using Step3.7, a model that REALLY evolved from the previous 3.5 version.

Are you angry because there is a model on the internet that you cannot run?

Do you also get angry at the ice cream shop for selling flavours you don't like? 🫠

Let the bros cook. This model beats DeepSeek V4 Pro (1.6T) with about a quarter of the parameters while having vision. It's an outstanding model. Cheers to the MiniMax team!

nawoalanor

1 day ago • edited 1 day ago

Not going to look a gift horse in the mouth but MiniMax has been picking really inconvenient sizes... M2.7 needed to be just a few percent smaller to fit in DGX Spark in FP8 with full context length, and now M3 needed to be just a few percent smaller to fit in NVFP4 with full context length. Though, to be fair, M3 would be too slow regardless.

I imagine these sizes are being chosen for a specific reason but I can't help being a bit annoyed. Targeting 128GB / 256GB / 512GB seems like the most sensible way to go but I don't have billions of dollars of servers to optimize for so what do I know...

paragon-of-brah

1 day ago

> Not going to look a gift horse in the mouth but MiniMax has been picking really inconvenient sizes... M2.7 needed to be just a few percent smaller to fit in DGX Spark in FP8 with full context length, and now M3 needed to be just a few percent smaller to fit in NVFP4 with full context length. Though, to be fair, M3 would be too slow regardless.
>
> I imagine these sizes are being chosen for a specific reason but I can't help being a bit annoyed. Targeting 128GB / 256GB / 512GB seems like the most sensible way to go but I don't have billions of dollars of servers to optimize for so what do I know...

There are more sizes other than FP8 and Q4 tho? You can use llama.cpp, ik_llama.cpp, exllama.. in all those cases you can choose whatever quant size you wish. Q6 for MiniMax M2.7, Q3 for MiniMax M3.. you can also choose to keep the attention layers in Q8 and the MoE tensors in Q4 to achieve near-lossless quantization.

Exllama even allows fractional quant sizes, like 3.3bpw.

You have a setup problem. The problem isn't the model.
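The mixed-precision idea in this reply (attention layers in Q8, MoE tensors in Q4) comes down to a parameter-weighted average of bits per weight. A minimal sketch of that arithmetic follows. The 5% attention share is a made-up illustrative split, not a figure from the model card, and the bit widths are nominal: real quant formats usually store extra scale data, so actual files run somewhat larger.

```python
def effective_bpw(groups):
    """Parameter-weighted average bits per weight.

    groups: list of (parameter_count, bits_per_weight) pairs.
    """
    total_params = sum(count for count, _ in groups)
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / total_params


def weights_size_gb(total_params, bpw):
    """Approximate size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return total_params * bpw / 8 / 1e9


TOTAL_PARAMS = 427e9      # parameter count quoted in the thread title
ATTENTION_SHARE = 0.05    # ASSUMPTION: hypothetical split, for illustration only

mixed = effective_bpw([
    (TOTAL_PARAMS * ATTENTION_SHARE, 8),        # attention layers at nominal 8 bits
    (TOTAL_PARAMS * (1 - ATTENTION_SHARE), 4),  # MoE tensors at nominal 4 bits
])
print(f"mixed Q8/Q4: {mixed:.2f} bpw, about {weights_size_gb(TOTAL_PARAMS, mixed):.0f} GB")
print(f"3.3 bpw:     about {weights_size_gb(TOTAL_PARAMS, 3.3):.0f} GB")
```

These figures cover weights only. KV cache and activations need additional memory, which is why context length matters in the sizing complaint quoted above.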

Nerdsking

1 day ago

> > Not going to look a gift horse in the mouth but MiniMax has been picking really inconvenient sizes... M2.7 needed to be just a few percent smaller to fit in DGX Spark in FP8 with full context length, and now M3 needed to be just a few percent smaller to fit in NVFP4 with full context length. Though, to be fair, M3 would be too slow regardless.
> >
> > I imagine these sizes are being chosen for a specific reason but I can't help being a bit annoyed. Targeting 128GB / 256GB / 512GB seems like the most sensible way to go but I don't have billions of dollars of servers to optimize for so what do I know...
>
> There are more sizes other than FP8 and Q4 tho? You can use llama.cpp, ik_llama.cpp, exllama.. in all those cases you can choose whatever quant size you wish. Q6 for MiniMax M2.7, Q3 for MiniMax M3.. you can also choose to keep the attention layers in Q8 and the MoE tensors in Q4 to achieve near-lossless quantization.
>
> Exllama even allows fractional quant sizes, like 3.3bpw.
>
> You have a setup problem. The problem isn't the model.

Imagine the guy delivering the "new model" of the car you already have, but double the size of the last one... And when you complain, the "smarty pants" argues "hey... it's a garage problem, not the car..."... LMAO!
But ok, I understand, not everyone has mastered logic...

But thank you for being entertaining. You should work as a comedian (whatever you do presently). The laughable quotes seem to come naturally from your brain...

Nerdsking changed discussion status to closed 1 day ago

Nerdsking changed discussion status to open 1 day ago

paragon-of-brah

1 day ago

> Imagine the guy delivering the "new model" of the car you already have, but double the size of the last one... And when you complain, the "smarty pants" argues "hey... it's a garage problem, not the car..."... LMAO!
> But ok, I understand, not everyone has mastered logic...
>
> But thank you for being entertaining. You should work as a comedian (whatever you do presently). The laughable quotes seem to come naturally from your brain...

You should probably have a shower. Relax. This isn't Twitter. Behave.

KarlKlaussen

about 6 hours ago • edited about 6 hours ago

Yeah it is free, but getting something for free without being able to use it is basically pointless. 99.9% of people won't be able to run it. It is like giving away a free shower to a guy without water.

They should start splitting the training data so we get topic-related models like
MiniMax-Coding
MiniMax-Physics
MiniMax-Biology
and so on. This would make more sense in my opinion.

Right now it is probably more reasonable for most people to get a small Qwen model and refine it with their own data or use RAG.

KarlKlaussen

about 4 hours ago

> You have a setup problem. The problem isn't the model.

MiniMax 2.7: 230 GB
MiniMax 3: 850 GB
When the model is more than 3x the size just to improve by like 20% or whatever, then the model is the problem. This whole concept feels like a dead end. Will we need 2.5 TB of RAM for the next 20% improvement? It is probably better to use MM2.7 than some Q3 version of MM3. Would really like to see some benchmarks for this.

paragon-of-brah

about 3 hours ago

> > You have a setup problem. The problem isn't the model.
>
> MiniMax 2.7: 230 GB
> MiniMax 3: 850 GB
> When the model is more than 3x the size just to improve by like 20% or whatever, then the model is the problem. This whole concept feels like a dead end. Will we need 2.5 TB of RAM for the next 20% improvement? It is probably better to use MM2.7 than some Q3 version of MM3. Would really like to see some benchmarks for this.

MiniMax 2.7 was a 230B model, so at 16-bit it was around 460 GB, meaning M3 is less than 2x the size. Also, M3 will run quantized, with close to no loss, on a system with 192/256GB of RAM + a 24/32GB GPU. I get around 15 t/s on a similar model at 250k context (9950x3d2, 256GB DDR5, 5090). Yeah, RAM prices are ridiculous, but that's not on the MiniMax team. This model runs on a strong desktop PC no problem.

Also, what does "improve by 20%" even mean? That it scores 20% more in benchmarks? Is a score of 100% only 20% more intelligent than a score of 80%? The "not enough improvement" point also seems moot.

Listen, I know what you're doing. You want to peer pressure companies into pandering to your wants. It won't work, it's extremely rude to the devs, it looks really bad, it makes the community toxic, and it's extremely self-centred. You have plenty of models out there that you can use on your GPU (Qwen 3.6 27B, Nex N2 mini, etc). Use those.

Again, this is like insulting the Ferrari team because they don't make cheap Japanese kei cars. It's an ugly look. Please stop. Let's keep the community honest and respectful towards the people pushing the tech forward.
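The size disagreement in this exchange can be checked with simple arithmetic: weights take roughly parameter count times bytes per parameter. The sketch below uses the parameter counts quoted in the thread (230B and 427B). Reading the "230 GB" figure as an 8-bit size and the "850 GB" figure as a 16-bit size is an assumption, and the `checkpoint_gb` helper and its precision table are hypothetical, with nominal byte widths that ignore format overhead.

```python
# Nominal storage cost per parameter; real formats add some metadata on top.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4-bit": 0.5}


def checkpoint_gb(params_billion, precision):
    """Approximate weight size in GB for a model with params_billion parameters."""
    return params_billion * BYTES_PER_PARAM[precision]


M27_PARAMS_B = 230  # as quoted in the thread
M3_PARAMS_B = 427   # as quoted in the thread title

m27_bf16 = checkpoint_gb(M27_PARAMS_B, "bf16")  # 460.0
m3_bf16 = checkpoint_gb(M3_PARAMS_B, "bf16")    # 854.0
m27_fp8 = checkpoint_gb(M27_PARAMS_B, "fp8")    # 230.0

# Like-for-like comparison: both models at the same precision.
print(f"bf16 vs bf16: {m3_bf16 / m27_bf16:.2f}x")
# Mixed comparison (ASSUMED reading of the 230 GB vs 850 GB figures).
print(f"bf16 vs fp8:  {m3_bf16 / m27_fp8:.2f}x")
```

At matched precision the ratio stays below 2x, and it only exceeds 3x when an 8-bit size is set against a 16-bit one.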
