Muse-12B-NVFP4-FP8
Quantized weights of the Muse-12B model for use with NVIDIA Blackwell GPUs, in a hybrid format that uses NVFP4 with Four Over Six adaptive block scaling for the MLP layers and FP8_DYNAMIC for the self-attention layers.
More information about the hybrid format is available here. The short version is that FP8 attention has minimal impact on speed and VRAM usage while making a marked difference in output quality, especially at longer context lengths, and Four Over Six is effectively a free accuracy boost for the NVFP4 parts.
- Paper: Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
- Official Code: GitHub - mit-han-lab/fouroversix
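To give a rough feel for what "adaptive block scaling" means, here is a simplified sketch of the idea as I understand it: each block is quantized twice, once with its largest magnitude mapped to 6 (the FP4 maximum) and once mapped to 4, and whichever reconstruction has the lower error is kept. This is an illustration only, not the official implementation. It ignores the FP8 block scales and the per-tensor scale that real NVFP4 uses, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 floating-point value.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max,
    round every value to the nearest FP4 magnitude, then scale back."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / target_max  # kept in full precision here (assumption)
    out = []
    for x in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


def squared_error(block, approx):
    """Sum of squared differences between the original and reconstructed block."""
    return sum((a - b) ** 2 for a, b in zip(block, approx))


def choose_target(block):
    """Return (target_max, reconstructed block) with the lower error; ties go to 6."""
    candidates = [(t, quantize_block(block, t)) for t in (6.0, 4.0)]
    return min(candidates, key=lambda c: squared_error(block, c[1]))


# A 16-value block where scaling to 4 reconstructs the second value more closely.
block = [6.0, 5.0] + [0.0] * 14
target, approx = choose_target(block)
print(target, approx[:2])
```

In this toy block, scaling to 6 forces the value 5.0 onto 4 or 6, while scaling to 4 lets it land on 4.5, which is why the second option wins.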
Inference
Tested on an RTX 5060 Ti 16GB with Aphrodite Engine and vLLM. The checkpoint requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine 0.10.0 or an older version of vLLM. On my system, Aphrodite Engine 0.10.0 was able to run the checkpoint with a 32k context window using the --single-user-mode flag. vLLM 0.20.0 and Aphrodite Engine 0.20.0, which don't have that flag, were able to do the same with the --max-num-seqs 1 --cudagraph-capture-sizes 2 flags, with the caveat that each crashed with an OOM error the first time it ran the model but ran fine from the second run onwards.
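As a rough sketch, the vLLM invocation described above could be assembled like this. The --max-num-seqs and --cudagraph-capture-sizes values come from the paragraph above; --max-model-len 32768 is my assumption for how the 32k context window would be requested, and the helper name is hypothetical. The snippet only prints the command rather than launching the server.

```python
import shlex


def build_vllm_command(model="DataSnake/Muse-12B-NVFP4-FP8", max_model_len=32768):
    """Build the argument list for a single-sequence vLLM server."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),  # assumption: 32k context window
        "--max-num-seqs", "1",                  # one sequence at a time
        "--cudagraph-capture-sizes", "2",       # flag value from the notes above
    ]


command = build_vllm_command()
print(shlex.join(command))
# To actually launch the server: subprocess.run(command, check=True)
```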
Details about the 0.20.0 memory issue
From what I can determine, compiling the CUDA graph for the model uses enough VRAM that there isn't enough left to allocate the full KV cache. In both cases, the first run mentions saving something to a cache and the second doesn't. Also in both cases, the first run reports 4.16 GiB of VRAM available for the KV cache before crashing due to lack of memory, while the second run reports 5.2 GiB and doesn't crash. For reference, a 32768-token KV cache for this model uses precisely 5.00 GiB.
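The 5.00 GiB figure can be reproduced with a short calculation. The architecture numbers below (40 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published config of the base model, Mistral-Nemo-Base-2407, and the cache is assumed to be stored in a 16-bit format.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of a full KV cache: one key and one value vector
    per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


# Assumed Mistral Nemo shape: 40 layers, 8 KV heads, head_dim 128, 16-bit cache.
size = kv_cache_bytes(num_layers=40, num_kv_heads=8, head_dim=128, context_len=32768)
print(f"{size / 2**30:.2f} GiB")
```

Under those assumptions, the 4.16 GiB available on the first run falls short of the cache and the 5.2 GiB available on the second run is just enough.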
Extra steps when using Aphrodite Engine 0.10.0
For the purposes of these instructions, I'm assuming you have Aphrodite Engine 0.10.0 installed in a Python 3.12 uv venv, as per the official instructions.
First, update compressed-tensors to a more recent version:
uv pip install "compressed-tensors>=0.14.0"
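If you want to confirm the update took effect, a quick check along these lines should work when run inside the venv. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the comparison only looks at the leading numeric part of the version, so pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (0, 14, 0)


def release_tuple(version_string, width=3):
    """Leading numeric release components, e.g. '0.14.1rc1' -> (0, 14, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        return ()
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad short versions such as '0.14' so tuple comparison behaves as expected.
    return parts + (0,) * max(0, width - len(parts))


def meets_minimum(version_string, minimum=MIN_VERSION):
    """True if the version's release components are at least the minimum."""
    return release_tuple(version_string) >= minimum


def installed_ok(package="compressed-tensors"):
    """True if the package is installed and new enough."""
    try:
        return meets_minimum(version(package))
    except PackageNotFoundError:
        return False


print("compressed-tensors is new enough:", installed_ok())
```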
Next, open <venv directory>/lib/python3.12/site-packages/aphrodite/platforms/interface.py in your text editor of choice and comment out or delete lines 487-491. To make sure you're in the right place, the lines should initially look like this:
logger.warning(
    "Current platform %s does not have '%s' attribute.",
    self.device_type,
    key,
)
Recommended Generation Settings
These settings combine the recommendations from the Muse-12B model card and the AI Dungeon Model Guide:
- Temperature: 1.0
- Top K: 250
- Top P: 1
- Min P: 0.025
- Repetition Penalty: 1.05
- Presence Penalty: 0.25
If using programs that support DRY and XTC (at time of writing, Aphrodite Engine supports both and vLLM doesn't support either yet), you can also try using them to cut down on repetition if necessary.
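For an OpenAI-compatible server, the settings listed above might be packed into a request body as sketched below. Note that top_k, min_p, and repetition_penalty are not part of the standard OpenAI schema; vLLM accepts them as extra parameters, and I'm assuming other servers use the same field names. The max_tokens value and the helper name are arbitrary choices for illustration, and the snippet only builds and prints the JSON rather than sending it.

```python
import json

# Recommended sampling settings from the list above.
RECOMMENDED_SAMPLING = {
    "temperature": 1.0,
    "top_k": 250,
    "top_p": 1.0,
    "min_p": 0.025,
    "repetition_penalty": 1.05,
    "presence_penalty": 0.25,
}


def build_chat_request(messages, model="DataSnake/Muse-12B-NVFP4-FP8", max_tokens=256):
    """Return a chat-completions request body with the recommended settings."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    payload.update(RECOMMENDED_SAMPLING)
    return payload


request_body = build_chat_request(
    [{"role": "user", "content": "> You peer into the darkness."}]
)
print(json.dumps(request_body, indent=2))
```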
Prompt Format
The calibration data used the same ChatML tags that were used to fine-tune Latitude's 12B models:
<|im_start|>system
You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
<|im_start|>user
> You peer into the darkness.<|im_end|>
<|im_start|>assistant
You have been eaten by a grue.<|im_end|>
As such, I would recommend using that format for inference.
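If your frontend doesn't apply a chat template for you, the format above can be built by hand. This is a minimal sketch; the function name is hypothetical, and the trailing assistant header follows the usual ChatML convention for prompting a reply, which I'm assuming applies here.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as a ChatML prompt string."""
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    prompt = "\n".join(turns)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "\n<|im_start|>assistant\n"
    return prompt


prompt = to_chatml([
    {"role": "system", "content": "You're a masterful storyteller and gamemaster."},
    {"role": "user", "content": "> You peer into the darkness."},
])
print(prompt)
```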
Credits
Muse-12B was made by Latitude Games with help from Gryphe Padar.
Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han.
Citation
@misc{cook2025sixaccuratenvfp4quantization,
  title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
  author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
  year={2025},
  eprint={2512.02010},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.02010},
}
Model tree for DataSnake/Muse-12B-NVFP4-FP8
Base model
mistralai/Mistral-Nemo-Base-2407