Request: Official NVFP4 and AWQ versions for RTX 5090 users
#7
by 0xburakcelik - opened
Hey Jackrong,
First of all, thanks a lot for this model — the reasoning quality is really impressive.
I’m running it on an RTX 5090 (32 GB). The GGUF Q4 version works great (~16–17 GB, 50–55 t/s with llama.cpp), but I’d love to use vLLM with speculative decoding to push it to 120–130+ t/s.
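For reference, this is roughly how I'd serve an AWQ build — a sketch only; the model ID is a placeholder and flag names can differ between vLLM versions, so check the vLLM docs for your release:

```shell
# Hypothetical serve command for an AWQ checkpoint (model ID is a placeholder).
# Speculative-decoding flags vary by vLLM version, so I've left them out here;
# see the vLLM engine-arguments docs for the option your version supports.
vllm serve Jackrong/<awq-model-id> \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```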
Right now the community AWQ is around 19 GB, which is usable, but since the 5090 is a Blackwell card, NVFP4 would probably give even better speed and efficiency.
Would you consider uploading official AWQ and especially NVFP4 (or mixed-precision NVFP4) versions of the v2 model? It would help a lot of 50-series users who want to run it with vLLM.
No pressure at all, just a request from someone who really likes the model 🙂