Request: FP8 Quantized Version (based on Qwen3.5-27B-FP8)

#1 opened by OliveiraNickolas

Hi @Jackrong ,

First of all, thank you for sharing this distilled version! The mix of Qwen 3.5 with the reasoning capabilities of Claude 4.6/Opus is very exciting.

Would it be possible for you to create an FP8 version of this specific model?

Since there is already an official FP8 base available (Qwen/Qwen3.5-27B-FP8), having this distilled "v2" in FP8 would be incredible. It would allow many of us to run the model with much better VRAM efficiency and speed without the significant quality loss of lower quantizations.

Thanks for your hard work and for contributing to the community!

Hello, and thank you so much for your support and encouragement!

Since I’m currently using cloud GPU resources at my own expense, things can get a bit tight at times. Still, I’ll do my best to work on the FP8 version.

I really appreciate your suggestion and your patience in following the project.

@OliveiraNickolas
Do you mean trained on the FP8 model or this model being quantized to FP8?

Since llm-compressor doesn't have FP8 support yet, I quantized the model to MXFP4A16 and NVFP4A16 instead. Accuracy was evaluated with lm-evaluation-harness on the leaderboard_math_hard task set.
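For readers unfamiliar with these schemes, here is a minimal, self-contained sketch of the idea behind MXFP4A16/NVFP4A16 (this is an illustration, not llm-compressor's actual implementation): weights are rounded to the small E2M1 (FP4) value grid using a shared per-group scale, while activations stay in 16-bit precision.

```python
# Illustrative sketch of weight-only FP4 (E2M1) quantization with a
# per-group scale -- the core idea behind MXFP4A16/NVFP4A16 schemes.
# Not llm-compressor's implementation; group size and scale choice are
# simplified assumptions.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 magnitudes

def quantize_group(weights):
    """Quantize one group of weights to FP4, returning dequantized values and the scale."""
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 6.0  # map the largest |w| onto FP4's max magnitude (6.0)
    deq = []
    for w in weights:
        mag = abs(w) / scale
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))   # round to nearest grid point
        deq.append(q * scale if w >= 0 else -q * scale)  # value after dequantization
    return deq, scale

group = [0.12, -0.5, 0.33, 0.9, -0.01, 0.07, -0.25, 0.6]
deq, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, deq))
print(f"scale={scale:.4f}  max abs error={max_err:.4f}")
```

Each weight costs 4 bits plus a small amortized share of the group scale, which is where the memory savings over FP8 or BF16 come from, at the price of a coarser value grid.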


I mean applying the same approach Jackrong used with the standard Qwen3.5-27B, but starting from the Qwen3.5-27B-FP8 model that Qwen released.

I'm specifically asking because I run models on vLLM with a 48GB VRAM setup, and the non-FP8 version of Qwen3.5-27B just won't fit on my hardware.
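The VRAM constraint is easy to sanity-check with back-of-envelope arithmetic. Assuming roughly 27B parameters and counting the weights only (KV cache, activations, and runtime overhead excluded), a quick sketch:

```python
# Back-of-envelope VRAM estimate for the model weights alone.
# Assumption: ~27B parameters; KV cache and activation memory excluded.
PARAMS = 27e9

def weight_gib(bytes_per_param):
    """Weight memory in GiB at the given bytes per parameter."""
    return PARAMS * bytes_per_param / 2**30

for name, bpp in [("BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{name:>4}: {weight_gib(bpp):6.1f} GiB")
```

At BF16 the weights alone come to about 50 GiB, which already exceeds a 48GB card before any KV cache is allocated; FP8 halves that to roughly 25 GiB, leaving headroom for inference.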
