Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic

FP8_DYNAMIC quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but quantizes most linear layers to FP8 W8A8 while leaving the most sensitive projections and sidecar components in BF16.

The published folder includes:

  • model.safetensors
  • model.safetensors.index.json
  • model.mtp.safetensors
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json
  • recipe.yaml

Verified Inference

Local export and sanity-check evaluation were verified on 2026-03-31 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • transformers==5.3.0
  • llm-compressor==0.10.1.dev40+g5ae2e149
  • vllm==0.17.1

What was verified in that run:

  • the FP8 export completed successfully
  • model.mtp.safetensors was restored into the output folder
  • the checkpoint loads in transformers with device_map="auto"
  • a quick perplexity sanity check against the BF16 source completed successfully

vLLM remains the intended serving path for this model family, but full local serve validation of this exact FP8 v2 export is still pending.

Quantization Strategy

Uniform FP8_DYNAMIC quantization using llm-compressor:

  • FP8 W8A8: most Linear layers
  • BF16: lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar
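
A strategy like this usually corresponds to a compact llm-compressor recipe. The following is a hedged sketch of what the shipped recipe.yaml plausibly contains; the ignore patterns are assumptions inferred from the BF16 list above, not copied from the actual file:

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: FP8_DYNAMIC
      ignore:
        - "lm_head"
        - "re:.*embed_tokens.*"
        - "re:.*self_attn\\.o_proj.*"
        - "re:.*linear_attn\\.out_proj.*"
        - "re:.*in_proj_a.*"
        - "re:.*in_proj_b.*"
        - "re:.*visual.*"
```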

FP8 details:

  • weights: FP8, per-channel, static scales
  • input activations: FP8, per-token, dynamic scales
  • output activations: not explicitly quantized
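
The two scaling modes above differ in when and over what the scale is computed. A minimal pure-Python sketch of the idea (no real FP8 types; E4M3's maximum representable magnitude is 448.0):

```python
# FP8 E4M3 maximum representable magnitude.
E4M3_MAX = 448.0

def per_channel_static_scales(weight_rows):
    # Weights: one scale per output channel, computed once at export time
    # and stored alongside the quantized tensor.
    return [max(abs(w) for w in row) / E4M3_MAX for row in weight_rows]

def per_token_dynamic_scale(token_activations):
    # Activations: one scale per token, recomputed on every forward pass
    # from that token's observed max magnitude.
    return max(abs(a) for a in token_activations) / E4M3_MAX

weights = [[0.5, -2.0, 1.0], [4.48, 0.1, -0.3]]
weight_scales = per_channel_static_scales(weights)   # one scale per row
act_scale = per_token_dynamic_scale([0.25, -1.12, 0.9])
```

The static weight scales cost nothing at inference time; the dynamic activation scales trade a small per-token reduction for better coverage of activation outliers.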

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
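
As a quick consistency check on those numbers: with 64 text layers and full_attention_interval=4, the hybrid stack splits into 16 full softmax-attention layers and 48 DeltaNet layers (assuming the interval means every fourth layer uses full attention, which is an assumption about the config semantics, not something stated in the checkpoint):

```python
num_layers = 64
full_attention_interval = 4

# Assumed semantics: every 4th layer is full softmax attention,
# the remaining layers use DeltaNet linear attention.
full_attn = [i for i in range(num_layers) if (i + 1) % full_attention_interval == 0]
delta_net = num_layers - len(full_attn)
print(len(full_attn), delta_net)  # 16 48
```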

Local Benchmark Slice

A quick local sanity comparison against the BF16 v2 source model, run on 2026-03-31 on a small FineWeb-Edu perplexity slice.

This is a reduced verification slice, not a full benchmark run.

  • FineWeb-Edu perplexity (20 samples, 13,531 tokens, max_len=1024): BF16 v2 = 7.0758, FP8 v2 = 7.1051

Absolute perplexity delta:

  • +0.0293
  • about +0.41% relative to BF16
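
The delta is straightforward to reproduce from the two perplexity values:

```python
bf16_ppl = 7.0758
fp8_ppl = 7.1051

delta = fp8_ppl - bf16_ppl
rel = delta / bf16_ppl * 100

print(f"{delta:+.4f}")   # +0.0293
print(f"{rel:+.2f}%")    # +0.41%
```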

Usage

vLLM

pip install -U vllm==0.17.1 transformers==5.3.0

Expected serving command:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP enabled:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
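
Once a server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal stdlib-only client sketch; the localhost URL and port 8000 are vLLM's defaults, and the uncommented lines only build the request:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against vLLM's
# default endpoint (http://localhost:8000).
payload = {
    "model": "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    "messages": [
        {"role": "user", "content": "Explain FP8 quantization in one sentence."}
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With a running server, send the request and read the reply:
# response = json.load(urllib.request.urlopen(req))
# print(response["choices"][0]["message"]["content"])
```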

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    trust_remote_code=True,
)

Compatibility

  • vLLM >= 0.17.0: Expected. Intended serving path; this exact v2 FP8 export still needs full local serve re-validation.
  • transformers >= 5.3.0: Yes. Direct loading works with device_map="auto".
  • SGLang: Unknown. Not verified for this export.

Notes

  • This export keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 rather than FP8.
  • The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.
  • The perplexity numbers above are intended as a quick sanity check, not as an exhaustive benchmark submission.