Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic

FP8_DYNAMIC quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but quantizes most linear layers to FP8 W8A8 while leaving the most sensitive projections and sidecar components in BF16.

The published folder includes:

  • model.safetensors
  • model.safetensors.index.json
  • model.mtp.safetensors
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json
  • recipe.yaml

Verified Inference

Local export and sanity-check evaluation were verified on 2026-03-31 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • transformers==5.3.0
  • llm-compressor==0.10.1.dev40+g5ae2e149
  • vllm==0.17.1

What was verified in that run:

  • the FP8 export completed successfully
  • model.mtp.safetensors was restored into the output folder
  • the checkpoint loads in transformers with device_map="auto"
  • a quick perplexity sanity check against the BF16 source completed successfully

vLLM remains the intended serving path for this model family, but full local serve validation of this exact FP8 v2 export is still pending.

Quantization Strategy

Uniform FP8_DYNAMIC quantization using llm-compressor:

  • FP8 W8A8: most Linear layers
  • BF16: lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar
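
A strategy like this usually corresponds to a compact llm-compressor recipe. The following is a hedged sketch of what the shipped recipe.yaml plausibly contains; the ignore patterns are assumptions inferred from the BF16 list above, not copied from the actual file:

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: FP8_DYNAMIC
      ignore:
        - "lm_head"
        - "re:.*embed_tokens.*"
        - "re:.*self_attn\\.o_proj.*"
        - "re:.*linear_attn\\.out_proj.*"
        - "re:.*in_proj_a.*"
        - "re:.*in_proj_b.*"
        - "re:.*visual.*"
```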

FP8 details:

  • weights: FP8, per-channel, static scales
  • input activations: FP8, per-token, dynamic scales
  • output activations: not explicitly quantized
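
The two scaling modes above differ in when and over what the scale is computed. A minimal pure-Python sketch of the idea (no real FP8 types; E4M3's maximum representable magnitude is 448.0):

```python
# FP8 E4M3 maximum representable magnitude.
E4M3_MAX = 448.0

def per_channel_static_scales(weight_rows):
    # Weights: one scale per output channel, computed once at export time
    # and stored alongside the quantized tensor.
    return [max(abs(w) for w in row) / E4M3_MAX for row in weight_rows]

def per_token_dynamic_scale(token_activations):
    # Activations: one scale per token, recomputed on every forward pass
    # from that token's observed max magnitude.
    return max(abs(a) for a in token_activations) / E4M3_MAX

weights = [[0.5, -2.0, 1.0], [4.48, 0.1, -0.3]]
weight_scales = per_channel_static_scales(weights)   # one scale per row
act_scale = per_token_dynamic_scale([0.25, -1.12, 0.9])
```

The static weight scales cost nothing at inference time; the dynamic activation scales trade a small per-token reduction for better coverage of activation outliers.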

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
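
As a quick consistency check on those numbers: with 64 text layers and full_attention_interval=4, the hybrid stack splits into 16 full softmax-attention layers and 48 DeltaNet layers (assuming the interval means every fourth layer uses full attention, which is an assumption about the config semantics, not something stated in the checkpoint):

```python
num_layers = 64
full_attention_interval = 4

# Assumed semantics: every 4th layer is full softmax attention,
# the remaining layers use DeltaNet linear attention.
full_attn = [i for i in range(num_layers) if (i + 1) % full_attention_interval == 0]
delta_net = num_layers - len(full_attn)
print(len(full_attn), delta_net)  # 16 48
```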

Local Benchmark Slice

A quick local sanity comparison against the BF16 v2 source model, run on 2026-03-31 on a small FineWeb-Edu perplexity slice.

This is a reduced verification slice, not a full benchmark run.

  • FineWeb-Edu perplexity (20 samples, 13,531 tokens, max_len=1024): BF16 v2 = 7.0758, FP8 v2 = 7.1051

Absolute perplexity delta:

  • +0.0293
  • about +0.41% relative to BF16
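
The delta is straightforward to reproduce from the two perplexity values:

```python
bf16_ppl = 7.0758
fp8_ppl = 7.1051

delta = fp8_ppl - bf16_ppl
rel = delta / bf16_ppl * 100

print(f"{delta:+.4f}")   # +0.0293
print(f"{rel:+.2f}%")    # +0.41%
```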

Usage

vLLM

pip install -U vllm==0.17.1 transformers==5.3.0

Expected serving command:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP enabled:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
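
Once a server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal stdlib-only client sketch; the localhost URL and port 8000 are vLLM's defaults, and the uncommented lines only build the request:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against vLLM's
# default endpoint (http://localhost:8000).
payload = {
    "model": "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    "messages": [
        {"role": "user", "content": "Explain FP8 quantization in one sentence."}
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With a running server, send the request and read the reply:
# response = json.load(urllib.request.urlopen(req))
# print(response["choices"][0]["message"]["content"])
```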

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    trust_remote_code=True,
)

Compatibility

  • vLLM >= 0.17.0: Expected. Intended serving path; this exact v2 FP8 export still needs full local serve re-validation.
  • transformers >= 5.3.0: Yes. Direct loading works with device_map="auto".
  • SGLang: Unknown. Not verified for this export.

Notes

  • This export keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 rather than FP8.
  • The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.
  • The perplexity numbers above are intended as a quick sanity check, not as an exhaustive benchmark submission.