Math Performance
The model seems to have regressed on math tasks compared to the base Qwen3.5-27B. This is most likely because the Claude 4.6 Opus datasets are mostly about code, and could possibly be fixed with more variation in the training data.
Tests were done with the lm-eval harness on a vLLM backend.
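For reference, a run like the one below can be sketched with lm-eval's Python API on a vLLM backend. This is an assumption about the setup, not the exact command used here: the model path is a placeholder, and `model_args` will depend on your hardware.

```python
# Sketch: evaluating a model on leaderboard_math_hard with the lm-eval
# harness using its vLLM backend. Requires a GPU and the model weights;
# "<your-model-path>" is a placeholder, not the actual checkpoint name.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<your-model-path>,dtype=auto,gpu_memory_utilization=0.9",
    tasks=["leaderboard_math_hard"],
    num_fewshot=4,  # matches the n-shot column in the table below
)

# Per-task metrics (exact_match, exact_match_original, ...) live here.
print(results["results"])
```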
| Tasks | Version | Filter | n-shot | Metric | Baseline | Distilled-v2 |
|---|---|---|---|---|---|---|
| leaderboard_math_hard | | none | | exact_match ↑ | 0.6511 | 0.5521 |
| - leaderboard_math_algebra_hard | 3 | none | 4 | exact_match ↑ | 0.8893 | 0.7655 |
| | | none | 4 | exact_match_original ↑ | 0.7850 | 0.6515 |
| - leaderboard_math_counting_and_prob_hard | 3 | none | 4 | exact_match ↑ | 0.6504 | 0.5772 |
| | | none | 4 | exact_match_original ↑ | 0.5285 | 0.4146 |
| - leaderboard_math_geometry_hard | 3 | none | 4 | exact_match ↑ | 0.5152 | 0.3864 |
| | | none | 4 | exact_match_original ↑ | 0.4545 | 0.3258 |
| - leaderboard_math_intermediate_algebra_hard | 3 | none | 4 | exact_match ↑ | 0.3821 | 0.3357 |
| | | none | 4 | exact_match_original ↑ | 0.3464 | 0.2857 |
| - leaderboard_math_num_theory_hard | 3 | none | 4 | exact_match ↑ | 0.7532 | 0.5909 |
| | | none | 4 | exact_match_original ↑ | 0.6688 | 0.4935 |
| - leaderboard_math_prealgebra_hard | 3 | none | 4 | exact_match ↑ | 0.8187 | 0.7720 |
| | | none | 4 | exact_match_original ↑ | 0.6580 | 0.5596 |
| - leaderboard_math_precalculus_hard | 3 | none | 4 | exact_match ↑ | 0.4444 | 0.2963 |
| | | none | 4 | exact_match_original ↑ | 0.3630 | 0.2519 |
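To put a single number on the regression, the per-subtask drops (4-shot `exact_match`) can be computed directly from the table above:

```python
# Baseline vs. Distilled-v2 exact_match scores, copied from the table above.
baseline = {
    "algebra": 0.8893,
    "counting_and_prob": 0.6504,
    "geometry": 0.5152,
    "intermediate_algebra": 0.3821,
    "num_theory": 0.7532,
    "prealgebra": 0.8187,
    "precalculus": 0.4444,
}
distilled = {
    "algebra": 0.7655,
    "counting_and_prob": 0.5772,
    "geometry": 0.3864,
    "intermediate_algebra": 0.3357,
    "num_theory": 0.5909,
    "prealgebra": 0.7720,
    "precalculus": 0.2963,
}

# Drop per subtask, in percentage points.
drops_pp = {task: round((baseline[task] - distilled[task]) * 100, 2)
            for task in baseline}
mean_drop_pp = round(sum(drops_pp.values()) / len(drops_pp), 2)

print(drops_pp)
print(f"mean drop: {mean_drop_pp} pp")
```

Averaged over the seven subtasks this is roughly a 10.4 pp drop, with number theory and precalculus hit hardest and (pre/intermediate) algebra least affected.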
Hi, thank you for your evaluation and detailed results, really appreciate it!
I've just completed a full evaluation on MMLU-Pro. Compared to the official base model, this version shows a drop of about 7.24 percentage points, indicating some reduction in general knowledge reasoning performance.
This trade-off has now been clearly noted in the model card.
Do you think more varied distillation data would help preserve or improve performance on all tasks? It seems that the Claude-4.6-Opus datasets are mostly code generation, which makes sense since it's the best code model.
For me, the original Qwen's reasoning takes 6 minutes and this model's takes less than a minute. But the original Qwen gets the answer to the nearest minute, while this model only gets it to the nearest hour. So the faster reasoning has drawbacks I would not accept; then again, who wants to wait 6 minutes for an answer?
I would say the end goal should be reasoning efficiency superior to Qwen3.5's while matching or improving on its performance.