Math Performance

#4
by selimaktas - opened

The model seems to have regressed on math tasks compared to the base Qwen3.5-27B. This is most likely because the Claude 4.6 Opus datasets are predominantly code-focused; more variation in the distillation data might fix it.

Tests were run with the lm-eval harness on vLLM.

| Tasks | Version | Filter | n-shot | Metric | Baseline | Distilled-v2 |
|---|---|---|---|---|---|---|
| leaderboard_math_hard | | none | | exact_match ↑ | 0.6511 | 0.5521 |
| - leaderboard_math_algebra_hard | 3 | none | 4 | exact_match ↑ | 0.8893 | 0.7655 |
| | | none | 4 | exact_match_original ↑ | 0.7850 | 0.6515 |
| - leaderboard_math_counting_and_prob_hard | 3 | none | 4 | exact_match ↑ | 0.6504 | 0.5772 |
| | | none | 4 | exact_match_original ↑ | 0.5285 | 0.4146 |
| - leaderboard_math_geometry_hard | 3 | none | 4 | exact_match ↑ | 0.5152 | 0.3864 |
| | | none | 4 | exact_match_original ↑ | 0.4545 | 0.3258 |
| - leaderboard_math_intermediate_algebra_hard | 3 | none | 4 | exact_match ↑ | 0.3821 | 0.3357 |
| | | none | 4 | exact_match_original ↑ | 0.3464 | 0.2857 |
| - leaderboard_math_num_theory_hard | 3 | none | 4 | exact_match ↑ | 0.7532 | 0.5909 |
| | | none | 4 | exact_match_original ↑ | 0.6688 | 0.4935 |
| - leaderboard_math_prealgebra_hard | 3 | none | 4 | exact_match ↑ | 0.8187 | 0.7720 |
| | | none | 4 | exact_match_original ↑ | 0.6580 | 0.5596 |
| - leaderboard_math_precalculus_hard | 3 | none | 4 | exact_match ↑ | 0.4444 | 0.2963 |
| | | none | 4 | exact_match_original ↑ | 0.3630 | 0.2519 |
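To make the regression pattern easier to see, here is a small sketch that computes the per-subtask drop in percentage points from the `exact_match` scores in the table above (note that an unweighted mean of these would differ slightly from the sample-weighted aggregate on the first row):

```python
# Per-subtask exact_match scores copied from the table above
# (baseline Qwen3.5-27B vs. the distilled model).
scores = {
    "algebra_hard":              (0.8893, 0.7655),
    "counting_and_prob_hard":    (0.6504, 0.5772),
    "geometry_hard":             (0.5152, 0.3864),
    "intermediate_algebra_hard": (0.3821, 0.3357),
    "num_theory_hard":           (0.7532, 0.5909),
    "prealgebra_hard":           (0.8187, 0.7720),
    "precalculus_hard":          (0.4444, 0.2963),
}

# Drop in percentage points per subtask, largest regression first.
drops = sorted(
    ((task, (base - dist) * 100) for task, (base, dist) in scores.items()),
    key=lambda item: -item[1],
)
for task, pp in drops:
    print(f"{task:28s} -{pp:.2f} pp")
```

Number theory (−16.23 pp), precalculus (−14.81 pp), and geometry (−12.88 pp) regress the most, which is consistent with the distillation data being mostly code.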

Hi, thank you for your evaluation and detailed results, really appreciate it!

I've just completed a full evaluation on MMLU-Pro. Compared to the official base model, this version shows a drop of about 7.24 percentage points, indicating some reduction in general-knowledge reasoning performance.

This trade-off has now been clearly noted in the model card.

Do you think more varied distillation data would help preserve or improve performance across all tasks? It seems that the Claude-4.6-Opus datasets are mostly code generation, which makes sense since it's the best code model.

For me, the original Qwen's reasoning takes 6 minutes, while this model takes less than a minute. However, the original Qwen gets the answer accurate to the minute, while this model only gets it to the hour. So the faster reasoning has accuracy drawbacks I would not accept, but then again, who wants to wait 6 minutes for an answer?

I would say the end goal should be superior reasoning efficiency to Qwen3.5 while matching or improving its performance.
