Math Performance

#4
by selimaktas - opened

The model seems to have regressed on math tasks compared to the base Qwen3.5-27B. This is most likely because the Claude 4.6 Opus datasets are predominantly code-focused; more variation in the distillation data might fix it.

Tests were run with the lm-eval harness on vLLM.

| Tasks | Version | Filter | n-shot | Metric | Baseline | Distilled-v2 |
|---|---|---|---|---|---|---|
| leaderboard_math_hard | | none | | exact_match ↑ | 0.6511 | 0.5521 |
| - leaderboard_math_algebra_hard | 3 | none | 4 | exact_match ↑ | 0.8893 | 0.7655 |
| | | none | 4 | exact_match_original ↑ | 0.7850 | 0.6515 |
| - leaderboard_math_counting_and_prob_hard | 3 | none | 4 | exact_match ↑ | 0.6504 | 0.5772 |
| | | none | 4 | exact_match_original ↑ | 0.5285 | 0.4146 |
| - leaderboard_math_geometry_hard | 3 | none | 4 | exact_match ↑ | 0.5152 | 0.3864 |
| | | none | 4 | exact_match_original ↑ | 0.4545 | 0.3258 |
| - leaderboard_math_intermediate_algebra_hard | 3 | none | 4 | exact_match ↑ | 0.3821 | 0.3357 |
| | | none | 4 | exact_match_original ↑ | 0.3464 | 0.2857 |
| - leaderboard_math_num_theory_hard | 3 | none | 4 | exact_match ↑ | 0.7532 | 0.5909 |
| | | none | 4 | exact_match_original ↑ | 0.6688 | 0.4935 |
| - leaderboard_math_prealgebra_hard | 3 | none | 4 | exact_match ↑ | 0.8187 | 0.7720 |
| | | none | 4 | exact_match_original ↑ | 0.6580 | 0.5596 |
| - leaderboard_math_precalculus_hard | 3 | none | 4 | exact_match ↑ | 0.4444 | 0.2963 |
| | | none | 4 | exact_match_original ↑ | 0.3630 | 0.2519 |
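To make the regression pattern easier to see, here is a small sketch that computes the per-subtask drop in percentage points from the `exact_match` scores in the table above (note that an unweighted mean of these would differ slightly from the sample-weighted aggregate on the first row):

```python
# Per-subtask exact_match scores copied from the table above
# (baseline Qwen3.5-27B vs. the distilled model).
scores = {
    "algebra_hard":              (0.8893, 0.7655),
    "counting_and_prob_hard":    (0.6504, 0.5772),
    "geometry_hard":             (0.5152, 0.3864),
    "intermediate_algebra_hard": (0.3821, 0.3357),
    "num_theory_hard":           (0.7532, 0.5909),
    "prealgebra_hard":           (0.8187, 0.7720),
    "precalculus_hard":          (0.4444, 0.2963),
}

# Drop in percentage points per subtask, largest regression first.
drops = sorted(
    ((task, (base - dist) * 100) for task, (base, dist) in scores.items()),
    key=lambda item: -item[1],
)
for task, pp in drops:
    print(f"{task:28s} -{pp:.2f} pp")
```

Number theory (−16.23 pp), precalculus (−14.81 pp), and geometry (−12.88 pp) regress the most, which is consistent with the distillation data being mostly code.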

Hi, thank you for your evaluation and detailed results, really appreciate it!

I've just completed a full evaluation on MMLU-Pro. Compared to the official base model, this version shows a drop of about 7.24 percentage points, indicating some reduction in general-knowledge reasoning performance.

This trade-off has now been clearly noted in the model card.

Do you think more varied distillation data would help preserve or improve performance across all tasks? It seems that the Claude-4.6-Opus datasets are mostly code generation, which makes sense since it's the best code model.

For me, the original Qwen's reasoning takes 6 minutes, while this model takes less than a minute. However, the original Qwen gets the answer accurate to the minute, while this model only gets it to the hour. So the faster reasoning has accuracy drawbacks I would not accept, but then again, who wants to wait 6 minutes for an answer?

I would say the end goal should be superior reasoning efficiency to Qwen3.5 while matching or improving its performance.
