I find these models interesting, but I have a couple of thoughts.
Firstly, since the test scores of the Q4_K_M quantizations are nearly identical to those of the 16-bit originals, logic dictates that all comparisons (e.g. size and speed) should be made against Q4_K_M, not against 16-bit.
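To make the size side of that concrete, here's a back-of-the-envelope sketch. The bits-per-weight figures are approximations I'm assuming for illustration (Q4_K_M averages somewhere around 4.85 bpw in practice; exact sizes vary by architecture):

```python
# Rough on-disk size arithmetic for a dense model (illustrative only;
# bpw values are approximate and real files include metadata overhead).
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size in GB: parameters * bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

# Compare an 8B-parameter model at a few precisions.
for label, bpw in [("FP16", 16.0), ("Q4_K_M (~4.85 bpw)", 4.85), ("ternary (~1.58 bpw)", 1.58)]:
    print(f"8B @ {label}: {model_size_gb(8, bpw):.1f} GB")
```

The point being: against FP16 a low-bit model looks like a ~10x saving, but against the Q4_K_M everyone actually runs, the saving is closer to 3x.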
Secondly, while the MMLU-R scores only appear to drop moderately, this doesn't accurately measure how badly a model's knowledge is damaged. A heavily quantized model can still recognize the correct answer in a provided line-up, which is all a multiple-choice test requires, while its ability to fully and accurately recall that information unprompted, which is what real-world use actually demands, is greatly diminished.
After observing extreme knowledge loss in BitNet and the last Bonsai models, I did some research, and apparently this is an unavoidable consequence of low-bit LLMs. Low-resolution accuracy (e.g. coherence) can be largely preserved, but high-resolution accuracy (e.g. knowledge) cannot. And coherence without knowledge is non-viable outside a select few use cases.
I mean, of course they compare against 16-bit models; it's the same thing Google did with TurboQuant. The work is good, but if you compare against whatever gives you the best numbers instead of against what people actually use, it looks a lot better than it is.
Because this:
Looks much better than this (sizes not exact but good enough to illustrate the point):
And it looks even better than it would if we compared against Qwen3.5.