IQ4_XS vs. Q5_K_P
I was wondering, since iq quants usually provide better performance than k quants of the same bpw and with the q5_k_p quant also supposedly having better performance than a usual _K_M quant, whether they are equal in performance/quality or not. Does anyone have the hardware/time to run tests? Or has someone already run such tests?
I'd like to chime in and say that iq quants are overall inferior. AI is capable of nearly an infinite number of unique outputs, and any attempt to select a tiny biased dataset to determine what maximizes performance in a specific use case results in varying degrees of damage in other areas, in broad knowledge retention, and so on. Basically what it does is make an AI model less general-purpose and balanced in order to squeeze an almost imperceptible amount of extra performance under select circumstances.
It may appear to help at bit depths under 4, but really it's just helping to maintain coherency at the expense of knowledge. For example, the new 1-bit models being released are unusable despite their claim of nearly matching the performance of full float models. This is because coherence is low resolution (how words relate to each other), and knowledge is high resolution. So not only do they hallucinate far more about basic facts than full float models, but they also produce horrible synonym lists and so on.
This is all that's happening with iq quants. They're maintaining better coherency at lower bit depths, but are experiencing a large drop in total knowledge. It's basically turning LLMs into valley girls. All talk, no substance.
In short, iq quants bad, k quants good. And for some reason the k_p quants can be better than K_M when done right (maintaining coherency and high-resolution features like knowledge).
Are you referring to the Bonsai models (1.7b, 4b, 8b) from PrismMl?
@Data-vanOrtus Yes, the Bonsai models. I tried the 8b and it maintained coherency, but failed at high resolution tasks like knowledge retrieval, discerning subtle differences (e.g. synonym lists), and so on.
It's hard to isolate the impact of its 1-bit nature by comparing it to other models trained on a different corpus, such as Llama 3.1 8b, but Bonsai has orders of magnitude less broad knowledge, and performs far worse at tasks that require higher resolution (e.g. synonym lists and other tasks that require nuance between related words).
Didn't know why it performed worse at such tasks until I did some research, and then it made sense. For example, 1-bit LLMs perform nearly as well as full float LLMs on things like coherency (grammatical relationships of words), since these are low resolution tasks (little depth). But on high resolution tasks that require subtle differences to be discerned, such as who sang a particular song or a synonym list, it performs FAR worse than full-float LLMs.
Part of the issue is the multiple choice tests used to evaluate LLMs. That is, the loss of high resolution may prevent the accurate and full recall of relevant information, but it still allows the LLM to pick the correct answer out of a lineup. Even then, 8b Bonsai does worse than 4b models on knowledge focused multiple choice tests. There's a reason why they disabled the comment section. All model makers extensively test their models, and this one has numerous fatal flaws that make it non-viable, and they know it. Even massive trillion parameter 1-bit models are non-viable because they'll get their ass kicked in general capability by much smaller, cheaper, and faster models quantized down to 4-bit for inference.
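The multiple-choice point is easy to illustrate with a toy sketch. Everything here is made up for illustration (the token names, the scores, the "degradation"): the idea is that noise can push junk tokens above the correct answer, so open-ended recall picks the wrong token, yet the correct answer still wins once the field is narrowed to four options.

```python
# Hypothetical next-token scores for "the famous actor Tom ___".
scores = {"Cruise": 0.9, "Hardy": 0.7, "Hanks": 0.6, "Holland": 0.5}

# Simulate degradation: several unrelated tokens now outscore the
# correct answer in the full vocabulary.
junk = {f"token_{i}": 1.0 + i * 0.01 for i in range(5)}
degraded = {**scores, **junk}

free_recall = max(degraded, key=degraded.get)   # global argmax: a junk token wins
choices = ["Cruise", "Hardy", "Hanks", "Holland"]
multiple_choice = max(choices, key=degraded.get)  # restricted lineup: "Cruise" wins

print(free_recall)
print(multiple_choice)
```

So a benchmark built on lineups can report a near-miss as a full success, which is exactly why multiple-choice scores can flatter a model that fails at free recall.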
So I presume this is to say that the q5 quant will probably be the higher-quality choice?
@Data-vanOrtus q5 is really only better than q4 with some smaller models, which includes E4b & E2b. Initially I tested them at q4, but they hallucinated more than their full-float versions, so I ended up testing with q5.
There really is a sharp drop-off in quality below Q4_K_M, and by Q3_K_M the difference is stark. Q4 offers ~16 levels of precision, which is enough to keep the path through the weights almost identical to the full float path.
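To make the "levels of precision" point concrete, here's a minimal sketch of how round-trip error grows as the number of levels shrinks. The symmetric linear quantizer is my own simplification; real k-quants use block-wise scales and offsets, so treat this as illustrative only.

```python
import random

def quantize_roundtrip(weights, levels):
    """Quantize to `levels` evenly spaced values spanning the weight
    range, dequantize, and return the mean absolute round-trip error."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    err = 0.0
    for w in weights:
        q = round((w - lo) / step)          # nearest bin index
        err += abs(w - (lo + q * step))     # reconstruction error
    return err / len(weights)

random.seed(0)
ws = [random.gauss(0, 1) for _ in range(10_000)]
for bits in (8, 5, 4, 3, 2):
    print(f"{bits}-bit ({2**bits} levels): {quantize_roundtrip(ws, 2**bits):.4f}")
```

Each bit removed roughly doubles the per-weight error, which is why the gap between 4-bit and 3-bit feels so much larger than the gap between 5-bit and 4-bit.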
You can think of inference like taking a long walk through a forest: at every step, the 16 levels of precision push you left or right just enough to keep you on the same path as the full float weights, so by the end of a long hike (inference path) you end up in the same place (e.g. going from the famous actor Tom to Cruise vs Tom to Hardy or something else). In ~99% of cases you end up at the same token pool using 16 levels of path correction vs the full float corrections.
However, going even a little lower, such as to only 14 strengths of course-correction push at every step, regularly leads you to a different final destination (token pool). This is why 1-bit LLMs will never be viable. The final tokens are still reasonable, so the resultant sentences are coherent and grammatically correct, but they're almost always wrong (e.g. the famous actor Tom Hardy vs Tom Cruise, or "the giraffe swam gracefully through the water").
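The drift in that analogy can be sketched numerically. This is a toy model, not real transformer inference: each "step" just adds a random correction rounded to a fixed number of levels, and we watch how far the quantized hike ends up from the (effectively) full-precision one.

```python
import random

def walk(levels, steps=200, seed=1):
    """Toy 'hike': sum a fixed random sequence of per-step corrections,
    each rounded to `levels` evenly spaced values in [-1, 1].
    Returns the final position."""
    rng = random.Random(seed)
    step = 2.0 / (levels - 1)
    pos = 0.0
    for _ in range(steps):
        c = rng.uniform(-1.0, 1.0)       # the 'full float' correction
        pos += round(c / step) * step    # the quantized correction actually applied
    return pos

exact = walk(10**9)  # effectively full precision
for bits in (8, 4, 3, 2):
    drift = abs(walk(2**bits) - exact)
    print(f"{bits}-bit: drifted {drift:.3f} from the full-precision endpoint")
```

The per-step rounding errors are tiny, but they accumulate over the hike, so fewer levels means landing farther from the full-precision endpoint; in a real model that shows up as a different token winning the final argmax.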
Yeah they seemed a little too good....
Artificial Intelligence doesn’t exist, and the sooner people stop slapping the label “AI” on every chatbot that runs on LLMs, the better.
Honestly, the only useful thing about the term is how easily it exposes the “I just jumped on this because it’s trendy” crowd—it’s basically a built-in filter.
The moment someone uses that phrase, it’s a dead giveaway they don’t know what they’re talking about—which, at least, saves you the trouble of taking anything else they say seriously.
@Think4Me I agree. The transformer architecture doesn't even appear to be on the road to AI, let alone AGI. It's just a pattern matching mirror of human thoughts. But at this point AI is the most recognized identifier. Plus there really isn't a good alternative. LLM (large language model) doesn't account for things like multi-modality, world modeling, etc. But I'm with you, someone famous needs to come up with a better name than AI that becomes widely adopted. Until then I'm probably not going to stop using it. If it helps just think of AI as Ain't Intelligence.
Well....this type of system essentially just reads from a really fancy probability distribution based on its input....so a distribution sampler would be a fitting name...this also seems to be true for diffusion models as far as I could tell...for chatbots I just use next token predictor....with some world models that might need to be revised...
@Data-vanOrtus You're right, next token prediction doesn't work because it excludes diffusion models, so it comes down to the probabilistic nature of the sampler.
In fact, this shines a light on why current models aren't AI. For example, it's obvious to any human when we're outputting an immutable fact, like a proper noun. Humans look ahead and are aware of approaching an immutable fact, and commonly avoid awkward situations by skirting around bringing up a person's name we're not sure about.
But instead of the look-ahead and moment-to-moment awareness that humans have, current AI models only look behind. They just use the same blind sampling techniques on all tokens.
Any true AI would at least have enough awareness to recognize on the fly that the next word is an immutable fact and, unless it had high confidence, not use normal sampling; instead it would either look the fact up and then continue next token prediction, or pause next token prediction to do a few lateral inference passes in an attempt to raise its confidence. In short, if there's no look-ahead and on-the-fly awareness, then it's not AI.
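The "pause instead of guessing" idea can be sketched as an entropy gate on the sampler. Everything here is hypothetical: the threshold, the `lookup` stub, and the toy distributions are mine, not an existing API — it just shows one mechanical way a sampler could notice low confidence and defer.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def gated_next_token(probs, threshold=1.0, lookup=None):
    """Sample normally when the model is confident (low entropy);
    otherwise defer to an external source instead of guessing."""
    if entropy(probs) > threshold and lookup is not None:
        return lookup()                      # e.g. retrieval or re-ranking
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

# Confident distribution over a proper noun: sampling proceeds as usual.
confident = {"Cruise": 0.97, "Hardy": 0.02, "Hanks": 0.01}
# Uncertain distribution: the gate fires and the lookup stub is used.
uncertain = {"Cruise": 0.3, "Hardy": 0.25, "Hanks": 0.25, "Holland": 0.2}

print(gated_next_token(confident))
print(gated_next_token(uncertain, lookup=lambda: "<LOOKUP>"))
```

The design choice here is that the gate only changes behavior on high-entropy steps, so ordinary fluent text is sampled exactly as before and the cost of the fallback is paid only where a guess would likely be wrong.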
Or not even use ntp to "think" in the first place, instead actually think about the prompt and plan a response, and then merely use tokens/text to represent the result...I do not know how that would work or how you'd go about building such a system, however.