Q4_K Quantization Scheme adaptation #6760
-
So I see that to dequantize a weight in `Q4_K` format I have to do:

`y = s * q - m`

where `y` is the dequantized weight (float). Now the challenge is that I have hardware that expects the following scheme:

`y = s * (q - z)`

The difference from the above is that `z` is the zero point (int4). This is also the scheme that PyTorch uses: https://pytorch.org/blog/quantization-in-practice/

I want to use the parameters from llama.cpp with my hardware, so I tried to do some math: setting the two equations equal to each other and solving for `z` gives `z = round(m/s)`. When I simulate this adaptation, I get catastrophic accuracy loss, even without hardware involved. Is there something fundamentally wrong with this math? Is it not possible to reconcile these two quantization schemes?
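To make the failure concrete, here is a minimal sketch of that naive conversion (the constants and the single `(s, m)` pair are invented for illustration, and the real per-block structure of `Q4_K` is ignored). When `m/s` falls outside the `int4` range `0..15`, the clamped zero point can no longer represent the offset:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    // A single Q4_K-style (scale, min) pair: y = s*q - m, q in 0..15.
    const float s = 0.02f, m = 0.7f;

    // Naive conversion: matching s*q - m == s*(q - z) gives z = m/s.
    // Here m/s = 35, which cannot be stored as an int4 zero point,
    // so it gets clamped to 15.
    const int z = std::clamp((int)std::lround(m / s), 0, 15);

    for (int q = 0; q <= 15; ++q) {
        const float ref = s * q - m;    // original Q4_K dequantization
        const float got = s * (q - z);  // int4 affine scheme
        printf("q=%2d  ref=%+.4f  got=%+.4f  err=%+.4f\n",
               q, ref, got, ref - got);
    }
    return 0;
}
```

With these made-up constants `m/s = 35`, so `z` clamps to 15 and every weight in the block comes out shifted by `s * (35 - 15) = 0.4`, which is as large as the weights themselves.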
-
@ikawrakow I’m wondering if you might be able to help me with this math/quantization question. If you have more important things to do I completely understand.
If you want to use `y = s * (q - z)` where `q` and `z` are both `int4`, you are basically looking at something similar to a `Q4_0` quantization (being exactly `Q4_0` if `z = 8`). The whole point of `Q4_K` is that the offset from zero being used has better precision. If you still want to try with `Q4_K`, you need to scale the quants up (hopefully your hardware can operate efficiently on `int8_t`'s), i.e., use `s' = s/8`, `q' = 8*q`, and `z' = round(8*m/s)` (a sketch of this follows below). This will work most of the time, but you need to be careful with overflow of `8*m/s` (the `q`'s are in `0...15`, so `8*q` is in the allowed range of a signed 8-bit integer, so …
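A minimal sketch of that scaled-up scheme (again with invented constants and a single `(s, m)` pair; the `Q4_K` superblock bookkeeping is omitted). The zero point gains 3 bits of precision, and `8*q` stays within `int8_t`:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    // A single Q4_K-style (scale, min) pair: y = s*q - m, q in 0..15.
    const float s = 0.02f, m = 0.31f;

    // Scale the quants up by 8 so the zero point gets 3 extra bits:
    //   y = s' * (q' - z')  with  s' = s/8, q' = 8*q, z' = round(8*m/s).
    const float s8 = s / 8.0f;
    const int   z8 = (int)std::lround(8.0f * m / s);
    // Careful: z' must fit in int8_t; it overflows when 8*m/s > 127.
    const int8_t z = (int8_t)std::clamp(z8, -128, 127);

    for (int q = 0; q <= 15; ++q) {
        const int8_t q8  = (int8_t)(8 * q);   // always in 0..120, safe
        const float  ref = s * q - m;         // original Q4_K dequantization
        const float  got = s8 * (q8 - z);     // int8 affine scheme
        printf("q=%2d  ref=%+.4f  got=%+.4f  err=%+.5f\n",
               q, ref, got, ref - got);
    }
    return 0;
}
```

With these made-up constants `8*m/s = 124`, which just fits; a block with a larger `m/s` would overflow the `int8_t` zero point, which is the caveat the reply is pointing at.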