[REQUEST] Convert.py: Option to skip measurement when setting 8.0/8.0 #673
Comments
If you want to work around this, just grab any measurement.json from Hugging Face for the same base model. Bartowski does quants for most of the big models and usually puts the measurement.json in the main branch, e.g. https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-exl2/tree/main. Then just pass -m /path/to/measurement.json to the conversion script when you're doing 8bpw.
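As a rough sketch of that workaround, the snippet below only assembles the command line; it does not run the conversion. The `-b` and `-m` flags come from this thread. The `-i`, `-o` and `-cf` flags (input model, working directory, compiled output) are assumed from convert.py's usual options, and every path is a hypothetical placeholder, so check `python convert.py -h` before relying on it.

```python
import shlex


def build_convert_command(model_dir, work_dir, out_dir, measurement_json, bpw=8.0):
    """Assemble a convert.py invocation that reuses an existing measurement.json.

    Only -b and -m are taken from the discussion above. The -i, -o and -cf
    flags are assumptions about convert.py's interface; verify them with -h.
    """
    return [
        "python", "convert.py",
        "-i", model_dir,          # assumed: unquantized input model directory
        "-o", work_dir,           # assumed: working directory for temp files
        "-cf", out_dir,           # assumed: directory for the finished quant
        "-b", str(bpw),           # target bits per weight
        "-m", measurement_json,   # reuse a downloaded measurement, skip the pass
    ]


# All paths here are hypothetical placeholders.
cmd = build_convert_command(
    "/path/to/model",
    "/path/to/workdir",
    "/path/to/output",
    "/path/to/measurement.json",
)
print(shlex.join(cmd))
```

The measurement.json has to come from a quant of the same base model, as noted above.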
What concerns me the most is the lack of an option to manually override the optimization process. The system decides on its own which layers to quantize and to what degree, sometimes in situations where that isn't appropriate. For instance, I want to set a maximum quantization level of 8+ by specifying the parameters -b 8 [9,10,16,255]. However, this doesn't seem to matter, as the system still arbitrarily quantizes many layers to 4, 5, or 6 bpw. What's more frustrating is that every run selects a different set of layers to quantize lower. For example, in one run it might choose layers 3, 5, and 39, but after restarting with the same parameters it could switch to layers 4, 9, and 28, and so on. It would be great to have an option to explicitly specify which layers should not be optimized and should instead be quantized with the maximum value. It would also be useful to define specific quantization ranges for particular layers, for instance in an additional configuration file, which would make the process much more convenient and flexible.
Part of this is because 8 bpw requires some layers to use less than the maximum bitrate. The bitrate specified is the actual number of bits per weight including overhead. With that overhead, the actual maximum is about 8.05 bpw (it varies a bit depending on tensor shapes). I just checked and there was a slight inaccuracy in the optimizer which made it ever so slightly undershoot the target bitrate if the last annealing step left a tiny bit of the cost budget unused. This shouldn't happen, so I fixed it in the latest commit to the dev branch. With that, you should be able to set a target bitrate of e.g. 9 and always get the largest setting for each layer. Note that it's highly unlikely to make any practical difference since the reason this happens in the first place is that the measured difference between the highest and next-highest setting for a given layer is below the noise floor. I might add a shortcut to skip measurement and simply use the max bitrate as an option, but I'm also looking at completely reworking the quantization scheme anyway.
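To make the budget argument concrete, here is a toy sketch, not the actual optimizer (which uses annealing, per the comment above). It uses a simple greedy rule: start every layer at its largest setting and, while the weighted average is over the target, step down the layer whose downgrade costs the least error per bit saved. The layer names, sizes, bitrates and error values are all made up for illustration.

```python
def pick_settings(layers, target_bpw):
    """Greedy toy allocator: returns the chosen bpw per layer.

    layers: list of dicts with "num_weights" and "options", where options is a
    list of (bpw, measured_error) tuples sorted by ascending bpw.
    This is an illustrative stand-in, not the real annealing optimizer.
    """
    choice = [len(layer["options"]) - 1 for layer in layers]  # start at max
    total = sum(layer["num_weights"] for layer in layers)

    def average_bpw():
        bits = sum(l["num_weights"] * l["options"][c][0] for l, c in zip(layers, choice))
        return bits / total

    while average_bpw() > target_bpw:
        best = None  # (error added per bit saved, layer index)
        for i, (layer, c) in enumerate(zip(layers, choice)):
            if c == 0:
                continue  # already at the smallest setting
            hi_bpw, hi_err = layer["options"][c]
            lo_bpw, lo_err = layer["options"][c - 1]
            saved_bits = (hi_bpw - lo_bpw) * layer["num_weights"]
            cost = (lo_err - hi_err) / saved_bits
            if best is None or cost < best[0]:
                best = (cost, i)
        if best is None:
            break  # nothing left to downgrade
        choice[best[1]] -= 1

    return [layer["options"][c][0] for layer, c in zip(layers, choice)]


# Made-up numbers: the largest setting really costs 8.05 bpw with overhead,
# and the error gap to the next setting is tiny and nearly identical per layer.
layers = [
    {"name": "layer_a", "num_weights": 1000, "options": [(6.0, 0.01002), (8.05, 0.0100)]},
    {"name": "layer_b", "num_weights": 1000, "options": [(6.0, 0.01001), (8.05, 0.0100)]},
    {"name": "layer_c", "num_weights": 1000, "options": [(6.0, 0.01003), (8.05, 0.0100)]},
]

at_8 = pick_settings(layers, 8.0)  # 8.05 everywhere exceeds 8.0, so one layer drops
at_9 = pick_settings(layers, 9.0)  # budget is never exceeded, every layer stays at max
print(at_8, at_9)
```

In this toy setup the layer that gets downgraded at a target of 8.0 is decided by error differences in the fifth decimal place, so measurement noise of that size would be enough to change which layer is picked from run to run.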
Problem
Measurement still runs when the target is set to 8.0 bpw.
Solution
Skip the measurement, or generate a dummy measurement file.
Alternatives
No response
Explanation
What's the point of measurement if you're using 8.0 on all layers anyway? Or is there some ignored/acceptable loss threshold that causes a lower bpw like 5~6 to be used even when 8 is set?
Examples
No response
Additional context
No response