[BUG] Fix bug with f16 overflow #482
Conversation
Maybe it's added accidentally. There is no reason to quantize
This is true in general. But in the case of quantization, we can make sure that the quantized floating-point numbers are in the representable range of the integer data type.
We can remove the "f16 quantized to int16" test case. The PR also looks good.
Yes, I think that's what we should do.
@yaoyaoding
For some reason it works in the Tests workflow but fails on Publish. So, in code just
It seems different versions of CUDA are used there. What is the right way to convert to int8?
In the new version of the `transformers` library (version 4.45.0) used in CI, the `merges` field in the configuration has changed to a list of lists. To ensure compatibility with this update, I have modified our code base to accommodate this change. Without this adjustment, the `test_tokenizer` in CI would fail to execute successfully. This update ensures that the tests run as expected with the new library version.
I encountered this problem multiple times. The reason is that C++ finds multiple candidate "middle types" when converting bfloat16 to int8, since there is no direct conversion from bf16 to int8. To address the issue we can explicitly specify a middle type like
Actually I fixed it by updating the test image. With the new one there is no such problem.
Add a preliminary conversion to `f32` for quantization. To describe the problem with an example: `f16` quantized to `int16` (it is a bit unclear why we need it, but we support and test it). `f16` cannot hold the exact value of `int16.max_value`, so we get an `f16` number that is larger than `int16.max_value`. As a result, the quantized `int16` ends up holding the minimum value instead of the maximum. In fact, converting a float to an int can have unpredictable behavior when the float exceeds the maximum value that the target int can hold.