Why does the sequence length of vectors affect the calculation results of dense under bf16? #19878
My test environment is as follows; this bug was also found on Windows 10 + torch 2.2.1 + keras 3.3.3.
I reproduced the bug in this environment:
CUDA device: RTX 3060 12G

Code:

```python
import os
os.environ['KERAS_BACKEND'] = 'torch'
os.environ['OPS_KERNAL'] = '1'

import keras
keras.config.set_floatx('bfloat16')
from keras import ops
import numpy as np

initial_dim = 2048
finally_dim = 64
z = ops.convert_to_tensor(np.random.random([1, 36, initial_dim]))
dense = keras.layers.Dense(finally_dim)
z1 = dense(z)          # full sequence (36 steps)
z2 = dense(z[:, :8])   # first 8 steps only
print(ops.isclose(z1[:, :8], z2).all())
```

Output: (screenshot omitted)
I cannot reproduce this error. One possible reason is that my graphics card is a 2080 Ti, and the tensor cores of the 2080 Ti do not support bfloat16, so the bfloat16 computation is handled by the CUDA cores instead. Platform: Windows 10. Output: (screenshot omitted)
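For context, here is a minimal sketch (not from the original thread) for checking whether the GPU in use has native bfloat16 support; on a 2080 Ti (compute capability 7.5) it is expected to report False, while a 3060 or A100 (8.x) reports True:

```python
# Minimal sketch: query whether the current CUDA device natively supports bfloat16.
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    # True on Ampere (RTX 30xx, A100) and newer, False on Turing cards such as the 2080 Ti.
    print("bf16 supported:", torch.cuda.is_bf16_supported())
    # Compute capability: 7.5 for a 2080 Ti, 8.6 for a 3060, 8.0 for an A100.
    print("compute capability:", torch.cuda.get_device_capability(0))
```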
@pass-lin Could you provide a Colab that reproduces the issue?
I don't think I can provide you with a Windows environment or one with an RTX 30 or 40 series on Colab.
@pass-lin Unfortunately, I don't think we would be able to help with debugging if we can't reproduce the issue on our side.
Can't you reproduce this bug on an A100 or on Windows?
Hi @pass-lin! I was able to reproduce the bug on an A100 with the Torch backend! However, things worked as expected with the JAX backend! Anyway, this is a bug and we'll look into it! Thanks for reporting the issue.
```python
import os
os.environ['KERAS_BACKEND'] = 'torch'
os.environ['OPS_KERNAL'] = '1'

import keras
keras.config.set_floatx('bfloat16')
from keras import ops
import numpy as np

initial_dim = 2048
finally_dim = 64
z = ops.convert_to_tensor(np.random.random([1, 36, initial_dim]))
dense = keras.layers.Dense(finally_dim)
z1 = dense(z)          # full sequence (36 steps)
z2 = dense(z[:, :8])   # first 8 steps only
print(ops.isclose(z1[:, :8], z2).all())
```
Example code is as above. In some cases, z1 and z2 above do not pass isclose, even though in theory, and certainly under fp32, they should pass isclose in every case (see the fp32 control sketch after the pass/fail list below). What is the problem, and how can it be solved?
This bug is also found with the tf and jax backends, but not with the numpy backend.
Pass cases: initial_dim = 2048, finally_dim = 2048; initial_dim = 2048, finally_dim = 4096; initial_dim = 1024, finally_dim = 2048.
Fail cases: initial_dim = 2048, finally_dim = 64; initial_dim = 2048, finally_dim = 1024; initial_dim = 1024, finally_dim = 2047.
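For reference, a minimal sketch of the fp32 control experiment implied above, assuming the same Keras 3 / torch-backend setup as the repro (this exact snippet is not from the original thread): with the default float32 policy, the sliced and full-sequence results are expected to match.

```python
# fp32 control: same Dense layer, default float32 policy instead of bfloat16.
import os
os.environ['KERAS_BACKEND'] = 'torch'

import numpy as np
import keras
from keras import ops

keras.config.set_floatx('float32')  # default policy, set explicitly for clarity

initial_dim, finally_dim = 2048, 64
z = ops.convert_to_tensor(np.random.random([1, 36, initial_dim]))
dense = keras.layers.Dense(finally_dim)

z1 = dense(z)          # full sequence (36 steps)
z2 = dense(z[:, :8])   # first 8 steps only
print(ops.isclose(z1[:, :8], z2).all())  # expected: True under fp32
```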
However, we did not find a similar issue in pure torch:
```python
import torch
import numpy as np

initial_dim = 4096
finally_dim = 32
z = torch.tensor(np.random.random([1, 36, initial_dim]), dtype=torch.bfloat16)
linear = torch.nn.Linear(initial_dim, finally_dim).bfloat16()
z1 = linear(z)          # full sequence (36 steps)
z2 = linear(z[:, :8])   # first 8 steps only
print(torch.isclose(z1[:, :8], z2).all())
```
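As a possible workaround sketch (an assumption on my part, not a fix confirmed in this thread): keep the global bfloat16 policy but run the affected Dense layer in float32 and cast its output back, so the matmul reduction happens in fp32 regardless of sequence length.

```python
# Workaround sketch: per-layer float32 override under a global bfloat16 policy.
import os
os.environ['KERAS_BACKEND'] = 'torch'

import numpy as np
import keras
from keras import ops

keras.config.set_floatx('bfloat16')

initial_dim, finally_dim = 2048, 64
z = ops.convert_to_tensor(np.random.random([1, 36, initial_dim]))

# dtype="float32" overrides the global bfloat16 policy for this layer only;
# inputs are autocast to float32 inside the layer.
dense = keras.layers.Dense(finally_dim, dtype="float32")

z1 = ops.cast(dense(z), "bfloat16")          # full sequence, cast back to bf16
z2 = ops.cast(dense(z[:, :8]), "bfloat16")   # first 8 steps, cast back to bf16
print(ops.isclose(z1[:, :8], z2).all())      # expected: True, reduction runs in fp32
```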