LSTM - different outputs for same weights across CPU and GPU, when using float32 datatype. Using float64 resolves it. #19254

Closed
lbortolotti opened this issue Mar 4, 2024 · 10 comments


@lbortolotti

Hi. I have an LSTM-based model for which I've noticed quite large discrepancies between CPU and GPU predictions at inference time. The model was trained with float32 precision throughout. Enabling float64 at inference time resolves the CPU/GPU discrepancy.

I have prepared a script that reproduces the issue, on a slightly smaller scale: my trained model seems to amplify the effect, while the attached gist demonstrates it on a freshly initialised model.

You can find the gist here.

I'm on TensorFlow + Keras 2.14.0, running in a Docker image based on nvidia/cuda:11.8.0-devel-ubuntu22.04.

This is the error between GPU and CPU predictions for what should be the same model:
[plot: absolute error between GPU and CPU predictions]

This is the histogram of "layer activity" (cpu_vs_gpu_intermediate_layer_2.html), zoomed in; the two instances of the model are clearly different:
[histogram: intermediate layer activity, CPU vs GPU]

If I uncomment the line
keras.backend.set_floatx('float64')

CPU and GPU produce the exact same output, and the "activity histograms" become identical.
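
For context, the comparison in the gist boils down to something like the following (a condensed sketch; the layer sizes, sequence length and batch size here are illustrative, not the exact values from the gist):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Uncommenting the next line makes CPU and GPU agree exactly:
# keras.backend.set_floatx('float64')

model = keras.Sequential([
    keras.layers.LSTM(256, return_sequences=True, input_shape=(100, 8)),
    keras.layers.Dense(1),
])

x = np.random.rand(32, 100, 8).astype(keras.backend.floatx())

with tf.device('/CPU:0'):
    y_cpu = model(x).numpy()
with tf.device('/GPU:0'):
    y_gpu = model(x).numpy()

print('max abs diff:', np.abs(y_cpu - y_gpu).max())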

Why would CPU and GPU "diverge", and why only in float32? I'm aware that LSTM has a CUDA-optimized implementation; is it a known issue that it can behave differently from the CPU implementation?

Thanks

@SuryanarayanaY (Contributor) commented Mar 5, 2024

Hi @lbortolotti ,

Some dependencies seem to be missing for reproducing the issue. Could you please look into the gist and advise?

There will be some precision difference between CPU and GPU, and also from GPU to GPU. Enabling float64 will almost nullify these precision differences. Also, could you compare the difference as a percentage of the outputs, rather than as an absolute difference, and let us know the range?
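
For the relative comparison, something along these lines would give the percentage figure (a minimal sketch; y_cpu and y_gpu stand for the two prediction arrays):

import numpy as np

abs_diff = np.abs(y_cpu - y_gpu)
rel_diff = abs_diff / (np.abs(y_cpu) + 1e-12)  # small epsilon to avoid division by zero

print('max abs diff     :', abs_diff.max())
print('max rel diff (%) :', 100 * rel_diff.max())
print('mean rel diff (%):', 100 * rel_diff.mean())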

@lbortolotti (Author)

Hi @SuryanarayanaY .

I've fixed the code (it wasn't actually missing dependencies; it was just a __file__ reference that turns out not to work in a notebook context).

While I was at it, I increased the model size a little, which amplified the numerical differences substantially on my Linux box.

You can find a cleaned-up notebook here.

However, when running the notebook, I have seen that the issue does not reproduce on Colab.

This is the plot of the predictions of a randomly initialised model on Colab:
[plot: Colab predictions]

The exact same code from Colab, run on my Linux box with an NVIDIA GPU, results in the following plot:
[plot: on-prem predictions]

I'm now running TF 2.15 on my Linux box, so the TF version is the same as the one running on Colab.

At this point I'd be inclined to think there's something amiss on the CUDA/NVIDIA side of things.

On-prem I am running CUDA 12.2 and cuDNN 8.9.4 on an A100, with the following CUDA package versions (as pulled in by tensorflow[and-cuda]):

nvidia-cublas-cu12==12.2.5.6
nvidia-cuda-cupti-cu12==12.2.142
nvidia-cuda-nvcc-cu12==12.2.140
nvidia-cuda-nvrtc-cu12==12.2.140
nvidia-cuda-runtime-cu12==12.2.140
nvidia-cudnn-cu12==8.9.4.25
nvidia-cufft-cu12==11.0.8.103
nvidia-curand-cu12==10.3.3.141
nvidia-cusolver-cu12==11.5.2.141
nvidia-cusparse-cu12==12.1.2.141
nvidia-nccl-cu12==2.16.5
nvidia-nvjitlink-cu12==12.2.140

Any ideas? Ideally I'd like to set up my own "vanilla" Colab environment and install tensorflow[and-cuda] myself, to try the same stack there as well. However, I've already run out of my GPU usage quota, so I haven't managed to try it yet.
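
For completeness, one way to double-check what TensorFlow itself reports (a small sketch using standard TF APIs; the CUDA/cuDNN keys are only present in GPU builds):

import tensorflow as tf

build = tf.sysconfig.get_build_info()
print('TF version   :', tf.__version__)
print('CUDA (build) :', build.get('cuda_version'))
print('cuDNN (build):', build.get('cudnn_version'))

for gpu in tf.config.list_physical_devices('GPU'):
    # details include device_name and compute_capability, e.g. (8, 0) for an A100
    print(tf.config.experimental.get_device_details(gpu))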

@SuryanarayanaY (Contributor)

Hi @lbortolotti ,

I have replicated the reported error with Keras 2 and attached the gist here.

Could you please confirm whether this is an issue with Keras 3? If it is not an issue with Keras 3, then it needs to be reported at the tf_keras repo. Thanks!

@lbortolotti (Author)


The linked gist does not seem to replicate the error? The difference in output between CPU and GPU inference there is negligible, much smaller than what I am seeing on-prem.

on-prem: [plot showing a large CPU/GPU discrepancy]

gist: [plot showing a negligible discrepancy]

I'd be very happy if the gist reproduced the issue, but it doesn't seem to? Or am I missing something?

Thanks!

@SuryanarayanaY (Contributor)

Hi @lbortolotti ,

Yes, I noticed just now that the plot for the CPU is missing in the gist. However, the CPU-GPU diff value does show a difference, which I have seen and confirmed.

The code is not compatible with Keras 3. If the issue is reproducible with Keras 3, please report it with the modified code snippet. Otherwise, please report it at the tf_keras repo. Thanks!

@lbortolotti (Author)

Hi @SuryanarayanaY ,

The CPU-GPU difference that you also see in the gist is much, much smaller than what I see on-prem.

I've done a few more tests on-prem. I've switched to TF 2.16.1 (using the official TensorFlow image: tensorflow/tensorflow:2.16.1-gpu-jupyter).

I've found that:

  1. The issue disappears with Keras 3.
  2. The issue still occurs with tf-keras 2.16.0.
  3. I've gotten hold of a system with an NVIDIA V100. On this system, the issue does not reproduce (running in the same container).

So the issue seems to occur specifically with 1) tf-keras on 2) an NVIDIA A100. All other combinations show very small discrepancies between CPU and GPU that I would say are negligible (including the gist you shared above).

How would you recommend I proceed? I can open an issue on tf-keras, but the fact that there is clearly some interaction with the CUDA side of things concerns me somewhat.
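
One thing I could try (purely an idea at this point, not something I've verified): the Keras LSTM docs say the fused cuDNN kernel is only used with the default arguments (tanh activation, sigmoid recurrent activation, recurrent_dropout=0, unroll=False, use_bias=True). Building the same layer with unroll=True forces the generic kernel while keeping the maths identical, which would show whether the discrepancy tracks the cuDNN path:

from tensorflow import keras  # or the tf_keras package, whichever backend is being tested

# Default arguments -> eligible for the fused cuDNN kernel on GPU:
lstm_cudnn = keras.layers.LSTM(256, return_sequences=True)

# unroll=True falls outside the documented cuDNN conditions, so the generic
# implementation is used even on GPU (layer size is illustrative):
lstm_generic = keras.layers.LSTM(256, return_sequences=True, unroll=True)

If only the cuDNN-eligible variant diverges from the CPU on the A100, that would point at the fused kernel rather than the surrounding tf-keras code.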

Thanks

@SuryanarayanaY (Contributor) commented Apr 10, 2024

  • The issue disappears with keras3
  • The issue still occurs with tf-keras 2.16.0

Hi @lbortolotti ,

Please note that if the issue is with the tf-keras package, then it lies in the Keras 2 code, which needs to be addressed in the tf-keras repo.

If you have switched to TF 2.16, then by default tf.keras points to Keras 3, and if TF 2.16 resolves your issue, we can probably mark it resolved here. If the issue is not reproducible with Keras 3, then please report it at tf-keras.
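
A quick way to confirm which Keras a given environment actually resolves to (a small sketch; the tf_keras import applies only if that package is installed):

import tensorflow as tf
print(tf.__version__)        # e.g. 2.16.1
print(tf.keras.__version__)  # 3.x on TF 2.16+, where tf.keras is Keras 3

import tf_keras              # the legacy Keras 2 codebase, if installed
print(tf_keras.__version__)  # e.g. 2.16.0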

@lbortolotti (Author)

OK, I'll reopen this in tf-keras and hope we can get to the bottom of it there. Thanks!


@lbortolotti (Author)

FYI, new issue here: keras-team/tf-keras#772
