LSTM - different outputs for same weights across CPU and GPU, when using float32 datatype. Using float64 resolves it. #19254

Closed
lbortolotti opened this issue Mar 4, 2024 · 10 comments


@lbortolotti

Hi. I have an LSTM-based model for which I've noticed quite large discrepancies between CPU and GPU predictions at inference time. The model was trained with float32 precision throughout. Enabling float64 at inference time resolves the CPU/GPU discrepancy.

I have prepared a script that reproduces the issue, on a slightly smaller scale: my trained model seems to amplify the effect, while the attached gist demonstrates it on a freshly initialised model.

You can find the gist here.

I'm on TensorFlow + Keras 2.14.0, running in a Docker image based on nvidia/cuda:11.8.0-devel-ubuntu22.04.

This is the error between GPU and CPU predictions for what should be the same model:
[plot: absolute error between GPU and CPU predictions]

This is the histogram of "layer activity" (cpu_vs_gpu_intermediate_layer_2.html), zoomed in; the two instances of the model are clearly different:
[histogram: intermediate layer activity, CPU vs GPU]

If I uncomment the line
keras.backend.set_floatx('float64')

CPU and GPU produce the exact same output, and the "activity histograms" become identical.
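
For context, the comparison in the gist boils down to something like the following (a condensed sketch; the layer sizes, sequence length and batch size here are illustrative, not the exact values from the gist):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Uncommenting the next line makes CPU and GPU agree exactly:
# keras.backend.set_floatx('float64')

model = keras.Sequential([
    keras.layers.LSTM(256, return_sequences=True, input_shape=(100, 8)),
    keras.layers.Dense(1),
])

x = np.random.rand(32, 100, 8).astype(keras.backend.floatx())

with tf.device('/CPU:0'):
    y_cpu = model(x).numpy()
with tf.device('/GPU:0'):
    y_gpu = model(x).numpy()

print('max abs diff:', np.abs(y_cpu - y_gpu).max())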

Why would CPU and GPU "diverge", and why only in float32? I'm aware that LSTM has a CUDA-optimized implementation; is it a known issue that it can behave differently from the CPU implementation?

Thanks

@SuryanarayanaY (Contributor) commented Mar 5, 2024

Hi @lbortolotti ,

Some dependencies seem to be missing for reproducing the issue. Could you please look into the gist and advise?

There will be some precision difference between CPU and GPU, and also from GPU to GPU. Enabling float64 will almost nullify these precision differences. Also, could you compare the difference as a percentage of the outputs, rather than as an absolute difference, and let us know the range?
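
For the relative comparison, something along these lines would give the percentage figure (a minimal sketch; y_cpu and y_gpu stand for the two prediction arrays):

import numpy as np

abs_diff = np.abs(y_cpu - y_gpu)
rel_diff = abs_diff / (np.abs(y_cpu) + 1e-12)  # small epsilon to avoid division by zero

print('max abs diff     :', abs_diff.max())
print('max rel diff (%) :', 100 * rel_diff.max())
print('mean rel diff (%):', 100 * rel_diff.mean())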

@lbortolotti (Author)

Hi @SuryanarayanaY .

I've fixed the code (it wasn't actually missing dependencies; it was just a __file__ reference that turns out not to work in a notebook context).

While I was at it, I increased the model size a little, which amplified the numerical differences substantially on my Linux box.

You can find a cleaned-up notebook here.

However, when running the notebook, I have seen that the issue does not reproduce on Colab.

This is the plot of the predictions of a randomly initialised model on Colab:
[plot: Colab predictions]

The exact same code from Colab, run on my Linux box with an NVIDIA GPU, results in the following plot:
[plot: on-prem predictions]

I'm now running TF 2.15 on my Linux box, so the TF version is the same as the one running on Colab.

At this point I'd be inclined to think there's something amiss on the CUDA/NVIDIA side of things.

On-prem I am running CUDA 12.2 and cuDNN 8.9.4 on an A100, with the following CUDA package versions (as pulled in by tensorflow[and-cuda]):

nvidia-cublas-cu12==12.2.5.6
nvidia-cuda-cupti-cu12==12.2.142
nvidia-cuda-nvcc-cu12==12.2.140
nvidia-cuda-nvrtc-cu12==12.2.140
nvidia-cuda-runtime-cu12==12.2.140
nvidia-cudnn-cu12==8.9.4.25
nvidia-cufft-cu12==11.0.8.103
nvidia-curand-cu12==10.3.3.141
nvidia-cusolver-cu12==11.5.2.141
nvidia-cusparse-cu12==12.1.2.141
nvidia-nccl-cu12==2.16.5
nvidia-nvjitlink-cu12==12.2.140

Any ideas? Ideally I'd like to set up my own "vanilla" Colab environment and install tensorflow[and-cuda] myself, to try the same stack there as well. However, I've already run out of my GPU usage quota, so I haven't managed to try it yet.
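
For completeness, one way to double-check what TensorFlow itself reports (a small sketch using standard TF APIs; the CUDA/cuDNN keys are only present in GPU builds):

import tensorflow as tf

build = tf.sysconfig.get_build_info()
print('TF version   :', tf.__version__)
print('CUDA (build) :', build.get('cuda_version'))
print('cuDNN (build):', build.get('cudnn_version'))

for gpu in tf.config.list_physical_devices('GPU'):
    # details include device_name and compute_capability, e.g. (8, 0) for an A100
    print(tf.config.experimental.get_device_details(gpu))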

@SuryanarayanaY (Contributor)

Hi @lbortolotti ,

I have replicated the reported error with Keras 2 and attached the gist here.

Could you please confirm whether this is an issue with Keras 3? If it is not an issue with Keras 3, then it needs to be reported at the tf_keras repo. Thanks!

@lbortolotti (Author)


The linked gist does not seem to replicate the error? The difference in output between CPU and GPU inference there is negligible, much smaller than what I am seeing on-prem.

on-prem: [plot showing a large CPU/GPU discrepancy]

gist: [plot showing a negligible discrepancy]

I'd be very happy if the gist reproduced the issue, but it doesn't seem to? Or am I missing something?

Thanks!

@SuryanarayanaY (Contributor)

Hi @lbortolotti ,

Yes, I noticed just now that the plot for the CPU is missing in the gist. However, the CPU-GPU diff value does show a difference, which I have seen and confirmed.

The code is not compatible with Keras 3. If the issue is reproducible with Keras 3, please report it with the modified code snippet. Otherwise, please report it at the tf_keras repo. Thanks!

@lbortolotti (Author)

Hi @SuryanarayanaY ,

The CPU-GPU difference that you also see in the gist is much, much smaller than what I see on-prem.

I've done a few more tests on-prem. I've switched to TF 2.16.1 (using the official TensorFlow image: tensorflow/tensorflow:2.16.1-gpu-jupyter).

I've found that:

  1. The issue disappears with Keras 3.
  2. The issue still occurs with tf-keras 2.16.0.
  3. I've gotten hold of a system with an NVIDIA V100. On this system, the issue does not reproduce (running in the same container).

So the issue seems to occur specifically with 1) tf-keras on 2) an NVIDIA A100. All other combinations show very small discrepancies between CPU and GPU that I would say are negligible (including the gist you shared above).

How would you recommend I proceed? I can open an issue on tf-keras, but the fact that there is clearly some interaction with the CUDA side of things concerns me somewhat.
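
One thing I could try (purely an idea at this point, not something I've verified): the Keras LSTM docs say the fused cuDNN kernel is only used with the default arguments (tanh activation, sigmoid recurrent activation, recurrent_dropout=0, unroll=False, use_bias=True). Building the same layer with unroll=True forces the generic kernel while keeping the maths identical, which would show whether the discrepancy tracks the cuDNN path:

from tensorflow import keras  # or the tf_keras package, whichever backend is being tested

# Default arguments -> eligible for the fused cuDNN kernel on GPU:
lstm_cudnn = keras.layers.LSTM(256, return_sequences=True)

# unroll=True falls outside the documented cuDNN conditions, so the generic
# implementation is used even on GPU (layer size is illustrative):
lstm_generic = keras.layers.LSTM(256, return_sequences=True, unroll=True)

If only the cuDNN-eligible variant diverges from the CPU on the A100, that would point at the fused kernel rather than the surrounding tf-keras code.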

Thanks

@SuryanarayanaY (Contributor) commented Apr 10, 2024

  • The issue disappears with keras3
  • The issue still occurs with tf-keras 2.16.0

Hi @lbortolotti ,

Please note that if the issue is with the tf-keras package, then it lies in the Keras 2 code, which needs to be addressed in the tf-keras repo.

If you have switched to TF 2.16, then by default tf.keras points to Keras 3, and if TF 2.16 resolves your issue, we can probably mark it resolved here. If the issue is not reproducible with Keras 3, then please report it at tf-keras.
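
A quick way to confirm which Keras a given environment actually resolves to (a small sketch; the tf_keras import applies only if that package is installed):

import tensorflow as tf
print(tf.__version__)        # e.g. 2.16.1
print(tf.keras.__version__)  # 3.x on TF 2.16+, where tf.keras is Keras 3

import tf_keras              # the legacy Keras 2 codebase, if installed
print(tf_keras.__version__)  # e.g. 2.16.0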

@lbortolotti (Author)

OK, I'll reopen this in tf-keras and hope we can get to the bottom of it there. Thanks!


@lbortolotti (Author)

FYI, new issue here: keras-team/tf-keras#772
