
"NaN or Inf found in input tensor" #6

Open
EnriqueAlbalate opened this issue Apr 24, 2024 · 6 comments
@EnriqueAlbalate

While executing the code with the three UNet approaches proposed in this repository, the model returns fully-NaN output tensors, which leads to the message in the title ("NaN or Inf found in input tensor") when the program tries to compute the loss.
I think the problem could be related to Automatic Mixed Precision (AMP), which is currently set to 16-bit, but I'm not sure.
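For context, the kind of overflow that AMP can trigger is easy to reproduce outside the model. This is a minimal sketch (using NumPy, which is in the environment list below) of why fp16 ("16-mixed") can turn finite activations into inf/NaN while fp32 ("32-true") handles them fine; the specific values are illustrative, not taken from the repo:

```python
import numpy as np

# float16 can only represent magnitudes up to ~65504; anything larger
# overflows to inf, and operations like inf - inf then produce NaN.
big = np.float16(30000.0)
overflowed = big * np.float16(4.0)   # 120000 > 65504, so this is inf
print(overflowed)                    # inf
print(overflowed - overflowed)       # nan (inf - inf)

# The same value is perfectly representable in float32:
print(np.float32(30000.0) * np.float32(4.0))  # 120000.0
```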

Could you help me? What can I do?
Thank you very much in advance.

@isaaccorley
Owner

I'm not able to reproduce this. What versions of torch/lightning/cuda/cudnn/etc are you using?

@EnriqueAlbalate
Author

torchgeo==0.6.0.dev0
kornia==0.7.2
lightning==2.2.2
pandas==2.2.2
tqdm==4.66.2
numpy==1.26.4
matplotlib==3.8.4
pillow==10.3.0
torch==2.1.2
segmentation_models_pytorch==0.3.3
torchmetrics==1.2.0
torchvision==0.16.2
image_bbox_slicer==0.4
einops==0.7.0
timm==0.9.2

My CUDA version is 12.2.

I have been able to run the code successfully by changing the precision to "32-true" instead of "16-mixed" (I read that "16-mixed" was the default precision in the Trainer script). I don't know whether this could affect the results.
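For anyone hitting the same thing, this is the change I mean, as a hedged sketch rather than the repo's actual train script: only the `precision` argument comes from this thread, and the other Trainer kwargs are placeholder assumptions:

```python
import lightning as L

# Force full fp32 instead of mixed precision. With "16-mixed", intermediate
# values were overflowing to inf and the loss came out as NaN.
trainer = L.Trainer(
    precision="32-true",  # was "16-mixed"
    max_epochs=10,        # illustrative; use the script's own settings
)
```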

Also, I have checked that the process only needs about 800 MB even after increasing the batch size, so I think the code just trains on images one by one. Can you confirm that?

Thanks for your response

@isaaccorley
Owner

It might be that you're using CUDA 12.2, while I'm using 11.8. Make sure you install a PyTorch build that matches CUDA 12.2; there are instructions for this on the PyTorch website. The batch size defaults to 8, and the train script has a --batch_size argument you can adjust.
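A quick way to check what your installed wheel was built against (a sketch using standard `torch` attributes, not anything repo-specific):

```python
import torch

# torch.version.cuda is the CUDA version the wheel was compiled against
# (None for CPU-only builds); compare it with your system's driver version.
print(torch.__version__)       # e.g. 2.1.2
print(torch.version.cuda)      # e.g. '11.8' or '12.1'; None if CPU-only
if torch.cuda.is_available():
    print(torch.backends.cudnn.version())
```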

@EnriqueAlbalate
Author

EnriqueAlbalate commented Apr 24, 2024 via email

@isaaccorley
Owner

I'm not able to reproduce this either. Where are you printing the batch?

@EnriqueAlbalate
Author

EnriqueAlbalate commented Apr 24, 2024 via email
