
"NaN or Inf found in input tensor" #6

Open
EnriqueAlbalate opened this issue Apr 24, 2024 · 6 comments
@EnriqueAlbalate

While executing the code with the three UNet approaches proposed in this repository, the model returns fully-NaN output tensors, which leads to the message in the title ("NaN or Inf found in input tensor") when the program tries to compute the loss.
I think the problem could be related to Automatic Mixed Precision (AMP), which is currently set to 16-bit, but I'm not sure.
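For context, the kind of overflow that AMP can trigger is easy to reproduce outside the model. This is a minimal sketch (using NumPy, which is in the environment list below) of why fp16 ("16-mixed") can turn finite activations into inf/NaN while fp32 ("32-true") handles them fine; the specific values are illustrative, not taken from the repo:

```python
import numpy as np

# float16 can only represent magnitudes up to ~65504; anything larger
# overflows to inf, and operations like inf - inf then produce NaN.
big = np.float16(30000.0)
overflowed = big * np.float16(4.0)   # 120000 > 65504, so this is inf
print(overflowed)                    # inf
print(overflowed - overflowed)       # nan (inf - inf)

# The same value is perfectly representable in float32:
print(np.float32(30000.0) * np.float32(4.0))  # 120000.0
```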

Could you help me? What can I do?
Thank you very much in advance.

@isaaccorley
Owner

I'm not able to reproduce this. What versions of torch/lightning/cuda/cudnn/etc are you using?

@EnriqueAlbalate
Author

torchgeo==0.6.0.dev0
kornia==0.7.2
lightning==2.2.2
pandas==2.2.2
tqdm==4.66.2
numpy==1.26.4
matplotlib==3.8.4
pillow==10.3.0
torch==2.1.2
segmentation_models_pytorch==0.3.3
torchmetrics==1.2.0
torchvision==0.16.2
image_bbox_slicer==0.4
einops==0.7.0
timm==0.9.2

My CUDA version is 12.2.

I have been able to run the code successfully by changing the precision to "32-true" instead of "16-mixed" (I read that "16-mixed" was the default precision in the Trainer script). I don't know whether this could affect the results.
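For anyone hitting the same thing, this is the change I mean, as a hedged sketch rather than the repo's actual train script: only the `precision` argument comes from this thread, and the other Trainer kwargs are placeholder assumptions:

```python
import lightning as L

# Force full fp32 instead of mixed precision. With "16-mixed", intermediate
# values were overflowing to inf and the loss came out as NaN.
trainer = L.Trainer(
    precision="32-true",  # was "16-mixed"
    max_epochs=10,        # illustrative; use the script's own settings
)
```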

Also, I have checked that the process only needs about 800 MB even after increasing the batch size, so I think the code just trains on images one by one. Can you confirm that?

Thanks for your response

@isaaccorley
Owner

It might be that you're using CUDA 12.2, while I'm using 11.8. Make sure you install a PyTorch build that matches CUDA 12.2; there are instructions for this on the PyTorch website. The batch size defaults to 8, and the train script has a --batch_size argument you can adjust.
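A quick way to check what your installed wheel was built against (a sketch using standard `torch` attributes, not anything repo-specific):

```python
import torch

# torch.version.cuda is the CUDA version the wheel was compiled against
# (None for CPU-only builds); compare it with your system's driver version.
print(torch.__version__)       # e.g. 2.1.2
print(torch.version.cuda)      # e.g. '11.8' or '12.1'; None if CPU-only
if torch.cuda.is_available():
    print(torch.backends.cudnn.version())
```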

@EnriqueAlbalate
Author

EnriqueAlbalate commented Apr 24, 2024 via email

@isaaccorley
Owner

I'm not able to reproduce this either. Where are you printing the batch?

@EnriqueAlbalate
Author

EnriqueAlbalate commented Apr 24, 2024 via email
