"NaN or Inf found in input tensor" #6
Comments
I'm not able to reproduce this. What versions of torch/lightning/cuda/cudnn/etc are you using? |
torchgeo==0.6.0.dev0. My CUDA version is 12.2. I was able to run the code successfully by changing the precision to "32-true" instead of "16-mixed" (I read that this was the default precision in the Trainer script). I don't know whether this could affect the results. I have also checked that the process only needs 800 MB even when I increase the batch size, so I think the code just trains on the images one by one. Can you confirm that? Thanks for your response. |
Might be that you're using CUDA 12.2. I'm using 11.8. Make sure you install PyTorch that's built with 12.2. They have some instructions on their website for this. The batch size defaults to 8. The train script has a --batch_size argument you can adjust. |
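To compare environments as asked above, a small report script can help. The following is only a sketch: the helper name `collect_versions` and the package list are assumptions made for illustration, not part of this repository. It reads installed versions through `importlib.metadata` and, if PyTorch can be imported, the CUDA and cuDNN builds that PyTorch itself reports.

```python
import importlib.metadata


def collect_versions(packages=("torch", "lightning", "torchgeo")):
    """Return installed package versions plus the CUDA/cuDNN builds torch reports.

    Hypothetical helper for illustration; missing packages are reported as None.
    """
    report = {}
    for name in packages:
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment

    # Defaults used when PyTorch cannot be imported.
    report["torch_cuda_build"] = None
    report["cudnn"] = None
    report["cuda_available"] = None
    try:
        import torch
    except ImportError:
        return report

    report["torch_cuda_build"] = torch.version.cuda      # CUDA version torch was built with
    report["cudnn"] = torch.backends.cudnn.version()     # None if cuDNN is unavailable
    report["cuda_available"] = torch.cuda.is_available()
    return report


report = collect_versions()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `torch_cuda_build` differs from the CUDA version installed on the machine, that mismatch is worth mentioning when reporting the problem.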
Although I changed the batch size with the command-line parameter you mentioned,
it seems that the model still processes the 7120 training images one by one
(I'm using Levir-CD as my dataset in this case). I have printed the batch
tensor shape and its batch dimension is 1.
However, I will try to match all the versions if I find that the results
have been affected by changing the Trainer precision parameter.
On Wed, 24 Apr 2024 at 14:25, Isaac Corley ***@***.***>
wrote:
… Might be that you're using CUDA 12.2. I'm using 11.8. Make sure you
install PyTorch that's built with 12.2. They have some instructions on
their website for this. The batch size defaults to 8. The train script has
a --batch_size argument you can adjust.
|
I'm not able to reproduce this either. Where are you printing the batch? |
Inside the training_step() method in the change_detection.py script.
|
When I run the code with any of the three U-Net approaches proposed in this repository, the model returns all-NaN tensors as output, which leads to the message in the title, "NaN or Inf found in input tensor", when the program tries to calculate the loss.
I think the problem could be related to Automatic Mixed Precision (AMP), which is currently set to 16-bit, but I'm not sure.
Could you help me? What can I do?
Thank you very much in advance.
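For context on why 16-bit precision is a plausible suspect, here is a minimal sketch of how a half-precision overflow can turn into NaN. It does not use this repository's code or PyTorch; it emulates float16 rounding with the standard library, and the helper `to_fp16` and the sample value 300.0 are assumptions chosen for illustration.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest float16 value.

    Hypothetical helper: finite values outside the float16 range become +/-inf,
    mimicking what happens to a tensor element stored in half precision.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values that are too large for float16
        return math.copysign(math.inf, x)


activation = to_fp16(300.0)                  # exactly representable in float16
squared = to_fp16(activation * activation)   # 90000 exceeds FP16_MAX, so it becomes inf
difference = squared - squared               # inf - inf is NaN

print(activation, squared, difference)
print("finite:", math.isfinite(difference))
```

Once a single inf or NaN appears in an intermediate result, it propagates through later operations, which would be consistent with an all-NaN output and with the run succeeding at "32-true". This is only an illustration of the mechanism, not a diagnosis of this model.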