Replies: 2 comments 1 reply
Thanks for your question @erayinanc, pinging @coquelin77 as the most qualified to answer here.
NaNs in training can come from any number of things. If it were related to the data, it would most likely surface as an error rather than as NaNs. The most common culprit is a learning rate that is too large, which can cause the loss value to explode towards NaN.
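To illustrate the learning-rate point, here is a tiny stdlib-only toy example (not the actual training setup from this thread): plain gradient descent on f(x) = x² converges for a small step size but diverges once the step size is too large, which is exactly the mechanism that drives a loss towards inf/NaN.

```python
import math

def gradient_descent(lr, steps=60, x0=1.0):
    """Minimize f(x) = x**2 with fixed-step gradient descent.

    The gradient is f'(x) = 2x, so each update is x <- x * (1 - 2*lr):
    the iterates shrink when 0 < lr < 1 and blow up otherwise.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
        if math.isinf(x) or math.isnan(x):
            break
    return x

print(gradient_descent(0.3))  # small step: converges towards 0
print(gradient_descent(1.5))  # oversized step: |x| doubles every iteration
```

The same divergence in a real network shows up as a loss that grows over a few steps and then becomes NaN, rather than as an immediate data-loading error.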
For training on an HPC system with multiple GPUs per node, I am trying to load a folder of .hdf5 files using torchvision. In plain PyTorch I can use `torchvision.datasets.DatasetFolder`, then define a sampler with `torch.utils.data.distributed.DistributedSampler` for the data loader. In HeAT, however, I was not able to find a DatasetFolder class or anything similar, so I tried to combine torchvision with HeAT directly. This approach works if each node has 1 GPU, but fails with more than that -- probably due to the torchvision part -- and I get NaNs in training.
So my question is: how should one do this properly in HeAT?
All the best,
Eray
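For context, `torch.utils.data.distributed.DistributedSampler` essentially pads the index list so every rank gets the same number of samples and then deals indices out round-robin by rank. A stdlib-only sketch of that partitioning logic (a simplified illustration, omitting the real class's shuffling and `drop_last` handling):

```python
import math

def partition_indices(dataset_len, num_replicas, rank):
    """Mimic the per-rank index partitioning of a distributed sampler.

    Indices are padded by repetition so every rank receives the same
    number of samples, then rank r takes indices r, r + R, r + 2R, ...
    where R is the number of replicas (processes/GPUs).
    """
    indices = list(range(dataset_len))
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices += indices[: total_size - len(indices)]  # pad by repeating
    return indices[rank:total_size:num_replicas]

# 10 samples over 4 ranks: each rank gets 3 indices, and together the
# ranks cover every sample (two samples appear twice due to padding).
parts = [partition_indices(10, 4, r) for r in range(4)]
print(parts)
```

Each training process constructs this with its own `rank`, so every GPU sees a disjoint (up to padding) slice of the dataset per epoch.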