"NaN" Training: git repo error? #365

Open
PhillipsML opened this issue Aug 6, 2021 · 19 comments
@PhillipsML

Hi -
I'm trying to train in APT and running into a weird issue: the training appeared to be running with no errors in the log, but the loss graphs at the top had no data points. Upon further review, it appears the values are NaN, which could explain the blank graphs. Working my way up in the log files, I found a "fatal: not a git repository" message, which leads to a "stopping at filesystem boundary". I'm not sure if this would cause the NaN issue, and I have spent some time verifying and re-verifying that the APT folder is indeed recognized by MATLAB as a git repository.
I'm on Ubuntu 20.04, MATLAB R2021a, with the Docker backend; it passes all APT tests for GPU access.
I would appreciate any insight you might have into this issue; really looking forward to getting APT up and running! I've attached the log files; as I mentioned, there were no error files.
Best-
Mary

LogCodes_NaN.odt

@allenleetc
Collaborator

Hey Mary,
Good to hear from you! Hmm, that is strange. Yes, we noticed the git repository message recently -- I'm guessing it is not the cause of your issue, though, as your optimization does start and run. This message looked a little concerning:

RuntimeWarning: Mean of empty slice
    label_mean = np.nanmean(val_dist)
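For reference, that warning is NumPy's way of saying that every value in the slice being averaged was NaN (or the slice was empty), so the reported mean itself comes out as NaN. A minimal repro (the all-NaN array here is just a stand-in for whatever val_dist actually held):

```python
import numpy as np

# Minimal repro of the warning above: when every entry is NaN (or the slice
# is empty), nanmean has nothing left to average and returns NaN.
val_dist = np.full(10, np.nan)      # stand-in for the real validation distances
label_mean = np.nanmean(val_dist)   # RuntimeWarning: Mean of empty slice
print(label_mean)                   # nan
```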

I'm guessing you are using the MDN network? Are you able to share your project? I think just the .lbl file (without any movies) might be helpful. Mayank @mkabra, any ideas?

@PhillipsML
Author

Hi Allen! Been a long time; hope all is well! I am using the MDN, but I recapitulated the same issue when I tried DLC. Further detail: the training hangs at "building training dataset" for quite a long time before progressing to training, and will continue iterating until killed. Below is the link to the .lbl:
https://maxplanckflorida-my.sharepoint.com/:u:/g/personal/phillipsm_mpfi_org/ESpOXtKt2qJCr8vJmcyhzJwB2YVg95wNgVW_AB5L18ObhA?e=fJQ2dO

@allenleetc
Collaborator

Hey Mary, your link seems to require Max Planck credentials; does that sound right? I couldn't access the file.

Also, just FYI, HHMI has a holiday this coming week, so we will be slower to respond.

@PhillipsML
Author

Ah, sorry about that. Better link below. And no worries about the holiday; I appreciate any advice you guys can give.
https://drive.google.com/file/d/1VUQaVihKRe1OBwJVQXBXQzO3Q7ZL2t_0/view?usp=sharing

@allenleetc
Collaborator

Huh, strange; so far I can't seem to reproduce the NaN losses running with the latest code on develop. Will think some more, and Mayank will likely have ideas.

Slightly off topic, but maybe relevant: your project cache suggests you have tried restarting the training (this would be via the blue "Restart" button on the Training Monitor). Did/do you press this button as part of your workflow?

@allenleetc
Collaborator

@PhillipsML OK, a bit more digging, and so far we can't reproduce this issue from the cached state in the project. For the next iteration of debugging, any of the following would be helpful:

  • We pushed a few small changes to the APT repo, so it may be useful to pull those first before proceeding.
  • Can you manually confirm the git commit (SHA) for your APT repo? You can do this in a Linux terminal with e.g. git describe while in the repo directory, followed by git status, which will detect any local changes. (Did you clone the APT repo using the Linux command line, or another tool?)
  • Is it possible to include your movies in the Google Drive so we can try training start-to-finish?
  • A cut+paste of everything printed in the MATLAB command window at the start of the train (from the time you press Train until after the Train is initiated and the Training Monitor comes up) could be useful. (This can be pretty long, but by the time a single iteration is done it should be done printing.)
  • As I mentioned above, if you ever 'Restart' your training please let us know. We are debugging as if the issue occurs on a fresh Train.

Thanks! Allen

@PhillipsML
Author

@allenleetc
I've re-pulled APT and recapitulated the error with the new updates.

  • I've tried cloning with both the Linux terminal ("git clone") and the MATLAB GitHub interface (this was when trying to get rid of the "not a git repository" error).
  • git describe: v2.0-3036-ga5329647
  • git status: Your branch is up to date with 'origin/develop'.
  • This drive has the files I've been working with: one project with one mouse and the other with 3. On the 3 mice, I've tried DLC and MDN to see if I can fix the error. With DLC, it did seem to train alright once, but it has since begun hanging in the "building training image dataset" phase. MDN has not worked. I've tried to find the most helpful log files; I've been trying a lot of different variations to get around the issue. https://drive.google.com/drive/folders/1erJNK0Et3kdYDUguh3g1YyZOhy379O9b?usp=sharing
  • I'm not sure about the differences between the DLC and MDN training workflows, but when I was combing through the .apt log files I found a DLC training image set, whereas the MDN did not get to that stage.
  • I have restarted the training (in the OneMouse project) when the log file failed to update for more than 5 minutes and the text suggested stopping and restarting. I have not restarted in some instances and still run into the same error.
  • The MATLAB command window output is saved in the 3-mouse folder.

I had APT working well on an older Linux machine, but since switching to our new data-processing computer I've had these issues... So excited to take our new GPU for a ride, though! I really appreciate your help with this. I'll keep trying on my end to get around this; let me know if any additional files or information would be helpful for you.

Best - Mary

@mkabra
Collaborator

mkabra commented Aug 13, 2021 via email

@allenleetc
Collaborator

allenleetc commented Aug 13, 2021

I wonder if GPU memory could be a factor. Mary, have you ever tried reducing your 'Training batch size'? When setting Tracking Parameters, this is under DeepTrack > GradientDescent > Training batch size.

Just had an experience that may be relevant. I opened your OneMouse project and removed movies 2-5, as those were not in the Google Drive (this should be fine for testing). On my first train of MDN (on Docker/Ubuntu), I left all parameters unchanged. My GPU is an RTX 2080 Ti, which has 11 GB of GPU RAM. The train hung up on the "Building training database" stage, but without throwing any out-of-memory errors or the like; everything just hung.

Then I ran the train directly from the command line, still with batch size = 8. This time, I got out-of-memory errors! Not sure why it threw errors this time and not before.

Finally, I reduced the batch size to 2 within the APT GUI, and this time the train ran successfully. The Tracking Parameters resource estimator suggests that GPU memory could be an issue (for my card) with the batch size set to 8: it estimates a GPU memory requirement of ~19 GB, which would require a pretty chunky GPU card.

Another option to reduce the GPU memory requirement would be to increase the ImageProcessing > Downsample factor. Just some ideas that may be worth trying if a resource constraint is at play.
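For a rough sense of how much those two knobs buy (back-of-envelope only, not APT's actual resource estimator; it just assumes memory scales roughly linearly with batch size and with image area, anchored to the ~19 GB estimate at batch size 8 mentioned above):

```python
# Back-of-envelope only: NOT APT's built-in estimator. Assumes GPU memory
# scales ~linearly with batch size and with image area (so inversely with
# the square of the downsample factor).
def rough_gpu_mem_gb(batch_size, downsample, base_gb=19.0, base_batch=8):
    return base_gb * (batch_size / base_batch) / downsample**2

print(rough_gpu_mem_gb(8, 1))  # ~19 GB: too much for an 11 GB card
print(rough_gpu_mem_gb(2, 1))  # ~4.8 GB: batch size 2 fits comfortably
print(rough_gpu_mem_gb(8, 2))  # ~4.8 GB: or keep batch 8 and downsample by 2
```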

@PhillipsML
Author

@allenleetc you may be on to something! I just set up a new .lbl to try some different parameters: I'm using DLC at the moment and it's running. I've checked my GPU usage and I'm about maxed out; my GeForce RTX 3090 has 24 GB. I wonder if, when I use the MDN without downsampling, I just hit a memory wall. I will try that next and see.
@mkabra I've added the trx files for the 3-mouse project; the GPU is the GeForce RTX 3090. Let me look into the memory load of what I was trying to train and see if that's why it's hanging at the building-database stage.
Thanks so much. I'll play some more and get back to you.

@Hexusprime

Hi there! Nice to meet you all! :)

My name is Matthew, I’m part of the helpdesk staff here at MPFI.

Just wanted to add a bit about what I've found regarding this current issue:

With a fresh install of Ubuntu 20.04, following the install instructions and then running the Docker backend test, all tests pass and it seems that the computer should be ready to train.

[image]

The issue arises when going to train: the training monitor appears, but sadly nothing ever happens afterwards.

I can also confirm that the process has started by leaving a terminal window running in the background with the command "watch -d -n 0.5 nvidia-smi"; you can see the process appear and the GPU temp/memory usage slowly go up.

[image]

Unfortunately no blue line ever appears afterwards.

I’ve tried configuring the tracker as well to take smaller batches (2) and down sampling all the way to 4 at the same time to see if there was some invisible limit being hit, but the same results happen.

The training monitor will open, hang on the first job, and never progress; despite iterations completing, I'll only see a single point, never a line.

[image]

Note that I’ve set this program up the same way on another machine using the GTX 1080 and that starts training/works flawlessly with no issues, however on this machine with the RTX 3090, it refuses to progress.

I don’t see any error messages in the training monitor either, and I’m currently using the latest pull of APT as well.

Mary has mentioned it's because it's training with NaNs, and when you go to stop training, the dialog even says as much:

[image]

Is there a way I could share some system logs or anything at all to help troubleshoot this? I'd also be open to a Zoom call to show you the issue at hand if you'd be interested in taking a look.
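For example, I could leave a quick throwaway script like this running during a train to capture a GPU memory trace to a file (it only assumes nvidia-smi is on the PATH; the log filename is arbitrary):

```python
# Throwaway logger: append a GPU memory/utilization sample to a file every
# 5 seconds so the trace can be attached here. Assumes nvidia-smi is on PATH.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader"]

with open("gpu_usage.log", "a") as logfile:
    while True:
        sample = subprocess.run(QUERY, capture_output=True, text=True).stdout
        logfile.write(sample)
        logfile.flush()
        time.sleep(5)
```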

Thank you so much for all your help, super excited to see the end results of using this powerful card in this process. :)

Matthew Morgan
ITS Helpdesk Tech.

@allenleetc
Collaborator

Hey Matthew,

Thanks for the detailed report! It certainly does look like you are getting NaNs during training. One way to confirm is to select "Show log files" in the Training Monitor and press "Go" to get a print-out of the training log; if you can attach/cut+paste the entire log here, that would be useful.
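(If it helps, a quick way to confirm the NaNs before attaching is to scan the log for them; this is just a rough sketch, since the actual log filename and line format may differ:)

```python
# Rough sketch: print any log lines that mention a NaN loss. The filename and
# line format here are placeholders, not APT's exact output.
with open("train.log") as f:
    for lineno, line in enumerate(f, 1):
        if "nan" in line.lower() and "loss" in line.lower():
            print(f"line {lineno}: {line.rstrip()}")
```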

Another useful thing would be to Save (or Save As...) the project after stopping the train (ideally, let it run for a brief while first), as prompted by the last dialog box in your report. Just the project (.lbl file) would be useful. Is the project basically similar to Mary's "OneMouse" and "ThreeMouse" projects that she included above?

@mkabra This seems to suggest that the RTX 3090 is the culprit; do you agree? A quick search has turned up a bunch of similar reports.
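One quick standalone check outside of APT (assuming a TF 2.x Python environment) would be to confirm that TensorFlow sees the card at all and that a simple op on it comes back finite; mismatched TF/CUDA builds on Ampere cards have been reported to produce exactly this kind of silent NaN:

```python
# Standalone sanity check (not an APT script): does TensorFlow see the GPU,
# and does a simple op on it return finite numbers?
import tensorflow as tf

print("GPUs visible:", tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.test.is_built_with_cuda())

with tf.device("/GPU:0"):
    x = tf.random.normal((256, 256))
    y = tf.matmul(x, x)
print("Result finite:", bool(tf.reduce_all(tf.math.is_finite(y))))
```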

@Hexusprime

Hexusprime commented Sep 23, 2021

Hi Allen!

Thank you so much for the quick response! I'll get that running now and should have those files available for you tomorrow morning along with the log.

I'll also confirm with Mary tomorrow the similarity in the projects.

Thanks!
Matthew

@mkabra
Collaborator

mkabra commented Sep 23, 2021 via email

@Hexusprime

Hexusprime commented Sep 23, 2021

Hi Mayank!

Thank you for the detailed reply! It's interesting to see how TensorFlow is affecting the processes. I'll take a look at the param branch here when we get a chance.

Per Allen's request, I'll still put up the log files for APT as well as the .lbl file, just in case anything can be learned or derived from them. Anything we can do to help or test, we'll do so gladly! :) I've also tossed in the MATLAB log as well, if that is of any interest.

You should be able to just click this and view/download anything in here you need, but let me know if it gives you any permission issues.

Thank you again for all your help, we'll give this a test and get back to you 👍

Matthew

@mkabra
Collaborator

mkabra commented Sep 24, 2021 via email

@pascar4

pascar4 commented Jan 13, 2022

While using a 1080 Ti, the Docker job is created and then disappears in under a minute, cancelling the training (the backend test was successful). This results in NaN/20000.
I've checked the error log file and it is blank.
Maybe this is the same issue with TF, or maybe a different issue? Any insight would help.
APT_NAN_Error

@mkabra
Collaborator

mkabra commented Jan 14, 2022 via email

@evo11x

evo11x commented Jun 21, 2022

I have the same problem with Ray/PyTorch, CUDA 11.7, and an RTX 3060; it seems this problem comes from NVIDIA CUDA 11.x.
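In case it helps others narrow this down, here is a quick standalone check of the CUDA/PyTorch combination itself (nothing APT- or Ray-specific):

```python
# Standalone check: does a simple op on the CUDA device come back finite
# under this PyTorch/CUDA combination? (Not specific to APT or Ray.)
import torch

assert torch.cuda.is_available(), "PyTorch does not see a CUDA device"
print("torch", torch.__version__, "built against CUDA", torch.version.cuda)

x = torch.randn(256, 256, device="cuda")
y = x @ x
print("result finite:", torch.isfinite(y).all().item())
```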
