nan loss in MuLaN training #20
Comments
It could be a number of things, but it's definitely overtrained. Also, for contrastive learning, a batch size of 4 is too small; even 64 is too small. Yeah, this can only realistically be done over at OpenCLIP.
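(For context on why batch size matters so much here: in an InfoNCE-style contrastive loss, each positive pair is contrasted against the other B - 1 in-batch pairs, so B = 4 gives only 3 negatives per positive. Below is a minimal sketch of such a loss for illustration only; it is not MuLaN's exact implementation.)

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, temperature=0.07):
    # normalize to unit length so dot products are cosine similarities
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) similarity matrix; the diagonal entries are the positive pairs,
    # everything off-diagonal is a negative -- only B - 1 of them per row
    logits = audio_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    # symmetric cross-entropy: audio -> text (rows) and text -> audio (columns)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

With B = 4, each softmax row sees only 4 candidates, so the objective is nearly trivial and the training signal is noisy; this is why contrastive setups like CLIP typically train with effective batch sizes in the thousands.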
how did your loss curves look before the loss diverged?
@lucidrains
2. Regarding the batch size, I tried to increase it, but I get a memory error:

RuntimeError: CUDA out of memory. Tried to allocate 38.27 GiB (GPU 0; 23.65 GiB total capacity; 3.28 GiB already allocated; 18.28 GiB free; 4.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Also, I am only feeding 5.5 seconds of audio (at 16 kHz), i.e. 88k samples, to the model. If I increase this, I get the same memory error, even with your mock data class (not real data):

RuntimeError: CUDA out of memory. Tried to allocate 7.55 GiB (GPU 0; 23.65 GiB total capacity; 16.45 GiB already allocated; 6.21 GiB free; 16.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So increasing either the batch size or the audio length gives a memory error. Does the model have some maximum limit for the audio length?

3. Regarding the integration into OpenCLIP, are you working on it? How long might it take?
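(A side note on the error message itself: PYTORCH_CUDA_ALLOC_CONF is a real PyTorch environment variable, and setting max_split_size_mb can help with fragmentation, though it cannot conjure memory that isn't there; a 38.27 GiB allocation will never fit on a 23.65 GiB card. A minimal sketch, with 128 as an example value:)

```python
# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so set it before importing torch (or export it in the shell instead).
# max_split_size_mb:128 is just an example value to reduce fragmentation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # must come after the environment variable is set
```

Beyond that, the usual levers for this kind of OOM are mixed precision (torch.cuda.amp) and gradient checkpointing. Note that gradient accumulation over small batches does not help the contrastive objective itself, since it adds no in-batch negatives.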
@ukemamaster (1) that's very abnormal haha (2) yeah, you need a lot of memory (3) i have an issue at open clip and an active PR that i'm working on, feel free to subscribe!
@lucidrains This is the PR you are talking about, right?
@ukemamaster yes, that's the one. it will have to be trained and then served as a pretrained model. i'll write a wrapper for use within this repository, similar to here. you should get involved; there are many researchers and startups interested in this
@lucidrains I am also waiting eagerly for a pre-trained model, for immediate use. Please make it available asap. Thank you.
@lucidrains Thank you for sharing; looking forward to your progress.
@ukemamaster Did you manage to train it in the end? I have also encountered this problem now:
@deepak-newzera Did you train successfully in the end? I have also encountered this problem now.
To add on to this: using a batch size of 512, I see the same loss-curve -> nan loss behavior.
@lucidrains
While training MuLaN on a dataset of around 5.2k samples, the loss goes to nan after some 15-16k steps. My batch size is 4, and the text part of the data samples is tokenized using:

Does it have something to do with a zero division, or a square root of 0, in the loss function?
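(On the zero-division / sqrt(0) hypothesis: in a contrastive setup, the two usual suspects are the L2 normalization of the embeddings, whose gradient involves a 1/sqrt(sum(x^2)) term that blows up as the norm approaches 0, and an unbounded learned temperature. The snippet below is a sketch of the standard guards, not taken from MuLaN's actual code.)

```python
import torch

def safe_normalize(x, eps=1e-8):
    # F.normalize already guards with a small eps internally, but an explicit
    # clamp makes the intent obvious: never divide by a (near-)zero norm
    return x / x.norm(dim=-1, keepdim=True).clamp(min=eps)

# a learned temperature is often stored as a log so it stays positive;
# clamping the exponentiated value keeps the logits (and the softmax) bounded
log_temperature = torch.nn.Parameter(torch.tensor(1.0))
temperature = log_temperature.exp().clamp(max=100.0)
```

In practice, a loss that diverges to nan only after many thousands of healthy steps more often points at exploding logits from an unclamped temperature, or at overflow under fp16, than at a literal divide-by-zero on step one.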