
nan loss in MuLaN training #20

Closed
ukemamaster opened this issue Feb 16, 2023 · 11 comments

@ukemamaster

@lucidrains
While training MuLaN on a dataset of around 5.2k samples, the loss goes to nan after some 15-16k steps.
My batch size is 4, and the text part of each sample is tokenized using:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text_in_numbers = tokenizer.encode(text)

Could it have something to do with a division by zero, or a square root of zero, somewhere in the loss function?
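
For context, one common source of nans in CLIP-style contrastive losses is l2-normalizing an embedding whose norm is (close to) zero, or a learned temperature growing until the logits overflow to inf. Below is a minimal, self-contained sketch of a guarded contrastive loss in plain PyTorch; the tensors, eps, and temperature values are illustrative assumptions, not the actual MuLaN loss code from this repository.

import torch
import torch.nn.functional as F

# hypothetical audio/text embeddings; in MuLaN these would come from the
# audio and text transformers
audio_embeds = torch.randn(4, 512)
text_embeds = torch.randn(4, 512)

# F.normalize clamps the denominator with eps, so a (near) zero-norm
# embedding does not produce a division by zero
audio_embeds = F.normalize(audio_embeds, dim = -1, eps = 1e-8)
text_embeds = F.normalize(text_embeds, dim = -1, eps = 1e-8)

# symmetric CLIP-style cross entropy over the similarity matrix
temperature = 0.07  # in a real model this is usually learned and clamped
logits = audio_embeds @ text_embeds.t() / temperature
labels = torch.arange(logits.shape[0])
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2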

@lucidrains
Owner

it could be a number of things, but it's definitely overtrained

also, for contrastive learning, a batch size of 4 is too small. even 64 is too small

yeah, this can only realistically be done over at open clip

@lucidrains
Owner

how did your loss curve look before it diverged?

@ukemamaster
Author

ukemamaster commented Feb 17, 2023

@lucidrains
1. This is what the loss looks like.
After some 20k steps it becomes inf, and a few steps later it becomes nan. But even before the nan, the plot of the real values looks very strange to me.
[plot: mulan_loss training loss curve]

2. Regarding the batch size, I tried to increase it but I get a memory error (see the memory sketch after this list):

RuntimeError: CUDA out of memory. Tried to allocate 38.27 GiB (GPU 0; 23.65 GiB total capacity; 3.28 GiB already allocated; 18.28 GiB free; 4.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Also, I am only feeding 5.5 seconds of audio (at 16 kHz), i.e., 88k samples, to the model. If I increase this, I get the same memory error, even with your mock data class (not real data).

RuntimeError: CUDA out of memory. Tried to allocate 7.55 GiB (GPU 0; 23.65 GiB total capacity; 16.45 GiB already allocated; 6.21 GiB free; 16.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So increasing either the batch size or the audio length gives a memory error. Does the model have some maximum limit for audio length?

3. Regarding the integration into open clip, are you working on it? How long might it take?
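
Side note on the memory errors in item 2: a minimal sketch of two standard mitigations, setting PYTORCH_CUDA_ALLOC_CONF as the error message itself suggests, and running the forward pass under mixed precision. The train_step wrapper and the variable names (mulan, wavs, texts) are illustrative assumptions, not code from this repository.

import os

# must be set before the first CUDA allocation, ideally before importing torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(mulan, optimizer, wavs, texts):
    # hypothetical training step: `mulan`, `wavs` and `texts` stand in for
    # the MuLaN model and one batch from a dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # fp16 activations roughly halve activation memory
        loss = mulan(wavs, texts)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

Note that gradient accumulation on its own does not help the contrastive objective much, since the negatives are only drawn from within each micro-batch.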

@lucidrains
Owner

@ukemamaster (1) that's very abnormal haha (2) yeah, you need a lot of memory (3) i have an issue at open clip and an active PR that i'm working on, feel free to subscribe!

@ukemamaster
Author

ukemamaster commented Feb 20, 2023

@lucidrains This is the PR you are talking about, right?
If it gets accepted, then what? Will we still need to train MuLaN ourselves, or (as usual, they will put a model card on Hugging Face and) we will get a pretrained one, ready to use?

@lucidrains
Owner

lucidrains commented Feb 20, 2023

@ukemamaster yes, that's the one. it will have to be trained and then served as a pretrained model. i'll write a wrapper for use within this repository, similar to here. you should get involved; there's many researchers and startups interested in this

@deepak-newzera

deepak-newzera commented Feb 21, 2023

@lucidrains I am also waiting eagerly for a pre-trained model, for immediate use. Please make it available asap. Thank you.

@Mingxiangyu

@ukemamaster yes, that's the one. it will have to be trained and then served as a pretrained model. i'll write a wrapper for use within this repository, similar to here. you should get involved; there's many researchers and startups interested in this

@lucidrains Thank you for sharing and looking forward to your progress

@Mingxiangyu

@lucidrains This is the PR you are talking about, right? If it gets accepted, then what? Will we still need to train MuLaN? Or (as usual, they will put a model card on Hugging Face and) we will get a pretrained one, ready to use?

@ukemamaster Did you manage to train it in the end? I have also encountered this problem now:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.49 GiB (GPU 0; 12.00 GiB total capacity; 10.17 GiB already allocated; 0 bytes free; 10.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@Mingxiangyu

@lucidrains I am also eagerly waiting for a pre-trained model, for immediate use. Please make it available asap. Thank you.

@deepak-newzera Did you train successfully in the end? I have also encountered this problem now:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.49 GiB (GPU 0; 12.00 GiB total capacity; 10.17 GiB already allocated; 0 bytes free; 10.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@skychwang

to add on to this: using a batch size of 512, I see a similar loss curve -> nan loss behavior.
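
Not a root-cause fix, but a sketch of how one might contain the divergence: skip optimizer steps whose loss is already non-finite and clip the gradient norm. The safe_step helper and the 1.0 threshold below are illustrative assumptions, not code from this repository.

import torch

def safe_step(loss, model, optimizer, max_grad_norm = 1.0):
    # skip the update entirely if the loss is already inf/nan, so one bad
    # batch does not poison the weights
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False
    loss.backward()
    # clip the global gradient norm before stepping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True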
