
nan loss in MuLaN training #20

Closed
ukemamaster opened this issue Feb 16, 2023 · 11 comments

@ukemamaster

@lucidrains
While training MuLaN on a dataset of around 5.2k samples, the loss goes to nan after some 15-16k steps.
My batch size is 4, and the text part of each sample is tokenized using:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text_in_numbers = tokenizer.encode(text)

Could it have something to do with a division by zero, or a square root of zero, somewhere in the loss function?
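
For context, one common source of nans in CLIP-style contrastive losses is l2-normalizing an embedding whose norm is (close to) zero, or a learned temperature growing until the logits overflow to inf. Below is a minimal, self-contained sketch of a guarded contrastive loss in plain PyTorch; the tensors, eps, and temperature values are illustrative assumptions, not the actual MuLaN loss code from this repository.

import torch
import torch.nn.functional as F

# hypothetical audio/text embeddings; in MuLaN these would come from the
# audio and text transformers
audio_embeds = torch.randn(4, 512)
text_embeds = torch.randn(4, 512)

# F.normalize clamps the denominator with eps, so a (near) zero-norm
# embedding does not produce a division by zero
audio_embeds = F.normalize(audio_embeds, dim = -1, eps = 1e-8)
text_embeds = F.normalize(text_embeds, dim = -1, eps = 1e-8)

# symmetric CLIP-style cross entropy over the similarity matrix
temperature = 0.07  # in a real model this is usually learned and clamped
logits = audio_embeds @ text_embeds.t() / temperature
labels = torch.arange(logits.shape[0])
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2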

@lucidrains
Owner

it could be a number of things, but it's definitely overtrained

also, for contrastive learning, a batch size of 4 is too small. even 64 is too small

yeah, this can only realistically be done over at open clip

@lucidrains
Owner

how did your loss curve look before it diverged?

@ukemamaster
Author

ukemamaster commented Feb 17, 2023

@lucidrains
1. This is what the loss looks like.
After some 20k steps it becomes inf, and a few steps later it becomes nan. But even before the nan, the plot of the real values looks very strange to me.
[plot: mulan_loss training loss curve]

2. Regarding the batch size, I tried to increase it but I get a memory error (see the memory sketch after this list):

RuntimeError: CUDA out of memory. Tried to allocate 38.27 GiB (GPU 0; 23.65 GiB total capacity; 3.28 GiB already allocated; 18.28 GiB free; 4.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Also, I am only feeding 5.5 seconds of audio (at 16 kHz), i.e., 88k samples, to the model. If I increase this, I get the same memory error, even with your mock data class (not real data).

RuntimeError: CUDA out of memory. Tried to allocate 7.55 GiB (GPU 0; 23.65 GiB total capacity; 16.45 GiB already allocated; 6.21 GiB free; 16.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So increasing either the batch size or the audio length gives a memory error. Does the model have some maximum limit for audio length?

3. Regarding the integration into open clip, are you working on it? How long might it take?
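
Side note on the memory errors in item 2: a minimal sketch of two standard mitigations, setting PYTORCH_CUDA_ALLOC_CONF as the error message itself suggests, and running the forward pass under mixed precision. The train_step wrapper and the variable names (mulan, wavs, texts) are illustrative assumptions, not code from this repository.

import os

# must be set before the first CUDA allocation, ideally before importing torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(mulan, optimizer, wavs, texts):
    # hypothetical training step: `mulan`, `wavs` and `texts` stand in for
    # the MuLaN model and one batch from a dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # fp16 activations roughly halve activation memory
        loss = mulan(wavs, texts)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

Note that gradient accumulation on its own does not help the contrastive objective much, since the negatives are only drawn from within each micro-batch.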

@lucidrains
Owner

@ukemamaster (1) that's very abnormal haha (2) yeah, you need a lot of memory (3) i have an issue at open clip and an active PR that i'm working on, feel free to subscribe!

@ukemamaster
Author

ukemamaster commented Feb 20, 2023

@lucidrains This is the PR you are talking about, right?
If it gets accepted, then what? Will we still need to train MuLaN ourselves, or (as usual, they will put a model card on Hugging Face and) we will get a pretrained one, ready to use?

@lucidrains
Owner

lucidrains commented Feb 20, 2023

@ukemamaster yes, that's the one. it will have to be trained and then served as a pretrained model. i'll write a wrapper for use within this repository, similar to here. you should get involved; there's many researchers and startups interested in this

@deepak-newzera

deepak-newzera commented Feb 21, 2023

@lucidrains I am also waiting eagerly for a pre-trained model, for immediate use. Please make it available asap. Thank you.

@Mingxiangyu

@ukemamaster yes, that's the one. it will have to be trained and then served as a pretrained model. i'll write a wrapper for use within this repository, similar to here. you should get involved; there's many researchers and startups interested in this

@lucidrains Thank you for sharing and looking forward to your progress

@Mingxiangyu

@lucidrains This is the PR you are talking about, right? If it gets accepted, then what? Will we still need to train MuLaN? Or (as usual, they will put a model card on Hugging Face and) we will get a pretrained one, ready to use?

@ukemamaster Did you manage to train it in the end? I have also encountered this problem now:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.49 GiB (GPU 0; 12.00 GiB total capacity; 10.17 GiB already allocated; 0 bytes free; 10.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@Mingxiangyu

@lucidrains I am also eagerly waiting for a pre-trained model, for immediate use. Please make it available asap. Thank you.

@deepak-newzera Did you train successfully in the end? I have also encountered this problem now:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.49 GiB (GPU 0; 12.00 GiB total capacity; 10.17 GiB already allocated; 0 bytes free; 10.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@skychwang

to add on to this: using a batch size of 512, I see a similar loss curve -> nan loss behavior.
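
Not a root-cause fix, but a sketch of how one might contain the divergence: skip optimizer steps whose loss is already non-finite and clip the gradient norm. The safe_step helper and the 1.0 threshold below are illustrative assumptions, not code from this repository.

import torch

def safe_step(loss, model, optimizer, max_grad_norm = 1.0):
    # skip the update entirely if the loss is already inf/nan, so one bad
    # batch does not poison the weights
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False
    loss.backward()
    # clip the global gradient norm before stepping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True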
