Irregular Loss Pattern; getting "Loss: NaN" #45

Open
Respaired opened this issue Feb 10, 2024 · 1 comment
Comments

Respaired commented Feb 10, 2024

TL;DR:

  • Encountering frequent NaN values, mainly for the overall Loss, during training with a large Japanese dataset (10.5 million rows).
  • No such issues with another, much smaller dataset (800,000 rows).
  • Should I ignore the NaN values or revert to the other dataset, considering how much smaller it is?
  • Attempted to disable mixed precision during training, but the issue remains unresolved.

Hi. I'm trying to train PL-BERT on Japanese. I used the entirety of this dataset for that purpose.

Somehow I'm getting a lot of NaN for the overall Loss, while the Vocab Loss (for the most part; on some rare occasions I get NaN for it as well) and the Token Loss seem to be doing fine. I've also tried using the tokenizer's full vocab size in case something was wrong with the way I pruned it, but no, I still get the same result.

If I decrease the log interval (to 10 steps, for instance), I see the Loss sit around 2 to 3, then go to NaN, and back and forth.

Step [5100/1000000], Loss: nan, Vocab Loss: 1.12363, Token Loss: 2.01707
Step [5200/1000000], Loss: nan, Vocab Loss: 1.15805, Token Loss: 1.97737
Step [5300/1000000], Loss: nan, Vocab Loss: 1.24844, Token Loss: 1.88506
Step [5400/1000000], Loss: nan, Vocab Loss: 1.18666, Token Loss: 1.90820
Step [5500/1000000], Loss: nan, Vocab Loss: 1.33804, Token Loss: 2.04283
Step [5600/1000000], Loss: nan, Vocab Loss: 1.18824, Token Loss: 1.99786
Step [5700/1000000], Loss: nan, Vocab Loss: 0.98660, Token Loss: 1.84933
Step [5800/1000000], Loss: nan, Vocab Loss: 1.19794, Token Loss: 2.06009
Step [5900/1000000], Loss: nan, Vocab Loss: 1.12529, Token Loss: 2.08546
Step [6000/1000000], Loss: nan, Vocab Loss: 1.10970, Token Loss: 1.98083
Step [6100/1000000], Loss: nan, Vocab Loss: nan, Token Loss: 1.96394
Step [6200/1000000], Loss: nan, Vocab Loss: 1.10657, Token Loss: 1.97735

I should say that I'm seeing this pattern only on this particular dataset; I ran a short test session on this one, keeping everything else constant and unchanged, and it seems to be working fine. Should I simply ignore the NaN, or should I switch back to the other dataset? (The problematic dataset is roughly 10.5M rows; if a good model can be trained with the 800k rows of the dataset that works fine, then I guess I should do that?)

I have also tried disabling mixed_precision, but it did not help.
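In case it helps narrow things down, this is roughly the guard I've been wrapping around the backward/step to see which term goes non-finite first. It's a rough sketch, not the actual train.py; the function and variable names are mine, and it ignores the accelerate/GradScaler plumbing since I'm running without mixed precision here:

```python
import torch

def guarded_step(model, optimizer, loss, vocab_loss, token_loss, step):
    """Skip the update if any loss term is non-finite; otherwise clip and step.

    `loss`, `vocab_loss`, `token_loss` are the three scalars the training loop
    already computes (names are my own, not the repo's).
    """
    parts = {"Loss": loss, "Vocab Loss": vocab_loss, "Token Loss": token_loss}
    bad = [name for name, value in parts.items() if not torch.isfinite(value).all()]
    if bad:
        # a non-finite loss would poison the weights on the next step, so skip the batch
        print(f"step {step}: non-finite {bad}, skipping this batch")
        optimizer.zero_grad()
        return False
    loss.backward()
    # clip in case the NaN is really exploding gradients rather than bad data
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return True
```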


Here's my config:

log_dir: "Checkpoint"
mixed_precision: "fp16"
data_folder: "/home/ubuntu/001_PLBERT_JA/PL-BERT/jpn_wiki"
batch_size: 72
save_interval: 5000
log_interval: 100
num_process: 1 # number of GPUs
num_steps: 1000000

dataset_params:
    tokenizer: "cl-tohoku/bert-base-japanese-v2"
    token_separator: " " # token used for phoneme separator (space)
    token_mask: "M" # token used for phoneme mask (M)
    word_separator: 14 # token used for word separator (<unused9>)
    token_maps: "token_maps.pkl" # token map path
    
    max_mel_length: 512 # max phoneme length
    
    word_mask_prob: 0.15 # probability to mask the entire word
    phoneme_mask_prob: 0.1 # probability to mask each phoneme
    replace_prob: 0.2 # probability to replace phonemes
    
model_params:
    vocab_size: 178
    hidden_size: 768
    num_attention_heads: 12
    intermediate_size: 2048
    max_position_embeddings: 512
    num_hidden_layers: 12
    dropout: 0.1

I'm training on 2x V100s (32 GB each).
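One more thing I still want to rule out, given vocab_size: 178, is a stray phoneme ID in the new data that doesn't fit in the embedding table, since an out-of-range index could explain an occasional blow-up. This is only a rough sanity check I plan to run; the load path is from my config, but the column name and the fact that IDs are stored as integers are assumptions about my own preprocessing, not the repo's code:

```python
from datasets import load_from_disk

VOCAB_SIZE = 178  # model_params.vocab_size from the config

# the preprocessed dataset from the config's data_folder
dataset = load_from_disk("/home/ubuntu/001_PLBERT_JA/PL-BERT/jpn_wiki")

max_id, min_id = 0, 10**9
for row in dataset:                 # on 10.5M rows, a random subset is enough
    ids = row["phoneme_ids"]        # hypothetical column holding integer phoneme IDs
    max_id = max(max_id, max(ids))
    min_id = min(min_id, min(ids))

print(f"phoneme ID range in the data: [{min_id}, {max_id}]")
assert 0 <= min_id and max_id < VOCAB_SIZE, "some IDs fall outside the embedding table"
```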
Thank you very much.

Respaired commented Feb 11, 2024

I ended up changing the dataset, this time to one with more than 90M sentences, though I still want to know what the problem was.

Also, is it natural to get such a low Vocab Loss (around 0.4)? I increased the dataset to more than 10x its previous size, but the training time per step didn't increase much, which I didn't expect (45 min per 5000 steps).

I also opted out of pruning the tokenizer, in case I want to add some more data later. Fortunately, the vocab size of the original tokenizer wasn't too large to begin with (32k).

Step [13000/1000000], Loss: 1.71645, Vocab Loss: 0.47993, Token Loss: 1.28277
Step [13100/1000000], Loss: 1.75661, Vocab Loss: 0.52135, Token Loss: 1.38908
Step [13200/1000000], Loss: 1.71416, Vocab Loss: 0.41433, Token Loss: 1.27051
Step [13300/1000000], Loss: 1.72707, Vocab Loss: 0.51451, Token Loss: 1.24440
Step [13400/1000000], Loss: 1.73744, Vocab Loss: 0.43977, Token Loss: 1.18668
Step [13500/1000000], Loss: 1.73052, Vocab Loss: 0.46005, Token Loss: 1.22199
Step [13600/1000000], Loss: 1.71264, Vocab Loss: 0.48650, Token Loss: 1.20200
Step [13700/1000000], Loss: 1.70186, Vocab Loss: 0.41886, Token Loss: 1.11467
Step [13800/1000000], Loss: 1.71243, Vocab Loss: 0.38319, Token Loss: 1.05600
Step [13900/1000000], Loss: 1.71505, Vocab Loss: 0.45034, Token Loss: 1.27390
Step [14000/1000000], Loss: 1.69532, Vocab Loss: 0.46685, Token Loss: 1.06771
Step [14100/1000000], Loss: 1.66890, Vocab Loss: 0.38901, Token Loss: 1.20284
Step [14200/1000000], Loss: 1.67050, Vocab Loss: 0.45279, Token Loss: 1.24074
Step [14300/1000000], Loss: 1.70772, Vocab Loss: 0.46576, Token Loss: 1.46027
Step [14400/1000000], Loss: 1.67857, Vocab Loss: 0.41315, Token Loss: 1.20940

Since I don't have a validation loss, I can't be sure how well I'm doing. One last thing: how many steps should I fine-tune on the downstream TTS task before I risk overfitting? (Let's say for a dataset the size of LJSpeech.)
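My rough plan for a validation signal is just to hold out a small slice of the dataset and average the same losses over it every few thousand steps, without gradients. A sketch of what I mean, where compute_losses is a placeholder for whatever the training loop already computes per batch and val_loader is a DataLoader built from the held-out slice:

```python
import torch

# hold out a small slice before training (sizes/seed are arbitrary)
split = dataset.train_test_split(test_size=0.001, seed=42)
train_set, val_set = split["train"], split["test"]

@torch.no_grad()
def validation_loss(model, val_loader, compute_losses, max_batches=100):
    """Average the training losses over a few held-out batches."""
    model.eval()
    total, n = 0.0, 0
    for i, batch in enumerate(val_loader):
        if i >= max_batches:
            break
        loss, vocab_loss, token_loss = compute_losses(model, batch)  # placeholder
        total += loss.item()
        n += 1
    model.train()
    return total / max(n, 1)
```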
