Encountering frequent NaN values, mainly for the Loss, while training with a large Japanese dataset (10.5 million rows).
No such issues with another, albeit smaller, dataset (800,000 rows).
Should I ignore the NaN values, or revert to the other dataset, considering how much smaller it is?
Attempted to disable mixed precision during training, but the issue remains unresolved.
Hi. I'm trying to train a PL-BERT on Japanese. I used the entirety of this dataset for that purpose.
Somehow I'm getting a lot of NaN for the Loss, while the Vocab Loss (for the most part; on some rare occasions I get NaN for it as well) and the Tokenizer Loss seem to be doing fine. I've also tried using the full vocab size of the tokenizer in case something was wrong with the way I pruned it, but no, I still get the same result.
If I decrease the log steps (to 10, for instance), I see the Loss hover around 2 to 3, then go to NaN, and back and forth.
I should note that I'm seeing this pattern only on this particular dataset; I ran a short test session on this other one, keeping everything else constant and unchanged, and it seems to work fine. Should I simply ignore the NaN, or should I switch back to the other dataset? (The problematic dataset is roughly 10.5M rows; if a good model can be trained with the 800k rows of the dataset that works fine, then I guess I should do that?)
I have also tried disabling mixed_precision, but it still did not help.
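In case it helps to show what I mean by "back and forth": here's a minimal sketch of the NaN guard I experimented with (not the actual PL-BERT train.py; model, optimizer, and compute_loss are hypothetical placeholders). It skips the update whenever the loss is non-finite and records which batches triggered it, so the underlying rows can be inspected:

import torch
from torch.nn.utils import clip_grad_norm_

def train_with_nan_guard(model, train_loader, optimizer, compute_loss, max_steps=1000):
    """Hypothetical loop: skip any batch whose loss is NaN/Inf and record its index."""
    bad_batches = []
    for step, batch in enumerate(train_loader):
        if step >= max_steps:
            break
        optimizer.zero_grad()
        loss = compute_loss(model, batch)  # assumed helper returning the total scalar loss

        if not torch.isfinite(loss):
            bad_batches.append(step)  # remember which batch blew up so its rows can be checked
            continue                  # skip the update instead of poisoning the weights

        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)  # guard against exploding gradients
        optimizer.step()
    return bad_batches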
Here's my config:
log_dir: "Checkpoint"
mixed_precision: "fp16"
data_folder: "/home/ubuntu/001_PLBERT_JA/PL-BERT/jpn_wiki"
batch_size: 72
save_interval: 5000
log_interval: 100
num_process: 1 # number of GPUs
num_steps: 1000000

dataset_params:
    tokenizer: "cl-tohoku/bert-base-japanese-v2"
    token_separator: " " # token used for phoneme separator (space)
    token_mask: "M" # token used for phoneme mask (M)
    word_separator: 14 # token used for word separator (<unused9>)
    token_maps: "token_maps.pkl" # token map path
    max_mel_length: 512 # max phoneme length
    word_mask_prob: 0.15 # probability to mask the entire word
    phoneme_mask_prob: 0.1 # probability to mask each phoneme
    replace_prob: 0.2 # probability to replace phonemes

model_params:
    vocab_size: 178
    hidden_size: 768
    num_attention_heads: 12
    intermediate_size: 2048
    max_position_embeddings: 512
    num_hidden_layers: 12
    dropout: 0.1
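One thing I considered checking against this config is whether the big dataset has rows that are empty or longer than max_mel_length, since such rows seem like plausible NaN suspects. A rough sketch (assuming the data folder is a datasets.Dataset saved with save_to_disk, with a "phonemes" column; the config path and column name are assumptions, not necessarily what the preprocessing actually produces):

import yaml
from datasets import load_from_disk

config = yaml.safe_load(open("Configs/config.yml"))  # path assumed
dataset = load_from_disk(config["data_folder"])      # assumes save_to_disk() output

max_len = config["dataset_params"]["max_mel_length"]
bad_rows = [
    i for i, row in enumerate(dataset)
    # assumed column name "phonemes"; empty or overlong rows are suspects for NaN losses
    if len(row["phonemes"]) == 0 or len(row["phonemes"]) > max_len
]
print(f"{len(bad_rows)} suspicious rows out of {len(dataset)}")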
I'm training on 2x V100s (32 GB each).
Thank you very much.
I ended up changing the dataset this time to one with more than 90M sentences, though I still want to know what the problem was.
Also, is it natural to get such a low Vocab Loss (around 0.4)? I increased the dataset to more than 10x its previous size, but the training time per step didn't increase much, which I didn't expect (45 min per 5,000 steps).
I also opted out of pruning the tokenizer, in case I want to add more data later. Fortunately, the vocab size of the original tokenizer wasn't too large to begin with (32k).
Since I don't have a validation loss, I can't be sure how well I'm doing. One last thing: how many steps should I fine-tune on the downstream TTS task before I risk overfitting? (Say, for a dataset around the size of LJSpeech.)
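On the validation-loss point, this is roughly what I had in mind, assuming the preprocessed corpus is a Hugging Face datasets.Dataset saved with save_to_disk (the paths and split size are just placeholders): hold out a small split before training and periodically evaluate the same masked-phoneme losses on it.

from datasets import load_from_disk

# Minimal sketch: carve out a small held-out split so a validation loss can be tracked.
dataset = load_from_disk("jpn_wiki")                         # path assumed from data_folder
splits = dataset.train_test_split(test_size=0.005, seed=42)  # ~0.5% held out
train_set, val_set = splits["train"], splits["test"]

train_set.save_to_disk("jpn_wiki_train")
val_set.save_to_disk("jpn_wiki_val")
print(len(train_set), len(val_set))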