Sub-workers exit without messages #692
I tried to go deeper and recover the errors causing the exit. I ended up here:
Which made the workers exit.
This is the command I use to launch the training.
Two CUDA GPUs are available. The single-GPU version (just setting the CUDA_VISIBLE_DEVICES env variable) works well.
Confirming that this error in my case too comes from …
@mahnerak I solved this by setting num_workers=0.
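For context, a minimal sketch of what this workaround changes, assuming --num-workers is forwarded to PyTorch's DataLoader(num_workers=...) as in fairseq-style trainers; ToyDataset is a made-up placeholder, not metaseq code. With num_workers=0 batches are produced in the main process, so there are no sub-worker processes left to exit silently and any loading error is raised directly.

# Illustrative sketch only, not metaseq code.
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])


# num_workers=0: no sub-worker processes, iteration runs in the main process.
loader = DataLoader(ToyDataset(), batch_size=4, num_workers=0)
for batch in loader:
    print(batch.shape)  # torch.Size([4, 1])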
Thanks @GongZhengLi. I don't think …
@mahnerak, did you solve it?
Not yet. Still waiting.
@mahnerak @GongZhengLi …
🐛 Bug
I use the following script:
CUDA_VISIBLE_DEVICES="0, 1, 2, 3" metaseq-train --task streaming_language_modeling \
    data/pile-test/ \
    --num-workers 4 \
    --reset-dataloader \
    --vocab-filename ./vocab/gpt2-vocab.json \
    --merges-filename ./vocab/gpt2-merges.txt \
    --model-parallel-size 1 \
    --ddp-backend fully_sharded \
    --task-ddp-backend fully_sharded \
    --criterion cross_entropy \
    --batch-size 8 \
    --save-dir /checkpoints/lm_transformer_pile-00 \
    --arch transformer_lm_gpt2_tiny --share-decoder-input-output-embed \
    --dropout 0.1 \
    --optimizer adam --weight-decay 0.01 --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --tokens-per-sample 1024 --sample-break-mode none --fp16 \
    --use-sharded-state \
    --decoder-learned-pos \
    --log-format json \
    --log-interval 1
Ranks 1, 2, and 3 exit before the train_step loop. I printed detailed logs at every step and found that the iter() call inside more_itertools.peekable() kills all the non-master processes.
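A minimal sketch of why the failure shows up at that point (illustrative only; FailingLoader is a hypothetical stand-in, not metaseq code): more_itertools.peekable calls iter() on its argument as soon as it is constructed, so any error raised while the underlying loader builds its iterator (e.g. while spawning worker processes) surfaces right at the peekable(...) call rather than inside train_step.

# Illustrative sketch only, not metaseq code.
from more_itertools import peekable


class FailingLoader:
    def __iter__(self):
        # Stand-in for a data loader whose sub-workers die during startup.
        raise RuntimeError("DataLoader worker exited unexpectedly")


try:
    peekable(FailingLoader())  # peekable() invokes iter() immediately here
except RuntimeError as err:
    print("error surfaced at peekable construction:", err)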
What's the matter with this?