Env: 16 GPUs + llama2 pretrain + megatron-lm
strategy: TP 8 + PP 1 + DP 2
case: when killing a training process to retrigger fault-tolerance with the megatron-distributed flash-checkpoint, the dp 1 group's load_checkpoint failed with the following log:
```
WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
```
The reason is that the dp 1 group loads the checkpoint from storage because it has no model state in memory, and it uses an allreduce in read_metadata to agree on the iteration; meanwhile the dp 0 group only loads from memory and skips that path, so the allreduce is mismatched and produces the bogus max iteration shown above.
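A minimal sketch of the failure mode (not the actual flash-checkpoint code; the function name `agree_on_iteration` and the placeholder value fed by the in-memory ranks are assumptions for illustration): when only part of the ranks read the on-disk metadata, the MAX allreduce over the iteration picks up whatever the other ranks contribute, and the storage-loading ranks end up replacing their real iteration (15) with a garbage maximum.

```python
import torch
import torch.distributed as dist


def agree_on_iteration(local_iteration: int) -> int:
    """All ranks agree on the maximum checkpoint iteration via a MAX allreduce."""
    iters = torch.tensor([local_iteration], dtype=torch.long)  # CPU tensor, works with gloo
    dist.all_reduce(iters, op=dist.ReduceOp.MAX)
    return int(iters.item())


if __name__ == "__main__":
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Hypothetical reproduction: odd ranks stand in for the dp 1 group that reads
    # the real iteration (15) from storage metadata, while even ranks stand in for
    # the dp 0 group that restores from memory and feeds an uninitialized
    # placeholder into the collective.
    reads_from_storage = rank % 2 == 1
    local_iter = 15 if reads_from_storage else 4160813071  # assumed placeholder value

    agreed = agree_on_iteration(local_iter)
    if reads_from_storage and agreed != local_iter:
        print(f"WARNING: on rank {rank} found iteration {local_iter} in the metadata "
              f"while max iteration across the ranks is {agreed}, "
              f"replacing it with max iteration.")

    dist.destroy_process_group()
```

Run with e.g. `torchrun --nproc_per_node=4 repro.py`; the storage-reading ranks print the same warning as in the log, which is why only the dp 1 group fails to load a usable iteration.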