
While using Megatron distributed flash-checkpoint to recover, an error occurs in load_checkpoint #1233

Open
deepcoldfish opened this issue Aug 13, 2024 · 1 comment


@deepcoldfish

Env: 16 GPUs + LLaMA-2 pretraining + Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: when killing a training process to trigger fault tolerance with Megatron distributed flash-checkpoint, load_checkpoint fails on the DP group 1 ranks with the following log:

WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
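
For context, this warning text comes from Megatron-LM's `read_metadata` (in `megatron/checkpointing.py`), which all-reduces the iteration number across ranks with MAX. The sketch below is a rough paraphrase of that check, not a verbatim copy; the function name `reconcile_iteration` is mine:

```python
# Rough paraphrase of the iteration check in Megatron-LM's read_metadata
# (megatron/checkpointing.py); not a verbatim copy of the upstream code.
import torch
import torch.distributed as dist

def reconcile_iteration(iteration: int) -> int:
    # Every rank contributes the iteration it read from its tracker file;
    # the MAX across ranks wins.
    iters = torch.tensor([iteration], dtype=torch.long, device='cuda')
    dist.all_reduce(iters, op=dist.ReduceOp.MAX)
    max_iter = iters[0].item()
    if iteration != max_iter:
        print(f'WARNING: on rank {dist.get_rank()} found iteration {iteration} '
              f'in the metadata while max iteration across the ranks is '
              f'{max_iter}, replacing it with max iteration.')
    return max_iter
```

Note that this all_reduce runs over all ranks, so every rank must reach it together; a bogus max like 4160813071 is consistent with a mismatched collective.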

The reason is that the DP group 1 ranks load the checkpoint from storage (they have no model state in memory) and therefore perform the allreduce inside read_metadata, while the DP group 0 ranks load only from memory and never enter that collective.
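
One conceivable guard (a hedged sketch, not DLRover's or Megatron's actual fix; `has_memory_ckpt` is a hypothetical per-rank flag) is to make all ranks agree on a single load path before any of them reaches read_metadata, so its allreduce is executed either by every rank or by none:

```python
# Hedged sketch: agree on the load path up front so later collectives match.
# `has_memory_ckpt` is a hypothetical flag, True if this rank still holds a
# flash-checkpoint replica in memory.
import torch
import torch.distributed as dist

def all_ranks_can_load_from_memory(has_memory_ckpt: bool) -> bool:
    # All-reduce with MIN: the result is 1 only if every rank has a memory
    # checkpoint, so every rank takes the same branch afterwards.
    flag = torch.tensor([int(has_memory_ckpt)], dtype=torch.long,
                        device='cuda' if torch.cuda.is_available() else 'cpu')
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())
```

Each rank would call this with its local flag; if it returns False, all ranks fall back to storage together and the allreduce inside read_metadata is matched on every rank.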

@BalaBalaYi
Collaborator

Can you provide more information? The more detailed, the better.
e.g.
Details of the kill (at which step did the checkpoint fail? at which step was the checkpoint loaded after failover?)
