megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallel #1146

Open
Lzhang-hub opened this issue May 29, 2024 · 7 comments


@Lzhang-hub
Contributor

When training Megatron-LM with flash-ckpt and pipeline parallelism enabled, the checkpoint cannot be saved successfully. It seems that not all ranks save their checkpoint to memory, so persisting is skipped:
Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.

Save log:

[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 5 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 1 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 7 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 0.006s.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 0.007s.
[2024-05-29 08:54:18,536] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,537] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,705] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 0.175s.
[2024-05-29 08:54:18,717] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 0.188s.
[2024-05-29 08:54:18,767] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 0.237s.
[2024-05-29 08:54:18,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 0.34s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=60, global_shard_num=0)
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 0.343s.
[2024-05-29 08:54:33,889] [INFO] [ckpt_saver.py:630:_check_shard_step_consistence] The cached steps are [60, 0, 60, 0, 60, 0, 60, 0]
[2024-05-29 08:54:33,889] [WARNING] [ckpt_saver.py:804:save_step_checkpoint] Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.
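
For context, this warning is emitted by a per-node consistency check: before persisting, the saver compares the training step cached in each local rank's shared-memory buffer, and here only half of the ranks hold step 60 while the others still hold step 0. Below is a minimal sketch of such a check (a hypothetical simplification for illustration, not the actual DLRover ckpt_saver code):

from typing import List

def check_shard_step_consistency(cached_steps: List[int], step: int) -> bool:
    """Return True only if every local rank has cached the target step.

    cached_steps holds, for each local rank on the node, the training step
    whose state dict currently sits in that rank's shared-memory buffer.
    """
    if any(s != step for s in cached_steps):
        print(f"Skip persisting the checkpoint of step {step} because the "
              f"cached steps {cached_steps} in memory are not consistent.")
        return False
    return True

# With the cached steps from the log above, persisting is skipped:
check_shard_step_consistency([60, 0, 60, 0, 60, 0, 60, 0], step=60)  # -> False
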
@workingloong
Collaborator

workingloong commented Jun 3, 2024

I have successfully tested the flash checkpoint using the following command with 4 A100 nodes in the forked repo whose commit id is cb995d5.

dlrover-run --max-restarts=2  --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE pretrain_gpt.py \
       --tensor-model-parallel-size $TP_SIZE \
       --pipeline-model-parallel-size $PP_SIZE \
       --use-distributed-optimizer \
       --num-layers 48 \
       --hidden-size 1600 \
       --num-attention-heads 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --micro-batch-size 4 \
       --global-batch-size 8 \
       --train-iters 100 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --split 900,50,50 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --min-lr 1.0e-5 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --log-interval 1 \
       --save-interval 100 \
       --eval-interval 1000 \
       --eval-iters 10 

@Lzhang-hub
Contributor Author

Which version of dlrover did you use? Besides, have you tested with torchrun?

I tested dlrover 0.3.7 with torchrun on this repo and still got the error.

@workingloong
Collaborator

dlrover[torch]==0.3.7. I can reproduce the issue when I do not use --use-distributed-optimizer.
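
One plausible reading (an assumption on my side, not confirmed in this thread): without --use-distributed-optimizer the optimizer state is replicated across data-parallel ranks and Megatron-LM only writes a checkpoint from data-parallel rank 0, so the other DP ranks never cache a state dict in shared memory and the per-node step check fails. A rough sketch of that rank-selection logic (simplified, not Megatron-LM's exact code):

def rank_writes_checkpoint(data_parallel_rank: int, use_distributed_optimizer: bool) -> bool:
    """Decide whether a rank contributes its own checkpoint shard.

    Without the distributed optimizer, the optimizer state is replicated
    across data-parallel ranks, so only DP rank 0 needs to write it. With
    --use-distributed-optimizer, every DP rank owns a unique optimizer
    shard and therefore must write one.
    """
    return use_distributed_optimizer or data_parallel_rank == 0

# With DP size 2 and no distributed optimizer, half of the local ranks never
# cache a state dict, which would match the alternating [60, 0, 60, 0, ...]
# cached steps in the log above.
print([rank_writes_checkpoint(dp, False) for dp in (0, 1, 0, 1, 0, 1, 0, 1)])
# [True, False, True, False, True, False, True, False]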

@Lzhang-hub
Contributor Author

I added --use-distributed-optimizer and got a new error.
Env: 4*8=32 A100 GPUs, TP=2, PP=8

[2024-06-03 08:11:01,842] [INFO] [ckpt_saver.py:892:commit_checkpoint] The number of ready shards is 26 != 32.
Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 429, in _saver
    cls._saver_instance._sync_shm_to_storage()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 535, in _sync_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 854, in save_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 881, in commit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/storage.py", line 179, in listdir
    return os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/work/wqssd-cfs/data/LLM/multi-node-train/megatron-lm-train/checkpoint/qy/multi-node/gpt3-1.5B-flashckpt/._dlrover_ckpt_stage/20.done'
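
For what it's worth, the exception is raised when commit_checkpoint lists the '.done' staging directory and the shared filesystem has not made it visible (the reporter notes below that this is specific to their storage). A defensive wrapper like the following is a hypothetical workaround sketch, not the DLRover fix:

import os
from typing import List

def listdir_safe(path: str) -> List[str]:
    """List a directory, treating a missing directory as empty.

    If the '.done' staging directory is not (yet) visible on the shared
    filesystem, return an empty list instead of raising FileNotFoundError,
    so the caller can report 0 ready shards and retry later.
    """
    try:
        return os.listdir(path)
    except FileNotFoundError:
        return []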

@Lzhang-hub
Contributor Author

I added --use-distributed-optimizer and got a new error. Env: 4*8=32 A100 GPUs, TP=2, PP=8 (same traceback as above).

This error is related to the storage I use and can be ignored.

@jackielyc

The job will hang in the end.

@TomSuen

TomSuen commented Oct 31, 2024

I have successfully tested the flash checkpoint using the following command with 4 A100 nodes in the forked repo whose commit id is cb995d5. [...]

Hi, can dlrover-run be used with DeepSpeed?
