megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallel #1146

Open
Lzhang-hub opened this issue May 29, 2024 · 7 comments


@Lzhang-hub
Contributor

When training Megatron-LM with flash-ckpt and pipeline parallelism enabled, the checkpoint cannot be saved successfully. It seems that not all ranks save their checkpoint to memory, so persisting is skipped:
Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.

Save log:

[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 5 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 1 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 7 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 0.006s.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 0.007s.
[2024-05-29 08:54:18,536] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,537] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,705] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 0.175s.
[2024-05-29 08:54:18,717] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 0.188s.
[2024-05-29 08:54:18,767] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 0.237s.
[2024-05-29 08:54:18,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 0.34s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=60, global_shard_num=0)
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 0.343s.
[2024-05-29 08:54:33,889] [INFO] [ckpt_saver.py:630:_check_shard_step_consistence] The cached steps are [60, 0, 60, 0, 60, 0, 60, 0]
[2024-05-29 08:54:33,889] [WARNING] [ckpt_saver.py:804:save_step_checkpoint] Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.
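
For context, this warning is emitted by a per-node consistency check: before persisting, the saver compares the training step cached in each local rank's shared-memory buffer, and here only half of the ranks hold step 60 while the others still hold step 0. Below is a minimal sketch of such a check (a hypothetical simplification for illustration, not the actual DLRover ckpt_saver code):

from typing import List

def check_shard_step_consistency(cached_steps: List[int], step: int) -> bool:
    """Return True only if every local rank has cached the target step.

    cached_steps holds, for each local rank on the node, the training step
    whose state dict currently sits in that rank's shared-memory buffer.
    """
    if any(s != step for s in cached_steps):
        print(f"Skip persisting the checkpoint of step {step} because the "
              f"cached steps {cached_steps} in memory are not consistent.")
        return False
    return True

# With the cached steps from the log above, persisting is skipped:
check_shard_step_consistency([60, 0, 60, 0, 60, 0, 60, 0], step=60)  # -> False
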
@workingloong
Collaborator

workingloong commented Jun 3, 2024

I have successfully tested the flash checkpoint using the following command with 4 A100 nodes in the forked repo whose commit id is cb995d5.

dlrover-run --max-restarts=2  --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE pretrain_gpt.py \
       --tensor-model-parallel-size $TP_SIZE \
       --pipeline-model-parallel-size $PP_SIZE \
       --use-distributed-optimizer \
       --num-layers 48 \
       --hidden-size 1600 \
       --num-attention-heads 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --micro-batch-size 4 \
       --global-batch-size 8 \
       --train-iters 100 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --split 900,50,50 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --min-lr 1.0e-5 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --log-interval 1 \
       --save-interval 100 \
       --eval-interval 1000 \
       --eval-iters 10 

@Lzhang-hub
Contributor Author

Which version of dlrover did you use? Besides, have you tested with torchrun?

I tested dlrover 0.3.7 with torchrun on this repo and still got the error.

@workingloong
Collaborator

dlrover[torch]==0.3.7. I can reproduce the issue when I do not use --use-distributed-optimizer.
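
One plausible reading (an assumption on my side, not confirmed in this thread): without --use-distributed-optimizer the optimizer state is replicated across data-parallel ranks and Megatron-LM only writes a checkpoint from data-parallel rank 0, so the other DP ranks never cache a state dict in shared memory and the per-node step check fails. A rough sketch of that rank-selection logic (simplified, not Megatron-LM's exact code):

def rank_writes_checkpoint(data_parallel_rank: int, use_distributed_optimizer: bool) -> bool:
    """Decide whether a rank contributes its own checkpoint shard.

    Without the distributed optimizer, the optimizer state is replicated
    across data-parallel ranks, so only DP rank 0 needs to write it. With
    --use-distributed-optimizer, every DP rank owns a unique optimizer
    shard and therefore must write one.
    """
    return use_distributed_optimizer or data_parallel_rank == 0

# With DP size 2 and no distributed optimizer, half of the local ranks never
# cache a state dict, which would match the alternating [60, 0, 60, 0, ...]
# cached steps in the log above.
print([rank_writes_checkpoint(dp, False) for dp in (0, 1, 0, 1, 0, 1, 0, 1)])
# [True, False, True, False, True, False, True, False]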

@Lzhang-hub
Contributor Author

I added --use-distributed-optimizer and got a new error.
Env: 4*8=32 A100 GPUs, TP=2, PP=8

[2024-06-03 08:11:01,842] [INFO] [ckpt_saver.py:892:commit_checkpoint] The number of ready shards is 26 != 32.
Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 429, in _saver
    cls._saver_instance._sync_shm_to_storage()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 535, in _sync_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 854, in save_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 881, in commit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/storage.py", line 179, in listdir
    return os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/work/wqssd-cfs/data/LLM/multi-node-train/megatron-lm-train/checkpoint/qy/multi-node/gpt3-1.5B-flashckpt/._dlrover_ckpt_stage/20.done'
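
For what it's worth, the exception is raised when commit_checkpoint lists the '.done' staging directory and the shared filesystem has not made it visible (the reporter notes below that this is specific to their storage). A defensive wrapper like the following is a hypothetical workaround sketch, not the DLRover fix:

import os
from typing import List

def listdir_safe(path: str) -> List[str]:
    """List a directory, treating a missing directory as empty.

    If the '.done' staging directory is not (yet) visible on the shared
    filesystem, return an empty list instead of raising FileNotFoundError,
    so the caller can report 0 ready shards and retry later.
    """
    try:
        return os.listdir(path)
    except FileNotFoundError:
        return []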

@Lzhang-hub
Contributor Author

I added --use-distributed-optimizer and got a new error. Env: 4*8=32 A100 GPUs, TP=2, PP=8 (same traceback as above).

This error is related to the storage I use and can be ignored.

@jackielyc

The job will hang in the end.

@TomSuen

TomSuen commented Oct 31, 2024

I have successfully tested the flash checkpoint using the following command with 4 A100 nodes in the forked repo whose commit id is cb995d5. [...]

Hi, can dlrover-run be used with DeepSpeed?
