Bug description
After fine-tuning a 1.3B BERT model with Lightning + DeepSpeed ZeRO stage 2 on a single machine with 4 GPUs, converting the checkpoint with zero_to_fp32.py fails with the error below.
However, the conversion works fine when the model is swapped for a 0.1B BERT model.
[error information]:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)
Traceback (most recent call last):
File "/data/xxx/DstoryProgram/ds-algo-llm-ie/common_utils/zero_to_fp32.py", line 628, in
state_dict = get_fp32_state_dict_from_zero_checkpoint(save_path)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 523, in get_fp32_state_dict_from_zero_checkpoint
return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 209, in _get_fp32_state_dict_from_zero_checkpoint
zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 150, in parse_optim_states
state_dict = torch.load(f, map_location=device)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 1326, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 671, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive
Process finished with exit code 1
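For reference, the conversion being attempted here can also be driven through Lightning's own helper, which, as far as I understand, wraps the same zero_to_fp32 logic. A minimal sketch, assuming Lightning 2.x is installed; the paths are placeholders, not the actual paths from this run:

from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Placeholder paths: point checkpoint_dir at the saved DeepSpeed checkpoint directory.
checkpoint_dir = "lightning_logs/version_0/checkpoints/last.ckpt"
output_file = "pytorch_model_fp32.bin"

# Consolidates the ZeRO-sharded checkpoint into a single fp32 state dict on disk.
convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file)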
[checkpoint dir]:
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 2.1G Nov 5 18:17 mp_rank_00_model_states.pt
torch.load('bf16_zero_pp_rank_x_mp_rank_00_optim_states.pt') works fine on each of the optimizer-state files.
However, torch.load('mp_rank_00_model_states.pt') fails.
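The "PytorchStreamReader failed reading zip archive: not a ZIP archive" error usually means the file being read is truncated or otherwise not a valid zip-based torch.save file. A small diagnostic sketch (assuming the directory layout above; ckpt_dir is a placeholder, and nothing here is part of the DeepSpeed or Lightning API) that checks every .pt file in the checkpoint directory for the ZIP magic bytes:

import glob
import os

ckpt_dir = "path/to/checkpoint_dir"  # placeholder: the directory listed above

for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.pt"))):
    with open(path, "rb") as f:
        magic = f.read(4)
    # torch.save uses a zip container, so a healthy file starts with the ZIP
    # local-file-header magic PK\x03\x04.
    ok = magic == b"PK\x03\x04"
    size = os.path.getsize(path)
    print(f"{os.path.basename(path)}: {size} bytes, zip magic: {'ok' if ok else 'MISSING'}")

A file that reports a missing magic (or an unexpectedly small size) was most likely corrupted or cut short while being written.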
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response