PytorchStreamReader failed reading zip archive: not a ZIP archive #20398

Open
Crazy-LittleBoy opened this issue Nov 6, 2024 · 2 comments
Labels
bug Something isn't working needs triage Waiting to be triaged by maintainers ver: 2.4.x

Bug description

After fine-tuning a 1.3B-parameter BERT model with Lightning and DeepSpeed ZeRO stage 2 on a single machine with 4 GPUs, converting the checkpoint with zero_to_fp32.py fails.
The same conversion works when the model is swapped for a 0.1B-parameter BERT model.

[error information]:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)

Traceback (most recent call last):
  File "/data/xxx/DstoryProgram/ds-algo-llm-ie/common_utils/zero_to_fp32.py", line 628, in <module>
    state_dict = get_fp32_state_dict_from_zero_checkpoint(save_path)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 523, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 209, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 150, in parse_optim_states
    state_dict = torch.load(f, map_location=device)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 1326, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 671, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive

Process finished with exit code 1

[checkpoint dir]:
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 2.1G Nov 5 18:17 mp_rank_00_model_states.pt

When I use torch.load('bf16_zero_pp_rank_x_mp_rank_00_optim_states.pt'), it works fine.
However, when I use torch.load('mp_rank_00_model_states.pt'), it fails.
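
To narrow down which file in the checkpoint directory is unreadable, something like the following could help. This is a minimal sketch using only the standard-library `zipfile` module. The helper name `check_checkpoint_files` is hypothetical, and it assumes every `.pt` file was written in PyTorch's zip-based format. Note that `testzip()` reads every entry in full, so it is slow on multi-gigabyte files.

```python
import zipfile
from pathlib import Path


def check_checkpoint_files(ckpt_dir):
    """Return {filename: status} for every .pt file in ckpt_dir (hypothetical helper)."""
    results = {}
    for path in sorted(Path(ckpt_dir).glob("*.pt")):
        # is_zipfile only checks that a ZIP end-of-central-directory record can be found.
        if not zipfile.is_zipfile(path):
            results[path.name] = "not a ZIP archive"
            continue
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first entry with a bad CRC
            # or header, or None if every entry reads back cleanly.
            bad_entry = zf.testzip()
        results[path.name] = "ok" if bad_entry is None else f"corrupt entry: {bad_entry}"
    return results
```

Calling `check_checkpoint_files(checkpoint_path)` on the directory listed above would report a status for each of the five files. A file can pass this check and still fail in `torch.load`, since PyTorch uses its own reader.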

What version are you seeing the problem on?

v2.4

Crazy-LittleBoy commented Nov 6, 2024

import zipfile

with zipfile.ZipFile('mp_rank_00_model_states.pt', 'r') as zip_ref:
    zip_ref.printdir()

File Name                        Modified                   Size
archive/data.pkl                 1980-00-00 00:00:00      140339
archive/byteorder                1980-00-00 00:00:00           6
archive/data/0                   1980-00-00 00:00:00  2530828288
archive/data/1                   1980-00-00 00:00:00     1097936
archive/data/2                   1980-00-00 00:00:00           4
archive/data/3                   1980-00-00 00:00:00           4
archive/version                  1980-00-00 00:00:00           2
archive/.data/serialization_id

I found that the archive/byteorder entry is corrupted; its value should be b'little'.
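
For anyone who wants to repeat this check on their own checkpoint, here is a rough sketch of reading that entry directly. The function name `read_byteorder` is hypothetical. It matches on the `/byteorder` suffix because the top-level folder name inside the archive (`archive` here) is assumed to vary between files.

```python
import zipfile


def read_byteorder(ckpt_path):
    """Return the raw bytes of the byteorder entry, or None if there is none (hypothetical helper)."""
    with zipfile.ZipFile(ckpt_path) as zf:
        # Match on the suffix so the top-level folder name does not matter.
        names = [n for n in zf.namelist() if n.endswith("/byteorder")]
        if not names:
            return None
        # read() raises zipfile.BadZipFile if the stored CRC does not match the data.
        return zf.read(names[0])
```

On a healthy little-endian checkpoint this is expected to return `b'little'`, as stated above. A `BadZipFile` exception or any other value would point at the same corruption.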

Qwen2-0.5B works fine, but Qwen2-1.5B fails.
