Bug description
After fine-tuning a 1.3B BERT model with Lightning + DeepSpeed ZeRO stage 2 on a single machine with 4 GPUs, converting the checkpoint with zero_to_fp32.py fails with the error below.
However, the conversion works fine when the model is swapped for a 0.1B BERT model.
[error information]:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)
Traceback (most recent call last):
File "/data/xxx/DstoryProgram/ds-algo-llm-ie/common_utils/zero_to_fp32.py", line 628, in
state_dict = get_fp32_state_dict_from_zero_checkpoint(save_path)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 523, in get_fp32_state_dict_from_zero_checkpoint
return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 209, in _get_fp32_state_dict_from_zero_checkpoint
zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 150, in parse_optim_states
state_dict = torch.load(f, map_location=device)
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 1326, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 671, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive
Process finished with exit code 1
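For reference, the conversion being attempted here can also be driven through Lightning's own helper, which, as far as I understand, wraps the same zero_to_fp32 logic. A minimal sketch, assuming Lightning 2.x is installed; the paths are placeholders, not the actual paths from this run:

from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Placeholder paths: point checkpoint_dir at the saved DeepSpeed checkpoint directory.
checkpoint_dir = "lightning_logs/version_0/checkpoints/last.ckpt"
output_file = "pytorch_model_fp32.bin"

# Consolidates the ZeRO-sharded checkpoint into a single fp32 state dict on disk.
convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file)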
[checkpoint dir]:
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 2.1G Nov 5 18:17 mp_rank_00_model_states.pt
torch.load('bf16_zero_pp_rank_x_mp_rank_00_optim_states.pt') works fine on each of the optimizer-state files.
However, torch.load('mp_rank_00_model_states.pt') fails.
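The "PytorchStreamReader failed reading zip archive: not a ZIP archive" error usually means the file being read is truncated or otherwise not a valid zip-based torch.save file. A small diagnostic sketch (assuming the directory layout above; ckpt_dir is a placeholder, and nothing here is part of the DeepSpeed or Lightning API) that checks every .pt file in the checkpoint directory for the ZIP magic bytes:

import glob
import os

ckpt_dir = "path/to/checkpoint_dir"  # placeholder: the directory listed above

for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.pt"))):
    with open(path, "rb") as f:
        magic = f.read(4)
    # torch.save uses a zip container, so a healthy file starts with the ZIP
    # local-file-header magic PK\x03\x04.
    ok = magic == b"PK\x03\x04"
    size = os.path.getsize(path)
    print(f"{os.path.basename(path)}: {size} bytes, zip magic: {'ok' if ok else 'MISSING'}")

A file that reports a missing magic (or an unexpectedly small size) was most likely corrupted or cut short while being written.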
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response