This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

reshard_mp.py --num-output-parts 1 merges to smaller OPT file #695

Closed
larekrow opened this issue Mar 31, 2023 · 3 comments

larekrow commented Mar 31, 2023

🐛 Bug

Using reshard_mp.py with --num-output-parts 1 gives me an output file much smaller than the combined input parts for OPT-1.3B and OPT-125M. However, the behavior is as expected for OPT-IML-30B.

To Reproduce

Steps to reproduce the behavior:

  1. Download the OPT-1.3B parts from the OPT download page.
  2. Set up and install Metaseq.
  3. Run reshard_mp.py:
     python -m metaseq.scripts.reshard_mp \
         --input  "~/models/opt_1b3/reshard-model_part-*.pt" \
         --output "~/models/opt_1b3/singleton/reshard-model_part-{i}.pt" \
         --num-output-parts 1
  4. Script completes "without errors".
  5. Check sizes: cd ~/models/; du -shc opt_1b3/*
1.3G    opt_1b3/reshard-model_part-0.pt
1.3G    opt_1b3/reshard-model_part-1.pt
376K    opt_1b3/singleton
2.5G    total

The same thing happens for OPT-125M:

122M    opt_125m/reshard-model_part-0.pt
122M    opt_125m/reshard-model_part-1.pt
64K     opt_125m/singleton
243M    total

OPT-IML-30B was fine:

28G     opt_iml_30b/max/checkpoint_1_6000.pt-model_part-0.pt
28G     opt_iml_30b/max/checkpoint_1_6000.pt-model_part-1.pt
56G     opt_iml_30b/max/singleton
112G    total

Expected Behavior

The output file size should equal the total size of all input files.
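
For anyone wanting to automate the du comparison above, here is a minimal sketch in Python. The function names and the 5% tolerance are my own assumptions, not part of metaseq; the merged file may legitimately differ a little from the sum of the parts, so treat the threshold as a rough sanity check rather than an exact rule.

```python
import glob
import os


def total_size(pattern):
    """Sum the on-disk size in bytes of every regular file matching a glob pattern."""
    paths = glob.glob(os.path.expanduser(pattern))
    return sum(os.path.getsize(p) for p in paths if os.path.isfile(p))


def merged_size_looks_right(input_pattern, output_pattern, tolerance=0.05):
    """Return True if the output bytes are within `tolerance` of the input bytes.

    `tolerance` is a hypothetical threshold chosen for illustration.
    """
    expected = total_size(input_pattern)
    actual = total_size(output_pattern)
    if expected == 0:
        return False  # nothing matched the input pattern
    return abs(actual - expected) <= tolerance * expected
```

With the paths from the report, the call would be merged_size_looks_right("~/models/opt_1b3/reshard-model_part-*.pt", "~/models/opt_1b3/singleton/*.pt"), which should come back False for the 376K output shown above.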

Environment

  • metaseq Version: 0.0.1 (followed the setup page ~2 days ago)
  • PyTorch Version: 1.10.1+cu113
  • OS: Ubuntu 20.04.4 LTS
  • How you installed metaseq: pip, following this
  • Python version: 3.8.1
  • CUDA version: 11.3
larekrow added the bug label Mar 31, 2023

larekrow commented Apr 3, 2023

After seeing #625, which said

"We have internal consolidated versions for 2.7B and 30B to check against"

I was hoping reshard_mp.py --num-output-parts 1 would somehow succeed for OPT-2.7B, but the same thing happens:

1.3G    opt_2b7/reshard-model_part-0.pt
1.3G    opt_2b7/reshard-model_part-1.pt
1.3G    opt_2b7/reshard-model_part-2.pt
1.3G    opt_2b7/reshard-model_part-3.pt
492K    opt_2b7/singleton
5.0G    total


tangbinh commented Apr 5, 2023

@larekrow This is somewhat of a known issue. The checkpoints available on the OPT page aren't quite compatible with reshard_mp.py, which expects unflattened checkpoints. We'll be updating the checkpoint URLs soon (see #625).
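
One rough way to tell whether a downloaded part is flattened is to look at its parameter names. The sketch below is not part of metaseq; it assumes that flattened checkpoints carry keys containing "flat_param" (the naming used by fairscale's parameter flattening) and that the parameters sit under a top-level "model" key. Both are assumptions worth verifying against your own files.

```python
def flattened_keys(state_dict):
    """Return the sorted keys that look like flattened-parameter entries.

    Assumption: flattened parameters have "flat_param" in their key name.
    """
    return sorted(k for k in state_dict if "flat_param" in k)


def load_model_state(path):
    """Load a checkpoint on CPU and return its parameter mapping."""
    import torch  # imported lazily so flattened_keys works without PyTorch

    ckpt = torch.load(path, map_location="cpu")
    # Assumption: parameters live under a "model" key; otherwise use the top level.
    if isinstance(ckpt, dict) and "model" in ckpt:
        return ckpt["model"]
    return ckpt
```

Calling flattened_keys(load_model_state("reshard-model_part-0.pt")) and getting a non-empty list back would suggest the part is still flattened, which is the case reshard_mp.py does not expect.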

tangbinh self-assigned this Apr 5, 2023

tangbinh commented Apr 5, 2023

@larekrow The issue should have been fixed by #701. Please let us know if you still need help.

tangbinh closed this as completed Apr 5, 2023