This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

how to get sharded ckpt #653

Open
laozhanghahaha opened this issue Feb 17, 2023 · 4 comments
Labels
question Further information is requested

Comments

@laozhanghahaha

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

Hey, I downloaded the 1.3B checkpoint from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT

and tried to start fine-tuning with this command:

opt-baselines -n 2 -g 4 -p test_v0 --model-size 1.3b --restore-file 1.3b/reshard.pt --data data-bin/ --checkpoints-dir checkpoints/ --no-save-dir --no-wandb --azure --local

But the log says: No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt

I tried convert_to_singleton.py, but it only gives me restored.pt. How can I get the *shard0.pt file?

Here is the log:

2023-02-17 07:04:55 | INFO | metaseq.utils | CUDA enviroments for all 4 workers
2023-02-17 07:04:55 | INFO | metaseq.cli.train | training on 4 devices (GPUs/TPUs)
2023-02-17 07:04:55 | INFO | metaseq.cli.train | max tokens per GPU = None and batch size per GPU = 32
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.cli.train | nvidia-smi stats: {'gpu_0_mem_used_gb': 6.5791015625, 'gpu_1_mem_used_gb': 12.6201171875, 'gpu_2_mem_used_gb': 3.76953125, 'gpu_3_mem_used_gb': 12.6591796875, 'gpu_4_mem_used_gb': 9.486328125, 'gpu_5_mem_used_gb': 9.619140625, 'gpu_6_mem_used_gb': 9.728515625, 'gpu_7_mem_used_gb': 9.572265625}
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.checkpoint_utils | attempting to load checkpoint from: 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | loading train data for epoch 1
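
For anyone reading the log above: the path passed to --restore-file was 1.3b/reshard.pt, but the trainer looked for 1.3b/reshard-model_part-0-shard0.pt. A minimal sketch of that naming pattern, inferred only from those two log lines, may help when checking which files are actually on disk. The helpers `expected_shard_path` and `missing_shards` are hypothetical names made up for illustration and are not part of metaseq, and the number of model parts and shards is an assumption you would need to adjust for your own run.

```python
import os


def expected_shard_path(restore_file: str, model_part: int = 0, shard: int = 0) -> str:
    """Hypothetical helper: rebuild the filename pattern seen in the log.

    Inferred from the log lines only; this is not metaseq's own code.
    """
    stem, ext = os.path.splitext(restore_file)
    return f"{stem}-model_part-{model_part}-shard{shard}{ext}"


def missing_shards(restore_file: str, model_parts: int, shards: int) -> list:
    """Return the expected shard files that do not exist on disk."""
    expected = [
        expected_shard_path(restore_file, part, shard)
        for part in range(model_parts)
        for shard in range(shards)
    ]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    # Prints: 1.3b/reshard-model_part-0-shard0.pt
    print(expected_shard_path("1.3b/reshard.pt"))
    # Assumed layout: one model part, one shard (adjust to your setup).
    for path in missing_shards("1.3b/reshard.pt", model_parts=1, shards=1):
        print("missing:", path)
```

If the script reports a missing file, the checkpoint on disk does not match the name the trainer derived from --restore-file, which is what the "No existing checkpoint found" line indicates.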

  • metaseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed metaseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:
@laozhanghahaha laozhanghahaha added the question Further information is requested label Feb 17, 2023
@wxthu

wxthu commented Feb 27, 2023

--data data-bin
Where can I get data-bin?

@laozhanghahaha
Author

@wxthu Create the directory with mkdir, then put your data in that folder.

@wxthu

wxthu commented Feb 27, 2023

@wxthu mkdir, then put the data in that folder
A dataset such as GLUE? I am new to NLP...

@laozhanghahaha
Author

@wxthu Your dataset should look like what this function expects:

def load_dataset(self, split: str, epoch=1, combine=False, **kwargs):
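
The linked function body is not shown in this thread, so the layout it expects is not confirmed here. As a rough sketch, assuming the data directory holds train/ and valid/ splits with numbered shard subfolders of JSONL files, and that each line is a JSON object with a "text" field, a small data-bin/ could be created as below. The helper names, the file name example.jsonl, and the "text" key are assumptions for illustration; check them against the docstring of load_dataset in your metaseq checkout.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one JSON object per line, each with an assumed "text" field."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def make_data_dir(root, train_texts, valid_texts):
    """Hypothetical helper: create the assumed <root>/<split>/00/*.jsonl layout."""
    train_path = os.path.join(root, "train", "00", "example.jsonl")
    valid_path = os.path.join(root, "valid", "00", "example.jsonl")
    write_jsonl(train_path, train_texts)
    write_jsonl(valid_path, valid_texts)
    return train_path, valid_path


if __name__ == "__main__":
    # Write into a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "data-bin")
        for path in make_data_dir(root, ["first document", "second document"], ["held-out document"]):
            print(os.path.relpath(path, tmp))
```

A benchmark such as GLUE would have to be converted into this kind of plain-text JSONL form first; the sketch only covers the directory structure.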
