how to get sharded ckpt #653

laozhanghahaha · 2023-02-17T11:51:03Z

❓ Questions and Help

Before asking:

search the issues.
search the docs.

Hi, I downloaded the 1.3B checkpoint from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT

and tried to start fine-tuning with this command:

opt-baselines -n 2 -g 4 -p test_v0 --model-size 1.3b --restore-file 1.3b/reshard.pt --data data-bin/ --checkpoints-dir checkpoints/ --no-save-dir --no-wandb --azure --local

But the log says: No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt

I tried convert_to_singleton.py, but it only produces restored.pt. How can I get the *****shard0.pt file?

Here is the log:

2023-02-17 07:04:55 | INFO | metaseq.utils | CUDA enviroments for all 4 workers
2023-02-17 07:04:55 | INFO | metaseq.cli.train | training on 4 devices (GPUs/TPUs)
2023-02-17 07:04:55 | INFO | metaseq.cli.train | max tokens per GPU = None and batch size per GPU = 32
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.cli.train | nvidia-smi stats: {'gpu_0_mem_used_gb': 6.5791015625, 'gpu_1_mem_used_gb': 12.6201171875, 'gpu_2_mem_used_gb': 3.76953125, 'gpu_3_mem_used_gb': 12.6591796875, 'gpu_4_mem_used_gb': 9.486328125, 'gpu_5_mem_used_gb': 9.619140625, 'gpu_6_mem_used_gb': 9.728515625, 'gpu_7_mem_used_gb': 9.572265625}
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.checkpoint_utils | attempting to load checkpoint from: 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | loading train data for epoch 1
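
To make the mismatch in the log concrete: the command passes 1.3b/reshard.pt, but the trainer looks for 1.3b/reshard-model_part-0-shard0.pt. Below is a minimal sketch, in plain Python rather than metaseq code, that rebuilds the suffixed name and lists which expected files are missing. The suffix pattern is inferred from that single log line, so treating it as valid for other part or shard indices is an assumption, and the helper names expected_shard_path and report_missing are hypothetical.

```python
from pathlib import Path


def expected_shard_path(restore_file: str, model_part: int = 0, shard: int = 0) -> Path:
    """Build the suffixed file name seen in the log for a given --restore-file.

    Assumption: the "-model_part-{i}-shard{j}" pattern is inferred from one
    log line; it is not taken from the metaseq source.
    """
    path = Path(restore_file)
    suffixed = f"{path.stem}-model_part-{model_part}-shard{shard}{path.suffix}"
    return path.with_name(suffixed)


def report_missing(restore_file: str, model_parts: int = 1, shards: int = 1) -> list:
    """Return the expected shard files that are not present on disk."""
    missing = []
    for part in range(model_parts):
        for shard in range(shards):
            candidate = expected_shard_path(restore_file, part, shard)
            if not candidate.exists():
                missing.append(candidate)
    return missing


if __name__ == "__main__":
    # Reproduces the name from the log line for the command in this issue.
    print(expected_shard_path("1.3b/reshard.pt").as_posix())
    for path in report_missing("1.3b/reshard.pt"):
        print("missing:", path.as_posix())
```

This only checks file names. It does not create or convert checkpoints, so it helps confirm what the trainer is looking for before you try to reshard.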

metaseq Version (e.g., 1.0 or master):
PyTorch Version (e.g., 1.0)
OS (e.g., Linux):
How you installed metaseq (pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:


wxthu · 2023-02-27T10:40:18Z

About --data data-bin:
where can I get data-bin?

laozhanghahaha · 2023-02-27T10:52:15Z

@wxthu Create the directory with mkdir, then put your data in that folder.

wxthu · 2023-02-27T10:53:38Z

@wxthu mkdir, then put the data in that folder
Do you mean a dataset such as GLUE? I am new to NLP.

laozhanghahaha · 2023-02-27T12:09:53Z

@wxthu Your dataset should follow the layout that this function expects:

metaseq/metaseq/tasks/streaming_language_modeling.py

Line 287 in b47f8d1

def load_dataset(self, split: str, epoch=1, combine=False, **kwargs):
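
As a rough illustration of what "put the data in that folder" could mean, here is a minimal sketch that writes a tiny JSONL file into a split/shard directory tree. The layout data-bin/train/00/<source>.jsonl and the "text" key are my reading of the load_dataset docstring, not something stated in this thread, so please check them against the file at commit b47f8d1. The helper name write_jsonl_shard, the source name "example", and the sample sentences are all hypothetical.

```python
import json
from pathlib import Path


def write_jsonl_shard(data_root, split, shard, source, texts):
    """Write one JSONL file at <data_root>/<split>/<shard>/<source>.jsonl.

    Assumption: each line is a JSON object with a "text" field, and the
    split/shard directory layout matches what load_dataset expects.
    """
    shard_dir = Path(data_root) / split / shard
    shard_dir.mkdir(parents=True, exist_ok=True)
    out_path = shard_dir / f"{source}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for text in texts:
            # One JSON document per line, newline-terminated.
            f.write(json.dumps({"text": text}) + "\n")
    return out_path


if __name__ == "__main__":
    write_jsonl_shard("data-bin", "train", "00", "example", ["First document.", "Second document."])
    write_jsonl_shard("data-bin", "valid", "00", "example", ["A held-out document."])
```

A labeled benchmark such as GLUE would first need to be flattened into plain text documents like these, since this sketch assumes a language-modeling task that reads raw text.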

laozhanghahaha added the "question" label on Feb 17, 2023
