This repository has been archived by the owner on Nov 1, 2024. It is now read-only.
❓ Questions and Help
Hey, I downloaded the 1.3B checkpoint from https://github.com/facebookresearch/metaseq/tree/main/projects/OPT
and I am trying to start fine-tuning with this command:
opt-baselines -n 2 -g 4 -p test_v0 --model-size 1.3b --restore-file 1.3b/reshard.pt --data data-bin/ --checkpoints-dir checkpoints/ --no-save-dir --no-wandb --azure --local
But the log tells me: No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt
I tried convert_to_singleton.py, but I only get restored.pt. How can I get the *****shard0.pt file?
Here is the log:
2023-02-17 07:04:55 | INFO | metaseq.utils | CUDA enviroments for all 4 workers
2023-02-17 07:04:55 | INFO | metaseq.cli.train | training on 4 devices (GPUs/TPUs)
2023-02-17 07:04:55 | INFO | metaseq.cli.train | max tokens per GPU = None and batch size per GPU = 32
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.cli.train | nvidia-smi stats: {'gpu_0_mem_used_gb': 6.5791015625, 'gpu_1_mem_used_gb': 12.6201171875, 'gpu_2_mem_used_gb': 3.76953125, 'gpu_3_mem_used_gb': 12.6591796875, 'gpu_4_mem_used_gb': 9.486328125, 'gpu_5_mem_used_gb': 9.619140625, 'gpu_6_mem_used_gb': 9.728515625, 'gpu_7_mem_used_gb': 9.572265625}
2023-02-17 07:04:55 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2023-02-17 07:04:55 | INFO | metaseq.checkpoint_utils | attempting to load checkpoint from: 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | No existing checkpoint found 1.3b/reshard-model_part-0-shard0.pt
2023-02-17 07:04:55 | INFO | metaseq.trainer | loading train data for epoch 1
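To make the mismatch easier to see, here is a small diagnostic sketch. It is not part of metaseq: the helper names are hypothetical, and the naming rule (inserting `-model_part-0-shard0` before the extension of the `--restore-file` path) is only inferred from the two checkpoint lines in the log above. It prints the filename the trainer appears to look for and lists the `.pt` files that actually exist in the same directory.

```python
from pathlib import Path
from typing import List


def expected_shard_path(restore_file: str, model_part: int = 0, shard: int = 0) -> Path:
    """Build the per-rank filename that shows up in the log for a --restore-file.

    Assumption (inferred from the log, not taken from metaseq's source):
    "reshard.pt" is looked up as "reshard-model_part-0-shard0.pt".
    """
    path = Path(restore_file)
    return path.with_name(f"{path.stem}-model_part-{model_part}-shard{shard}{path.suffix}")


def find_checkpoint_candidates(restore_file: str) -> List[Path]:
    """List the .pt files that actually exist next to the restore file."""
    directory = Path(restore_file).parent
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.pt"))


if __name__ == "__main__":
    restore_file = "1.3b/reshard.pt"  # value passed to --restore-file in the command above
    wanted = expected_shard_path(restore_file)
    print("trainer appears to look for:", wanted)
    print("exists on disk:", wanted.exists())
    for candidate in find_checkpoint_candidates(restore_file):
        print("found on disk:", candidate)
```

If the file printed on the first line is not among the files found on disk, that mismatch is what produces the "No existing checkpoint found" message. The sketch only confirms the naming problem; it does not create the shard file.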