The dataset for the resumed run needs to be the same as before, or larger than the previous one.
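A likely mechanism behind both symptoms (a sketch of the Hugging Face Trainer's resume arithmetic, not LLaMA-Factory's actual code; the 100 000-sample dataset size and single-GPU `world_size` below are made-up assumptions): on resume, the Trainer restores `global_step` from the checkpoint and fast-forwards past batches it considers already seen. If the resumed run schedules fewer total optimizer steps than the restored step count, the epoch loop finishes immediately, so no `logging_steps` lines appear and no new epoch checkpoint is saved.

```python
import math

def total_update_steps(num_samples, epochs, per_device_bs, grad_accum, world_size=1):
    """Approximate number of optimizer steps the Trainer schedules for a run.

    Simplified arithmetic: one pass over the data per epoch, no drop_last.
    With DeepSpeed/DDP, world_size multiplies the effective batch size.
    """
    effective_batch = per_device_bs * grad_accum * world_size
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return int(steps_per_epoch * epochs)

resumed_step = 5748  # restored from saves/checkpoint-5748

# Hypothetical: the resumed run uses a smaller dataset than the original one.
scheduled = total_update_steps(num_samples=100_000, epochs=1.0,
                               per_device_bs=8, grad_accum=8)

# 100 000 / (8 * 8) -> 1563 scheduled steps, far below the restored 5748:
# the Trainer fast-forwards past the end, logging nothing and saving nothing.
skipped_entirely = scheduled <= resumed_step
```

Hence the reply above: the resumed dataset must be at least as large as the original, so that the scheduled step count exceeds the checkpoint's restored step.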
Reminder
System Info
Python version: 3.10.8
PyTorch version: 2.3.0+cu121 (GPU)
Reproduction
```yaml
### model
model_name_or_path: model/MiniCPM3-4B
resume_from_checkpoint: saves/checkpoint-5748

### method
stage: sft
do_train: true
finetuning_type: lora

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: sft_zh_demo
template: cpm3
cutoff_len: 1800
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves
logging_steps: 20
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8  # batch size during training
gradient_accumulation_steps: 8  # gradient accumulation steps
learning_rate: 0.0001
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_steps: 10  # warmup steps
bf16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 8  # batch size during evaluation
eval_strategy: steps
eval_steps: 100  # evaluate every 100 steps
```
Expected behavior
I used resume_from_checkpoint to resume training from a checkpoint and trained for 1 epoch.
During training, no logging_steps logs were printed, and after training finished, no new checkpoint was saved. What is going on?
Others
No response