Continuing training from a checkpoint does not save a new checkpoint after training #5499

Closed
1 task done
cuisws opened this issue Sep 20, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

cuisws commented Sep 20, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

Python version: 3.10.8
PyTorch version: 2.3.0+cu121 (GPU)

Reproduction

model

model_name_or_path: model/MiniCPM3-4B
resume_from_checkpoint: saves/checkpoint-5748

method

stage: sft
do_train: true
finetuning_type: lora

ddp

ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json

dataset

dataset: sft_zh_demo
template: cpm3
cutoff_len: 1800
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16

output

output_dir: saves
logging_steps: 20
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

train

Training batch size

per_device_train_batch_size: 8

Number of gradient accumulation steps

gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 1.0
lr_scheduler_type: cosine

Warmup steps

warmup_steps: 10
bf16: true

eval

val_size: 0.1

Evaluation batch size

per_device_eval_batch_size: 8
eval_strategy: steps

Evaluate once every 500 steps (the value below is set to 100)

eval_steps: 100

Expected behavior

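I used resume_from_checkpoint to resume training from the checkpoint and trained for 1 epoch.
During training no logging_steps output was printed, and after training finished no new checkpoint was saved. What is going on?
[Image: 局部截取_20240920_183524 (partial screenshot)]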
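
One way to narrow this down is to check how far the checkpoint had already progressed. The sketch below is only a diagnostic under assumptions: it assumes the checkpoint directory contains a `trainer_state.json` file with `global_step`, `max_steps`, and `epoch` keys, as checkpoints written by the Hugging Face Trainer normally do. The helper name `read_checkpoint_progress` is hypothetical, and the path is the one from the config above.

```python
import json
from pathlib import Path


def read_checkpoint_progress(checkpoint_dir):
    """Return (global_step, max_steps, epoch) recorded in trainer_state.json.

    Hypothetical helper: it only reads the JSON file and does not touch
    the model or optimizer state.
    """
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    with state_path.open(encoding="utf-8") as f:
        state = json.load(f)
    # Missing keys come back as None instead of raising.
    return state.get("global_step"), state.get("max_steps"), state.get("epoch")


if __name__ == "__main__":
    # Path taken from the config above; adjust to the local layout.
    ckpt = Path("saves/checkpoint-5748")
    if (ckpt / "trainer_state.json").exists():
        step, max_steps, epoch = read_checkpoint_progress(ckpt)
        print(f"global_step={step} max_steps={max_steps} epoch={epoch}")
    else:
        print(f"no trainer_state.json found under {ckpt}")
```

If the recorded `global_step` is already at or beyond the number of steps the new run plans to take, a resumed run would likely have nothing left to do, which would be consistent with seeing no logs and no new checkpoint.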

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Sep 20, 2024

cuisws commented Sep 20, 2024

image


cuisws commented Sep 24, 2024

The dataset must be the same one used for the earlier run, or larger than it.
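
A plausible reading of this fix is that a resumed run restores the step counter from the checkpoint, so a dataset that yields fewer total optimizer steps than the checkpoint's step leaves nothing to train. The sketch below is a rough approximation of that arithmetic, not the trainer's actual code: the rounding rules are an assumption and may differ between versions, the function names are hypothetical, and the sample counts and device count in the usage lines are made-up values, since the issue does not state them.

```python
import math


def optimizer_steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices):
    """Approximate optimizer steps in one epoch (assumed rounding)."""
    # Each device sees roughly num_samples / num_devices samples per epoch.
    batches_per_device = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer step happens every grad_accum_steps batches.
    return max(batches_per_device // grad_accum_steps, 1)


def resume_would_train(checkpoint_step, num_samples, per_device_batch_size,
                       grad_accum_steps, num_devices, num_epochs):
    """Return (has_remaining_steps, planned_total_steps) for a resumed run."""
    per_epoch = optimizer_steps_per_epoch(
        num_samples, per_device_batch_size, grad_accum_steps, num_devices
    )
    total = math.ceil(num_epochs * per_epoch)
    return checkpoint_step < total, total


if __name__ == "__main__":
    # Batch size 8, accumulation 8, and 1 epoch come from the config above;
    # the sample counts and the single device are hypothetical.
    for samples in (100_000, 400_000):
        ok, total = resume_would_train(5748, samples, 8, 8, 1, 1.0)
        print(f"samples={samples} planned_steps={total} remaining_work={ok}")
```

Under these assumptions, the smaller hypothetical dataset plans fewer steps than the checkpoint's 5748, so the resumed run would end immediately, while the larger one still has steps left to run.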

@cuisws cuisws closed this as completed Sep 24, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 25, 2024