The dataset for the resumed run needs to be the same as before, or larger than the previous one.
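A likely mechanism behind both symptoms (a sketch of the Hugging Face Trainer's resume arithmetic, not LLaMA-Factory's actual code; the 100 000-sample dataset size and single-GPU `world_size` below are made-up assumptions): on resume, the Trainer restores `global_step` from the checkpoint and fast-forwards past batches it considers already seen. If the resumed run schedules fewer total optimizer steps than the restored step count, the epoch loop finishes immediately, so no `logging_steps` lines appear and no new epoch checkpoint is saved.

```python
import math

def total_update_steps(num_samples, epochs, per_device_bs, grad_accum, world_size=1):
    """Approximate number of optimizer steps the Trainer schedules for a run.

    Simplified arithmetic: one pass over the data per epoch, no drop_last.
    With DeepSpeed/DDP, world_size multiplies the effective batch size.
    """
    effective_batch = per_device_bs * grad_accum * world_size
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return int(steps_per_epoch * epochs)

resumed_step = 5748  # restored from saves/checkpoint-5748

# Hypothetical: the resumed run uses a smaller dataset than the original one.
scheduled = total_update_steps(num_samples=100_000, epochs=1.0,
                               per_device_bs=8, grad_accum=8)

# 100 000 / (8 * 8) -> 1563 scheduled steps, far below the restored 5748:
# the Trainer fast-forwards past the end, logging nothing and saving nothing.
skipped_entirely = scheduled <= resumed_step
```

Hence the reply above: the resumed dataset must be at least as large as the original, so that the scheduled step count exceeds the checkpoint's restored step.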
Reminder
System Info
Python version: 3.10.8
PyTorch version: 2.3.0+cu121 (GPU)
Reproduction
```yaml
### model
model_name_or_path: model/MiniCPM3-4B
resume_from_checkpoint: saves/checkpoint-5748

### method
stage: sft
do_train: true
finetuning_type: lora

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: sft_zh_demo
template: cpm3
cutoff_len: 1800
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves
logging_steps: 20
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8  # batch size during training
gradient_accumulation_steps: 8  # gradient accumulation steps
learning_rate: 0.0001
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_steps: 10  # warmup steps
bf16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 8  # batch size during evaluation
eval_strategy: steps
eval_steps: 100  # evaluate every 100 steps
```
Expected behavior
I used resume_from_checkpoint to resume training from a checkpoint and trained for 1 epoch.
During training, no logging_steps logs were printed, and after training finished, no new checkpoint was saved. What is going on?
Others
No response