Having troubles changing save_steps parameter for a resumed job. #195

antonpolishko · 2024-08-18T21:32:43Z

I'm running into the following issue. Let's say I start a job:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml

In config_full.yaml I have save_steps: 1000. At some point I realize that saving every 1000 steps is too frequent, so I stop the job, edit config_full.yaml to set save_steps: 10000, and restart the job. Resuming from the checkpoint works as planned, but checkpoints are still saved every 1000 steps (the original value). What am I doing wrong?

lewtun · 2024-08-19T11:47:28Z

Hmm this is a bit strange. If you look at the logs of your second run, do you see save_steps is set to 1000 or 10_000?
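
One more thing that might be worth checking, as a rough sketch rather than a confirmed diagnosis: recent transformers versions write a trainer_state.json into each checkpoint folder, and that file may record the save_steps value from the original run. If the resumed run picks up the stored value, that could explain why the edited config seems to be ignored. The helper names below (stored_save_steps, report_mismatch) are hypothetical, and the presence of a "save_steps" key in trainer_state.json is an assumption about your transformers version.

```python
import json
from pathlib import Path


def stored_save_steps(checkpoint_dir):
    """Return the save_steps value recorded in trainer_state.json, or None if absent."""
    # Assumption: the checkpoint folder contains a trainer_state.json file.
    state_file = Path(checkpoint_dir) / "trainer_state.json"
    with state_file.open() as f:
        state = json.load(f)
    # Assumption: the state file has a "save_steps" key; older versions may not.
    return state.get("save_steps")


def report_mismatch(checkpoint_dir, configured_save_steps):
    """Compare the checkpoint's recorded save_steps with the value in the edited config."""
    stored = stored_save_steps(checkpoint_dir)
    if stored is None:
        return "checkpoint does not record save_steps"
    if stored != configured_save_steps:
        return (
            f"mismatch: checkpoint has save_steps={stored}, "
            f"config has save_steps={configured_save_steps}"
        )
    return f"save_steps={stored} in both checkpoint and config"
```

Calling report_mismatch on the checkpoint folder you resume from, with 10000 as the second argument, should show whether the checkpoint still carries the old value.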
