Having trouble changing the save_steps parameter for a resumed job. #195

Open
antonpolishko opened this issue Aug 18, 2024 · 1 comment

@antonpolishko

I'm running into the following issue. Let's say I start a job:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml

In config_full.yaml I have save_steps: 1000. At some point I realize that saving every 1000 steps is too frequent, so I stop the job, edit config_full.yaml to set save_steps: 10000, and restart the job. Resuming from the checkpoint works as expected, but checkpoints are still saved every 1000 steps (the original value). What am I doing wrong?

@lewtun
Member

lewtun commented Aug 19, 2024

Hmm, this is a bit strange. If you look at the logs of your second run, is save_steps set to 1000 or 10_000?
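
Besides the logs, it may help to check which value the checkpoint itself recorded. Here is a minimal sketch, assuming the checkpoint directory contains a trainer_state.json file with a save_steps field (recent transformers versions appear to write one, but that is an assumption worth verifying for your version). The function name saved_save_steps is hypothetical and not part of any library.

```python
import json
from pathlib import Path


def saved_save_steps(checkpoint_dir):
    """Return the save_steps value recorded in a checkpoint, or None if absent.

    Assumes the checkpoint directory holds a trainer_state.json file;
    returns None when the file or the key is missing.
    """
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    if not state_path.is_file():
        return None
    with state_path.open() as f:
        state = json.load(f)
    return state.get("save_steps")
```

Calling saved_save_steps on the checkpoint folder you resume from and comparing the result with the save_steps in the edited config_full.yaml would show whether the old value is being carried over from the saved state. If the two differ, that would suggest the resumed state, rather than the config, decides the save interval.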
