`top_k` parameter of ModelCheckpoint default value #20539
Comments
hey @arijit-hub, I agree with you that the current behavior is counterintuitive when saving checkpoints periodically. If you're dealing with large models, as people are these days, removing all but the last saved checkpoints is actually desirable though, because you'd run out of disk space really quickly. For this reason I'd rather not change the default; it would be an impactful breaking change. In fact, what I do find counterintuitive is that, in order to retain all checkpoints, you need to change `top_k` to -1. We could add an extra flag that makes this behavior clearer: by default you wouldn't retain periodic checkpoints (as today), and the flag would let you opt in. What do you think?
Hey @lantiga, I agree that adding another flag would be the best solution.
Great! Would you like / have time to take a stab at it?
Hey @lantiga, what I would do is a very simple trick! If you agree I can send in a pull request.
I would avoid layering this on top of `top_k`. I would just avoid removing the checkpoint by adding the condition here: https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/callbacks/model_checkpoint.py#L717
Yes, that sounds good. Let me do a quick pull request for this.
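The guard suggested above could be sketched as follows. This is a minimal, hypothetical illustration: the flag name `save_all_periodic` and the class are placeholders invented for this sketch, not the actual ModelCheckpoint implementation, which lives in `src/lightning/pytorch/callbacks/model_checkpoint.py`.

```python
# Hypothetical sketch of the proposed condition: skip checkpoint removal
# when an opt-in flag (here called `save_all_periodic`) is set.
class CheckpointCallbackSketch:
    def __init__(self, save_all_periodic: bool = False):
        self.save_all_periodic = save_all_periodic
        self.removed = []  # records filepaths we would have deleted

    def _remove_checkpoint(self, filepath: str) -> None:
        # Proposed guard: retain every periodic checkpoint when opted in.
        if self.save_all_periodic:
            return
        self.removed.append(filepath)  # stand-in for the real file removal

cb_keep = CheckpointCallbackSketch(save_all_periodic=True)
cb_keep._remove_checkpoint("epoch=0-step=100.ckpt")
print(cb_keep.removed)  # []
```

With the flag off, behavior is unchanged: older checkpoints are still removed as today.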
Description & Motivation

I believe that the `top_k` parameter would highly benefit from an improved default value. Currently, it defaults to `1`. However, sometimes, for example when I am trying to save models every n train steps, it doesn't make sense to save only the `top_k=1` model. I would presumably like to have it save all the models.

Pitch
So, I would suggest that the default value should be `None`. In cases where `every_n_train_steps` or `every_n_epochs` or `train_time_interval` has some value with `monitor=None` and the `top_k` value is not given, i.e., still bears `None`, it should change to `-1`. However, in cases where the `monitor` parameter is set, the `top_k` value should be changed to `1` if not specified.

Alternatives
No response
Additional context
No response
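The default-resolution rule pitched above can be sketched as a small helper. The function name `resolve_top_k` is hypothetical; the parameter names mirror the `ModelCheckpoint` arguments mentioned in the pitch, and this is only an illustration of the proposed rule, not Lightning's actual code.

```python
from typing import Optional

def resolve_top_k(
    top_k: Optional[int],
    monitor: Optional[str],
    every_n_train_steps: Optional[int] = None,
    every_n_epochs: Optional[int] = None,
    train_time_interval: Optional[object] = None,
) -> int:
    # A user-supplied value always wins.
    if top_k is not None:
        return top_k
    periodic = any(
        v is not None
        for v in (every_n_train_steps, every_n_epochs, train_time_interval)
    )
    if monitor is None and periodic:
        return -1  # pitched behavior: keep every periodic checkpoint
    return 1  # today's default when a metric is monitored

print(resolve_top_k(None, None, every_n_train_steps=500))        # -1
print(resolve_top_k(None, "val_loss", every_n_train_steps=500))  # 1
print(resolve_top_k(5, None, every_n_epochs=1))                  # 5
```

This keeps the monitored case at `1` and only changes the default for periodic, unmonitored checkpointing, matching the cases enumerated in the pitch.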
cc @lantiga @Borda