
top_k parameter of ModelCheckpoint default value #20539

Open
arijit-hub opened this issue Jan 9, 2025 · 6 comments · May be fixed by #20547
Labels
feature Is an improvement or enhancement

Comments

@arijit-hub

arijit-hub commented Jan 9, 2025

Description & Motivation

I believe the default value of top_k would benefit from a change. Currently it defaults to 1. However, in some cases, for example when I am saving a model every n training steps, it doesn't make sense to keep only the top_k=1 checkpoint; I would presumably like to save all of the models.

Pitch

I would suggest that the default value be None. When every_n_train_steps, every_n_epochs, or train_time_interval is set together with monitor=None and top_k is not given (i.e., it is still None), it should resolve to -1. However, when the monitor parameter is set and top_k is not specified, it should resolve to 1.
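
For illustration, a minimal usage sketch of the behavior discussed above, using the existing ModelCheckpoint arguments (in the current API the parameter is called save_top_k; the paths and intervals below are made up):

```python
from lightning.pytorch.callbacks import ModelCheckpoint

# Today: with monitor=None and the default save_top_k=1, only the most
# recent periodic checkpoint is kept; older ones are removed.
periodic_latest_only = ModelCheckpoint(
    dirpath="checkpoints/",
    every_n_train_steps=1000,
    monitor=None,
)

# Keeping every periodic checkpoint currently requires an explicit -1;
# under the proposal this would become the resolved default whenever
# monitor=None and the value is left unset.
periodic_keep_all = ModelCheckpoint(
    dirpath="checkpoints/",
    every_n_train_steps=1000,
    monitor=None,
    save_top_k=-1,
)
```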

Alternatives

No response

Additional context

No response

cc @lantiga @Borda

@arijit-hub arijit-hub added feature Is an improvement or enhancement needs triage Waiting to be triaged by maintainers labels Jan 9, 2025
@lantiga
Collaborator

lantiga commented Jan 13, 2025

Hey @arijit-hub, I agree with you that the current behavior is counterintuitive when using every_n_train_steps (or epochs, etc.) in the case where I want all checkpoints and not just the last one.

If you're dealing with large models, as people are these days, removing all but the last saved checkpoint is actually desirable, though, because you'd run out of disk space really quickly otherwise. For this reason I'd rather not change the default; it would be an impactful breaking change.

In fact, what I do find counterintuitive is that, in order to retain all checkpoints, you need to change save_top_k to -1. The fact that a checkpoint saved periodically (with every_n_train_steps or similar) falls under the top_k logic at all is counterintuitive.

We could add an extra flag that makes this behavior clearer. By default you don't retain periodic checkpoints (retain_periodic_checkpoints=False, we can find a better name ofc), and you can flip the switch and retain them.

What do you think?

@lantiga lantiga removed the needs triage Waiting to be triaged by maintainers label Jan 13, 2025
@arijit-hub
Author

Hey @lantiga, I agree that adding another flag would be the best solution. I think retain_periodic_checkpoints already sounds good!

@lantiga
Collaborator

lantiga commented Jan 14, 2025

Great! Would you like / have time to take a stab at it?

@arijit-hub
Author

arijit-hub commented Jan 14, 2025

Hey @lantiga,

What I would do is a very simple trick: if retain_periodic_checkpoint is set to True, I would just set self.top_k=-1 in the constructor. This won't break anything and won't need any extra work. What do you think?

If you agree I can send in a pull request.
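
A rough sketch of what that constructor trick might look like (hypothetical: retain_periodic_checkpoint is not an existing ModelCheckpoint argument, and in the current API the underlying parameter is called save_top_k):

```python
from lightning.pytorch.callbacks import ModelCheckpoint


class PeriodicModelCheckpoint(ModelCheckpoint):
    """Hypothetical subclass illustrating the proposed constructor trick."""

    def __init__(self, *args, retain_periodic_checkpoint: bool = False, **kwargs):
        if retain_periodic_checkpoint:
            # Map the new flag onto the existing top-k machinery so that
            # every periodic checkpoint is kept.
            kwargs["save_top_k"] = -1
        super().__init__(*args, **kwargs)
```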

@lantiga
Collaborator

lantiga commented Jan 14, 2025

I would avoid layering this on top of save_top_k; it would make the implementation quite hard to understand.

I would just avoid removing the checkpoint by adding the condition here: https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/callbacks/model_checkpoint.py#L717
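
A hedged sketch of that alternative, written here as a user-side subclass rather than as the actual change to the linked line; it assumes removal of superseded checkpoints goes through the internal _remove_checkpoint method, which may differ across Lightning versions:

```python
from lightning.pytorch.callbacks import ModelCheckpoint


class RetainPeriodicCheckpoints(ModelCheckpoint):
    """Hypothetical sketch: skip deletion of periodic checkpoints."""

    def __init__(self, *args, retain_periodic_checkpoints: bool = True, **kwargs):
        super().__init__(*args, **kwargs)
        self.retain_periodic_checkpoints = retain_periodic_checkpoints

    def _remove_checkpoint(self, trainer, filepath: str) -> None:
        # When no metric is monitored, the checkpoints being removed are the
        # periodic ones; keep them if the flag is set.
        if self.retain_periodic_checkpoints and self.monitor is None:
            return
        super()._remove_checkpoint(trainer, filepath)
```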

@arijit-hub
Author

Yes, that sounds good. Let me put up a quick pull request for this.
