Ending checkpoints experiments #6104
-
There is a proposed workaround in https://github.com/iterative/dvc/issues/6084#issuecomment-853323353:
This isn't ideal if I don't know the number of iterations until after starting the experiment, since I will then have to change my code mid-experiment to track the step number and exit after step
-
DVC checkpoint experiments are only indefinite if the user's stage is indefinite. Maybe I'm not fully understanding the scenarios here, but it seems to me that the user should be handling all of this themselves in their stage code.
Users can already accomplish this themselves using a param that sets the number of iterations their loop is executed (or runs forever when set to
Users can also handle this themselves in their stage code. Marking a checkpoint as "finished" just means that the stage has to actually return (or exit with 0). So if the user wants it to be "finished" once a metric value has reached some threshold/delta/etc., they should check that value themselves in their stage code and exit the loop once the condition has been reached. I think the confusion here may just be that we currently only have very simple example checkpoint projects. Right now we only document examples where we either run forever or run for a hard-coded number of iterations, but these are not the only ways you can design stages that call
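To make the idea above concrete, here is a minimal sketch of a stage loop that stops either after a configured number of iterations or once a metric crosses a threshold. Everything in it is an assumption for illustration: `train_until_done`, `train_step`, `save_checkpoint`, `max_iters`, and `target_loss` are hypothetical names, not DVC APIs. `save_checkpoint` only stands in for wherever the stage code would signal a checkpoint, and using `None` to mean "no limit" is just this sketch's own convention.

```python
from typing import Callable, Optional


def train_until_done(
    train_step: Callable[[int], float],
    save_checkpoint: Callable[[int, float], None],
    max_iters: Optional[int] = None,
    target_loss: Optional[float] = None,
) -> int:
    """Run train_step until a stopping condition holds.

    Returns the number of completed steps. With both limits left as
    None the loop never ends, which mirrors an indefinite stage.
    """
    step = 0
    while max_iters is None or step < max_iters:
        loss = train_step(step)  # hypothetical: one unit of training
        step += 1
        save_checkpoint(step, loss)  # hypothetical: record a checkpoint
        # Stop once the metric is good enough; the function then returns,
        # so the stage can exit normally and count as finished.
        if target_loss is not None and loss <= target_loss:
            break
    return step


if __name__ == "__main__":
    history = []
    steps = train_until_done(
        train_step=lambda i: 1.0 / (i + 1),  # toy loss that shrinks each step
        save_checkpoint=lambda step, loss: history.append((step, loss)),
        max_iters=100,
        target_loss=0.25,
    )
    print(steps, history[-1])
```

In a real project, the two limits would presumably come from the stage's params, so the stopping rule can be changed between experiments without editing the loop.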
-
Let me summarize my problem here: Suppose we have a 3-stage pipeline: (1)
Now, suppose we use Instead we should have something like It may be desirable to use This makes DVC pipelines not-so-useful when using checkpoints and Keras. Either you give up on pipelines and put everything in the last step within the callback, or don't use while true ; do dvc exp run ; done |
-
The problem is that there is a loop inside the pipeline, and we don't know when to exit it.
-
Is it too late for us to take a step back and reconsider what should be the default behavior of From the perspective of someone who is coming from common |
-
In the newer So, if the experiments are run with
it runs the checkpoints indefinitely. Otherwise it runs them with a set number of epochs/checkpoints.
-
Bumping this back from the dead. We have given this a lot of time, and it still seems to me that a lot of the confusion around checkpoints would be resolved by implementing the suggestion in #6104 (reply in thread):
It would be great to get feedback on how feasible this is. Ideally we would do it before DVC 3.0, so that we don't have to worry about breaking changes after that.
-
In many ML and experiment tracking frameworks, users can train a model for a fixed number of checkpoints or specify when the experiment is finished. In DVC, checkpoint experiments are indefinite, which lets users iterate on them as much as they like, but is also unintuitive and problematic in some scenarios:
n checkpoints (https://discuss.dvc.org/t/experiments-with-checkpoint/744 and https://github.com/iterative/dvc/issues/6084#issuecomment-853323353)
It seems like there needs to be a mechanism to either set the number of checkpoints or mark the experiment as finished, so that a checkpoint experiment can have a finite end.