Ending checkpoints experiments #6104

dberenbaum · 2021-06-02T19:57:35Z

dberenbaum
Jun 2, 2021
Collaborator

In many ML and experiment tracking frameworks, users can train a model for a fixed number of checkpoints or specify when the experiment is finished. In DVC, checkpoint experiments are indefinite, which lets users iterate on them as much as they like, but is also unintuitive and problematic in some scenarios:

- When there are stages downstream of the checkpoints stage (https://discuss.dvc.org/t/experiments-with-checkpoint/744 and get-started-checkpoints: Convert checkpoints' examples to indefinite-training example-repos-dev#43 (comment))
- When the experiment is interrupted and I want to continue until I reach n checkpoints (https://discuss.dvc.org/t/experiments-with-checkpoint/744 and https://github.com/iterative/dvc/issues/6084#issuecomment-853323353)
- When my data or other dependency changes and I want to reproduce my checkpoints experiment with the same number of checkpoints as before

It seems like there needs to be a mechanism to either set the number of checkpoints or to mark the experiment as finished so that there can be a finite end to a checkpoints experiment.

dberenbaum · 2021-06-02T20:03:08Z

dberenbaum
Jun 2, 2021
Collaborator Author

There is a proposed workaround in https://github.com/iterative/dvc/issues/6084#issuecomment-853323353:

Ah, so your jobs may be cancelled before they run to completion. In that case, instead of iterating over a constant range of steps/epochs like in dvc-checkpoints-mnist, you want to track the step number by reading in dvclive.json (or whatever path you are using for dvclive) and iterate until that step number is reached, right? That should also allow you to dvc exp run without having to manually run the downstream stages separately.
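
A minimal sketch of the workaround quoted above. It assumes the dvclive summary file is a JSON object with a top-level "step" key holding the last completed step; the path dvclive.json comes from the comment, while the key name and both helper names are assumptions made for illustration.

```python
import json
import os


def last_completed_step(summary_path="dvclive.json"):
    """Return the last step recorded in the summary file, or 0 if none exists."""
    if not os.path.exists(summary_path):
        return 0
    with open(summary_path) as f:
        # Assumption: the summary is a JSON object with a "step" key.
        return int(json.load(f).get("step", 0))


def remaining_steps(target_step, summary_path="dvclive.json"):
    """Steps still to run so that the experiment ends at target_step."""
    start = last_completed_step(summary_path)
    return range(start + 1, target_step + 1)
```

A training loop would then iterate over remaining_steps(n) instead of a constant range, so a job cancelled partway through picks up at the next step rather than starting its count again.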

This isn't ideal if I don't know the number of iterations until after starting the experiment, since I will then have to change my code mid-experiment to track the step number and exit after step n. There are also probably much more intuitive ways dvc could handle this.

1 reply

DavidGOrtega Jun 2, 2021

instead of iterating over a constant range of steps/epochs like in dvc-checkpoints-mnist,
I don't know the number of iterations until after starting the experiment

I think that might be the most common scenario, at least the one used in every paper that I have read. You could also stop it by checking the change in loss, etc., but how do you predict in advance what that change is going to be and where it will bottom out? It's the same dog with a different collar.

pmrowla · 2021-06-03T00:36:44Z

pmrowla
Jun 3, 2021

In DVC, checkpoint experiments are indefinite, which enables users to iterate on them as much as they like, but also is unintuitive and problematic in some scenarios

DVC checkpoint experiments are only indefinite if the user's stage is indefinite. Maybe I'm not fully understanding the scenarios here, but it seems to me that the user should be handling all of this themselves in their stage code.

It seems like there needs to be a mechanism to either set the number of checkpoints

Users can already accomplish this themselves using a param that sets the number of iterations their loop is executed (or runs forever when set to 0/None/whatever the user decides)

or to mark the experiment as finished so that there can be a finite end to a checkpoints experiment.

Users can also handle this themselves in their stage code. Marking a checkpoint as "finished" just means that the stage has to actually return (or exit with 0). So if the user wants it to be "finished" once a metric value has reached some threshold/delta/etc., they should be checking that value themselves in their stage code, and then exiting the loop once the condition has been reached.
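
A minimal sketch of a stage loop that ends itself once a metric stops changing. Here train_step and checkpoint are hypothetical callables supplied by the user; in a stage like the ones discussed here, checkpoint would be the make_checkpoint() call. The function name and the min_delta/max_epochs parameters are assumptions for illustration.

```python
def train_until_converged(train_step, min_delta=1e-3, max_epochs=None,
                          checkpoint=lambda: None):
    """Run train_step(epoch) until the loss improves by less than min_delta.

    Returns the number of epochs that were run. With max_epochs=None the
    loop is bounded only by the convergence check.
    """
    previous_loss = None
    epoch = 0
    while max_epochs is None or epoch < max_epochs:
        epoch += 1
        loss = train_step(epoch)
        checkpoint()  # record one checkpoint per epoch
        if previous_loss is not None and previous_loss - loss < min_delta:
            break  # converged: leave the loop so the stage returns normally
        previous_loss = loss
    return epoch
```

Because the function returns normally once the condition is met, the stage command exits with status 0, which is what "finished" means in the comment above.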

I think maybe the confusion here is just that we currently only have very simple example checkpoint projects. Right now we only document examples where we either run forever or run for a hard-coded number of fixed iterations, but these are not the only ways you can design stages that call make_checkpoint().

9 replies

jorgeorpinel Jun 10, 2021

right now we only document examples where we either run forever or run for a hard coded number of fixed iterations, but these are not the only ways you can design stages that call make_checkpoint().

Should we add a checkbox to iterative/dvc.org#2432 or create a new issue about documenting a finite checkpoint stage? Sounds like it may even be the most common scenario from some of the comments here.

dberenbaum Jun 11, 2021
Collaborator Author

🤔 Those bullets are about different ways to implement checkpoints, whereas this is more about different use cases, so we can either list it separately in that issue or create a new one.

dberenbaum Jul 16, 2021
Collaborator Author

Marking a checkpoint as "finished" just means that the stage has to actually return (or exit with 0). So if the user wants it to be "finished" once a metric value has reached some threshold/delta/etc, they should be checking that value themselves in their stage code, and then exiting the loop once the condition has been reached.

This makes perfect sense, but it doesn't quite match my experience with checkpoints now. Once a stage completes cleanly, I wouldn't expect it to run again unless its dependencies have changed (like any other stage). Instead, it still runs and we leave it to the user to ensure that their code is idempotent and doesn't make any changes when run again.

Why do we still run the stage again if it has completed cleanly and dependencies haven't changed? Is it because we assume the model file is a circular dependency? The model file might be better treated like a persistent output than a circular dependency. I would not expect my stage to be triggered by whether that model file changed.

cc @daavoo

daavoo Aug 19, 2021

This makes perfect sense, but it doesn't quite match my experience with checkpoints now. Once a stage completes cleanly, I wouldn't expect it to run again unless its dependencies have changed (like any other stage). Instead, it still runs and we leave it to the user to ensure that their code is idempotent and doesn't make any changes when run again.

Agree with this.

At the very least, I would expect dvc exp run --reset not to rely on the idempotence of the user's code, and not to run the "checkpoint" stage again if there are no changes in deps/params.

daavoo Aug 23, 2021

Opened #6472

iesahin · 2021-06-08T14:41:07Z

iesahin
Jun 8, 2021

Let me summarize my problem here:

Suppose we have a 3-stage pipeline: (1) prepare (2) train (3) evaluate.

train stage produces the model, evaluate stage produces the metrics.

Now, suppose we use make_checkpoint in (2) after each epoch. dvc exp run will reset the environment for the next epoch before metrics are produced by evaluate.

Instead we should have something like begin-checkpoint in train and end-checkpoint in evaluate.

It may be desirable to use make-checkpoint in evaluate, (or in any final stage) but Keras uses callbacks to store the models after each epoch, and evaluate should be performed within the callback.

This makes DVC pipelines not so useful when using checkpoints with Keras. Either you give up on pipelines and put everything in the last step within the callback, or you don't use make_checkpoint and instead use something like the following with basic checkpoints:

while true ; do dvc exp run ; done

9 replies

iesahin Jun 29, 2021

I thought I had replied to this, sorry for the delay.

AFAIU, checkpoints (with make_checkpoint() or a signal file) move experiment artifacts to the .git/objects/exp/12/2938.../exp-20398/ directory.

When we have an artifact in an earlier stage, e.g., train and I want to use checkpoints at that stage, the later stage, e.g., evaluate can't be run because all the artifacts are moved from the workspace.

If train produces model.h5, and uses make_checkpoint, then model.h5 is moved to git/objects/exp... and evaluate cannot find model.h5.

make_checkpoint should wait for all stages to be run before moving artifacts to the .git/objects/exp tree.

pmrowla Jun 29, 2021

That's not how experiments work. Checkpoints made with make_checkpoint() are just regular git commits that are identical to any other git commit that you would make with CLI git commit.

The only unique thing about experiments is that the named reference to a git commit we create is stored in refs/exps/..., whereas the typical named references to a git commit you would make in CLI git are stored in refs/heads (for Git branches) and refs/tags (for Git tags). Experiments are functionally identical to Git tags and branches.

And artifacts aren't moved anywhere at all when a stage completes, they are left in the workspace the same as they would be for any other dvc repro or dvc exp run pipeline run.

iesahin Jun 29, 2021

I had the error in iterative/example-repos-dev#47 while running an earlier version of https://github.com/iterative/get-started-checkpoints . I wasn't able to use the checkpoints outside of the callback, because some artifacts were lost in the workspace.

Hence, my interpretation of the checkpoints was as above: moving the changed artifacts to the tree. I'll need to check why those artifacts were lost and whether checkpoints really cause this. Thank you.

dberenbaum Jun 29, 2021
Collaborator Author

@pmrowla Are dvc-tracked outputs being cached at each checkpoint also? I thought so, but I'm not sure anymore.

pmrowla Jun 30, 2021

@dberenbaum yes, dvc-tracked outputs are cached at each checkpoint (or at least they are supposed to be)

it sounds like @iesahin may have encountered a bug if artifacts were actually disappearing, but it's hard to tell without an actual reproducible example

karajan1001 · 2021-07-03T14:27:28Z

karajan1001
Jul 3, 2021

The problem is that there is a cycle inside the pipeline, and we don't know when to jump out of it.
In my opinion, maybe we need an if statement to control the pipeline. Steps/epochs are not enough; early stopping might be used to prevent overfitting.

8 replies

dberenbaum Jul 6, 2021
Collaborator Author

Good discussion!

In this example, training should exit after epoch 6 and then return to epoch 4, because after that iteration the loss stayed unchanged.

This makes sense since fewer epochs would usually be preferable, but is this something you have seen done? For example, keras by default would keep epoch 6, or there's a restore_best_weights option so that it would return to epoch 4 only if it was better than epochs 5 and 6: https://keras.io/api/callbacks/early_stopping/. I think keeping epoch 6 would probably be sufficient to start, although we could work on more advanced scenarios once we have that implemented.

The loss, accuracy and early stopping criteria are usually domain dependent.

Looking at the keras callback above, I think integrating with callbacks like these or making a similar one for dvclive would make sense and hopefully cover most cases.
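
A framework-free sketch of patience-based early stopping, to make the "keep epoch 6" versus "return to epoch 4" choice concrete. It is loosely modeled on the idea behind such callbacks and is not the Keras implementation; the function name and parameters are assumptions for illustration.

```python
def early_stopping(losses, patience=2, min_delta=0.0):
    """Return (stopped_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on
    the best loss by more than min_delta. If the losses run out first,
    stopped_epoch is simply the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    stopped_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        stopped_epoch = epoch
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return stopped_epoch, best_epoch
```

With losses of [1.0, 0.8, 0.6, 0.5, 0.5, 0.5] and patience=2, this returns (6, 4): training stops after epoch 6 and the best epoch is 4. Keeping the final state corresponds to using stopped_epoch, while a restore-best option corresponds to going back to best_epoch.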

karajan1001 Jul 6, 2021

Yes, I used to choose 4, just following the rule of Occam's razor: "Entities are not to be multiplied without necessity." At epoch 4, we already get the lowest loss/bias, while epoch 6 would give a higher variance.

karajan1001 Jul 6, 2021

In real cases, the loss of epoch 6 would be slightly smaller than that of epoch 4.
Both choices are common: epoch 4 (lowest variance and acceptable bias) or the epoch with the lowest loss (lowest bias + variance).

dberenbaum Jul 14, 2021
Collaborator Author

cc @daavoo

daavoo Jul 14, 2021

Looking at the keras callback above, I think integrating with callbacks like these or making a similar one for dvclive would make sense and hopefully cover most cases.

Interestingly, that's exactly what the mlflow<>keras integration does:

https://github.com/mlflow/mlflow/blob/master/mlflow/keras.py#L765

daavoo · 2021-08-20T09:57:39Z

daavoo
Aug 20, 2021

dvc checkpoints are currently only suitable for the "indefinite training" workflow, which is not as widespread as other workflows. My overall feeling is that we would need to do a significant amount of changes to properly support more common workflows (like the 3 bullets listed at the beginning of the discussion).

Is it too late for us to take a step back and reconsider what should be the default behavior of dvc checkpoints?

From the perspective of someone who is coming from common dvc repro + ML checkpoints workflows, I would expect the "indefinite training" workflow to be an optional feature, not the default behavior.

1 reply

dberenbaum Aug 27, 2021
Collaborator Author

@daavoo I think you mentioned somewhere that all checkpoints could be encapsulated within a single stage run rather than each representing a distinct stage run. I'm not sure where that was, but do you think that would help for more typical workflows?

In that case, completed stages would not be run again, but any stage that doesn't successfully complete would be. I'm not sure how hard it would be to commit checkpoints in the middle of running a stage.

iesahin · 2021-08-31T14:38:25Z

iesahin
Aug 31, 2021

In the newer dvc-example-checkpoints-tensorflow project which evolved from get-started-experiments, I'm using something like:

So, if the experiments are run with

dvc exp run -S train.epochs=0

it runs the checkpoints indefinitely. Otherwise it runs them with a set number of epochs/checkpoints.
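
A minimal sketch of such a convention, where a value of 0 for the epochs parameter means training continues until interrupted. This is not the code from the project mentioned above; epoch_range is a hypothetical helper, and reading train.epochs from the params file is left out.

```python
import itertools


def epoch_range(epochs):
    """Iterate over 1..epochs, or forever when epochs is 0."""
    if epochs == 0:
        return itertools.count(1)  # unbounded: runs until the stage is stopped
    return range(1, epochs + 1)
```

The training loop can then be written once as "for epoch in epoch_range(epochs)" and serve both the indefinite and the fixed-length case.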

2 replies

dberenbaum Sep 1, 2021
Collaborator Author

Have you seen the new https://dvc.org/doc/dvclive/api-reference/get_step?

daavoo Sep 3, 2021

One thing to note: it seems that the code is not loading the model weights (if they already exist), so if you interrupt and resume the indefinite training you won't really be resuming: iterative/dvc.org#2742

dberenbaum · 2021-12-17T21:27:38Z

dberenbaum
Dec 17, 2021
Collaborator Author

Bumping this back from the dead. We have given this a lot of time, and it still seems to me that a lot of the confusion around checkpoints would be resolved by implementing the suggestion in #6104 (reply in thread):

Once a stage completes cleanly, I wouldn't expect it to run again unless its dependencies have changed (like any other stage).

Why do we still run the stage again if it has completed cleanly and dependencies haven't changed? Is it because we assume the model file is a circular dependency? The model file might be better treated like a persistent output than a circular dependency. I would not expect my stage to be triggered by whether that model file changed.

It would be great to get feedback on how feasible this is. If we can do it before we get to DVC 3.0 so we don't have to worry about breaking changes after that, it would be ideal.

6 replies

pmrowla Dec 23, 2021

My understanding of the new expected behavior is that:

If I exp run 10 checkpoint epochs and then stop, the next exp run command should do nothing.

This can be done, but I think we still need to clarify some further questions about the desired behavior:

1. How does this affect reproducibility?
   - After the initial run, dvc.lock will contain a dependency state mapped to a "post-10-epochs" output state. DVC's run-cache would also contain an entry mapping the initial dependency state to the "post-10-epochs" output state.
   - If another user does a fresh clone of the repo, dvc pulls run-cache, and then tries to do dvc exp run with that same dependency state, is the expected result that the stage would be re-run (and 10 checkpoints would be generated)? Or is the expected result that DVC would use the run-cache entry and just generate a single exp commit containing the run-cached "post-10-epochs" output state?
   - Should run-cache be used at all in checkpoints scenarios?
2. Will checkpoints still be resumable?
3. If the user changes the dependency state, does DVC re-run the entire thing from scratch using the new dependency state, or does DVC resume from the result of the initial run?

2 and 3 may be easier to think about with a params dep example. Let's say I have the # of epochs to run defined as an actual DVC-tracked parameter epochs. So for my initial run (to get 10 checkpoints), I did dvc exp run -S epochs=10. If I now decide I want to run 10 more (to get a total of 20 checkpoints), what does that look like?

- If I do dvc exp run -S epochs=20 from this state, does DVC start from scratch, and run 20 brand-new checkpoints?
  - In this case, DVC now has two separate mappings of dependency->output states:
    1. After initial run: epochs=10 -> result after 10 checkpoints
    2. After follow-up run: epochs=20 -> result after 20 checkpoints
- If I do dvc exp run -S epochs=20 from this state, does DVC continue from the current state and run 20 additional checkpoints?
  - In this case, DVC now has two separate mappings of dependency->output states:
    1. After initial run: epochs=10 -> result after 10 checkpoints
    2. After follow-up run: epochs=20 -> result after 30 (10 + 20) checkpoints
- Do we (re-)add a separate/explicit --resume flag so I can do something like dvc exp run --resume -S epochs=10?
  - In this case, the "dependency state" as DVC sees it has not changed (the parameter value epochs=10 has not changed at all). Rather, the dep -> output mapping has been overwritten:
    1. After initial run: epochs=10 -> result after 10 checkpoints
    2. After follow-up run: epochs=10 -> result after 20 (10 + 10) checkpoints
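
The arithmetic behind the three options above can be written out as a toy model. This is only an illustration of the mappings listed, not DVC behavior; the function name and the mode strings are assumptions.

```python
def total_checkpoints(existing, epochs, mode):
    """Checkpoints in the output state after a follow-up run (toy model).

    existing: checkpoints already produced by the initial run
    epochs:   value of the epochs parameter for the follow-up run
    """
    if mode == "from_scratch":
        return epochs             # earlier work discarded, `epochs` new ones
    if mode == "continue":
        return existing + epochs  # earlier work kept, `epochs` more added
    raise ValueError(f"unknown mode: {mode}")
```

Starting from 10 checkpoints, epochs=20 from scratch maps to 20, epochs=20 continuing maps to 30, and a resume with the unchanged epochs=10 maps to 20, which is the case where the same dependency state ends up pointing at a different output state.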

These considerations are avoided in the current DVC behavior, since the checkpoint model itself is essentially also part of the "dependency state". So after every checkpoint, we have a new/changed dependency state that is used in the next checkpoint iteration (i.e. checkpoints are always_changed and never considered "reproducible")

@dberenbaum

dberenbaum Dec 30, 2021
Collaborator Author

Here's the basic proposal of how I see this working:

- Map initial dependency state to final output state in the run-cache.
- Don't save to the run-cache until the experiment "completes."
- Save the state of each checkpoint in the cache but not the run-cache.
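
A toy model of the three points above, as a sketch only. It is not DVC's cache implementation: the class, its attributes, and the fail_at parameter are hypothetical, dependency states are plain strings, and resuming from the last checkpoint after an interruption is left out for brevity.

```python
class ToyExperimentStore:
    """Toy model of the proposal; not DVC's actual cache or run-cache."""

    def __init__(self):
        self.cache = []      # every checkpoint state is saved here
        self.run_cache = {}  # dependency state -> final output state

    def run(self, deps, epochs, fail_at=None):
        """Run an experiment; return its final state, or None if interrupted."""
        if deps in self.run_cache:
            return self.run_cache[deps]  # completed before: nothing to run
        state = None
        for epoch in range(1, epochs + 1):
            if epoch == fail_at:
                return None  # interrupted: the run-cache is left untouched
            state = f"{deps}@epoch{epoch}"
            self.cache.append(state)  # checkpoint saved to the cache only
        self.run_cache[deps] = state  # written only on clean completion
        return state
```

An interrupted run leaves checkpoints in the cache but no run-cache entry, so a later run with the same dependency state executes again; once a run completes, repeating it with unchanged dependencies returns the stored result without executing anything.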

If another user does a fresh clone of the repo, dvc pulls run-cache, and then tries to do dvc exp run with that same dependency state, is the expected result that the stage would be re-run (and 10 checkpoints would be generated)? Or is the expected result that DVC would use the run-cache entry and just generate a single exp commit containing the run-cached "post-10-epochs" output state?

The latter makes sense to me. This is what would happen with a non-checkpoint experiment, right? Ideally, if the user pulls the experiment in addition to the run-cache, then they would get a conflict and be referred back to the original experiment, right?

Will checkpoints still be resumable?

For interrupted experiments, they can behave the same as now, preferably resuming from the last checkpoint. "Completed" experiments would not be resumable.

If the user changes the dependency state, does DVC re-run the entire thing from scratch using the new dependency state, or does DVC resume from the result of the initial run?

For interrupted experiments, they can behave the same as now, resuming from the result of the initial run. Once the experiment completes successfully, then DVC should no longer resume that experiment and should start a new experiment from scratch if the dependency state changes (or refuse to run without -f if the dependency state has not changed).

If I do dvc exp run -S epochs=20 from this state, does DVC start from scratch, and run 20 brand-new checkpoints?

Yes, since this would match the non-checkpoints behavior. If a user wants to extend experiments manually until finding the right number of epochs, I would suggest not setting epochs as a parameter since that implies a fixed, predetermined number of epochs.

pmrowla Jan 4, 2022

For interrupted experiments, they can behave the same as now, preferably resuming from the last checkpoint. "Completed" experiments would not be resumable.

For interrupted experiments, they can behave the same as now, resuming from the result of the initial run. Once the experiment completes successfully, then DVC should no longer resume that experiment

What is the definition of "interrupted experiments" here? (Does "interrupted stage" == "failed stage"?)

Currently DVC does not distinguish between successful (stage command exits with status 0) and failed (stage command exits with any non-zero status) runs with regard to checkpoints (and everything is resumable).

If we define "interrupted" to mean "exits with non-zero status", it will break the existing "run-forever until ctrl-c" example use case. If the user's code handles SIGINT "properly" and the stage exits with status 0 when the user presses ctrl-c, DVC will count that as a "successful" run, meaning that this case is no longer resumable.

dberenbaum Jan 4, 2022
Collaborator Author

What is the definition of "interrupted experiments" here? (Does "interrupted stage" == "failed stage"?)

Yeah, that's what I meant.

If we define "interrupted" to mean "exits with non-zero status", it will break the existing "run-forever until ctrl-c" example use case. If the user's code handles SIGINT "properly" and the stage exits with status 0 when the user presses ctrl-c, DVC will count that as a "successful" run, meaning that this case is no longer resumable.

Doesn't this simplify things since this use case will no longer need to catch the SIGINT exception?

pmrowla Jan 11, 2022

Yes, but it's something we'll have to document (since if users were handling SIGINT themselves before, they'll have to update their stages accordingly after this proposed behavior change).
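
A minimal sketch of the distinction discussed in this exchange, assuming "interrupted" means a non-zero exit status. In Python, ctrl-c surfaces as KeyboardInterrupt; the function names are hypothetical, and 130 is only the conventional shell status for a process ended by SIGINT.

```python
def run_stage(stage_fn):
    """Run stage_fn and return an exit status like a stage command would."""
    try:
        stage_fn()
    except KeyboardInterrupt:
        # The stage does not swallow ctrl-c, so the run counts as interrupted.
        return 130
    return 0


def is_resumable(exit_status):
    """Under the proposal, only runs that did not exit with 0 stay resumable."""
    return exit_status != 0
```

A stage that catches KeyboardInterrupt itself and returns normally would yield status 0 here, and so would be treated as completed rather than resumable, which is the behavior change that would need documenting.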
