Feature exp run: Dryer resume within the CI #6823

Closed
DavidGOrtega opened this issue Oct 18, 2021 · 4 comments
Labels
A: experiments, enhancement, p2-medium

Comments

@DavidGOrtega

DavidGOrtega commented Oct 18, 2021

Issue

In the CI, to be able to resume training with preexisting checkpoints we have to do something like:

EXP_NAME=cml-run-${GITHUB_SHA}
EXP_AVAIL=$(dvc exp pull --run-cache origin $EXP_NAME || echo '')
if [[ -z "$EXP_AVAIL" ]]; then
    # printf interprets \n; plain echo in bash would print it literally
    printf '############\nFirst Time\n############\n'
    dvc exp run -n "$EXP_NAME" --pull -v
else
    printf '############\nResuming\n############\n'
    dvc exp apply "$EXP_NAME"
    dvc exp run -v
fi

Would be nice if we had:

  • a flag for dvc exp run -n $EXP_NAME that pulls and applies an existing experiment

So it would become:

EXP_NAME=cml-run-${GITHUB_SHA}
dvc exp run -n $EXP_NAME --pull-apply -v

Additional issue

Please note:

EXP_AVAIL=$(dvc exp pull --run-cache origin $EXP_NAME || echo '')

This is because dvc exp pull --run-cache origin $EXP_NAME will throw an error if no previous experiments are present.
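
Until such a flag exists, the pull-then-apply logic could be wrapped in a small shell function. This is only a minimal sketch, not part of DVC: the function name exp_run_resumable is hypothetical, the remote is assumed to be origin, and it relies on dvc exp pull exiting non-zero when the experiment is absent, as noted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DVC): approximates the proposed --pull-apply.
# Usage: exp_run_resumable <exp-name> [extra `dvc exp run` args...]
exp_run_resumable() {
    local name="$1"
    shift

    # Assumption: the Git remote is called "origin", as in the snippets above.
    # Running the pull inside `if` branches on its exit status and keeps a
    # failed pull from aborting a CI shell that runs with `set -e`.
    if dvc exp pull --run-cache origin "$name"; then
        echo "Resuming $name"
        dvc exp apply "$name" && dvc exp run "$@"
    else
        echo "First run of $name"
        dvc exp run -n "$name" --pull "$@"
    fi
}
```

With this in place the CI step would reduce to something like exp_run_resumable "cml-run-${GITHUB_SHA}" -v.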

@DavidGOrtega DavidGOrtega changed the title exp run: Better resume within the CI exp run: Dryer resume within the CI Oct 18, 2021
@DavidGOrtega DavidGOrtega changed the title exp run: Dryer resume within the CI Feature exp run: Dryer resume within the CI Oct 18, 2021
@daavoo daavoo added the A: experiments label Oct 18, 2021
@casperdcl
Contributor

casperdcl commented Mar 19, 2022

Posting old message before it gets lost: upshot of auto-pull checkpoints, we need to

  • exp pull && exp apply
  • exp run: specify an experiment name first time but not when resuming

EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr

# Branch on the exit status of the pull; its output is discarded.
if dvc exp pull --run-cache origin "$EXP_NAME" &>/dev/null; then
  echo "# resuming interrupted experiment"
  dvc exp apply "$EXP_NAME"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run ...
else
  echo "# first time running experiment"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n "$EXP_NAME" ...
fi

@dberenbaum
Collaborator

  • exp run: specify an experiment name first time but not when resuming

One minor note: You should be able to use dvc exp run -n $EXP_NAME even when resuming experiments. It succeeds but generates a warning like WARNING: Ignoring option '--name exp-c1734' for resumed experiment. Existing experiment name will be preserved instead.

With that in mind, the workflow can be something like:

EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr
dvc exp pull --run-cache origin "$EXP_NAME" || true
dvc exp apply "$EXP_NAME" || true
DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n "$EXP_NAME" ...
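
One shell detail matters for the environment variables in this workflow: a variable only reaches the dvc process if it is exported or written as a prefix on the same command line. The snippet below is a generic illustration of that shell behavior, using a made-up variable DEMO_FLAG rather than any real DVC setting.

```shell
unset DEMO_FLAG   # DEMO_FLAG is a made-up name, not a DVC setting

# Bare assignment: a shell variable only, not passed on to child processes.
DEMO_FLAG=1
bare=$(sh -c 'echo "${DEMO_FLAG:-unset}"')

# Prefix assignment: placed in the environment of this one command.
unset DEMO_FLAG
prefixed=$(DEMO_FLAG=1 sh -c 'echo "${DEMO_FLAG:-unset}"')

echo "bare: $bare, prefixed: $prefixed"   # bare: unset, prefixed: 1
```

So DVC_EXP_AUTO_PUSH and DVC_EXP_GIT_REMOTE should either both sit in front of dvc exp run on one line or be set with export beforehand.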

@casperdcl
Contributor

also related: iterative/example-repos-dev#83 (comment)

@dberenbaum dberenbaum added the p2-medium label Feb 17, 2023
@skshetry skshetry removed the cml label May 12, 2023
@dberenbaum
Collaborator

Closing since checkpoints have been deprecated. For discussion about resuming experiments, see iterative/dvclive#505.

@dberenbaum dberenbaum closed this as not planned Dec 1, 2023