Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: preemption #25

Merged
merged 10 commits into from
Jan 19, 2024
Merged

feat: preemption #25

merged 10 commits into from
Jan 19, 2024

Conversation

vsoch
Copy link
Collaborator

@vsoch vsoch commented Jan 3, 2024

This is a WIP because there is a bug #24 with what looks to be the google storage plugin installed. Ping @johanneskoester I need to get #24 fixed before can proceed with more features here. Happy New Year!

vsoch added 3 commits January 3, 2024 15:40
Signed-off-by: vsoch <[email protected]>
still having an issue with not finding the gs flags

Signed-off-by: vsoch <[email protected]>
Copy link
Collaborator

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snakemake has now preemption parameters built in the main codebase. Hence, I don't understand why you define those options in this executor in addition. Can you clarify? Why don't you use self.workflow.remote_execution_settings.preemptible_rules.is_preemptible(job.rule.name)?

@vsoch
Copy link
Collaborator Author

vsoch commented Jan 12, 2024

I used the exact logic from the previous snakemake life sciences module, so if that is the case, it wasn't updated to support this design. Can you point me to a plugin that is using preemtible correctly?

Also note that I'm blocked from working on this not because of any issue with that, but because the snakemake command is telling me I'm missing a family of storage "gs" args.

@johanneskoester
Copy link
Collaborator

There isn't any yet. But in principle it boils down to check whether the job is supposed to be preemptible with self.workflow.remote_execution_settings.preemptible_rules.is_preemptible(job.rule.name) and then pass this info to google batch. Further, the setting self.workflow.remote_execution_settings.preemptible_retries defines the number of retries.

@vsoch
Copy link
Collaborator Author

vsoch commented Jan 15, 2024

@johanneskoester when I remove all my custom logic (and allow snakemake to install plugins it needs) I reproduce the error - it is looking for both gcs and gs. This is installing snakemake from the main branch. I can add the install of the gcs plugin, but then I'll be where I was before, looking for the gs plugin.
image

@vsoch
Copy link
Collaborator Author

vsoch commented Jan 15, 2024

ah! Just found it in an environment locally - will test removing here and seeing if that somehow (magically) removes it from the remote. That doesn't make sense, but if snakemake is getting the plugins from my local call, it actually would! Will report back.

@vsoch
Copy link
Collaborator Author

vsoch commented Jan 15, 2024

Looks like there is a bug with preemptible_rules in cli.py:

 snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3 --default-storage-prefix s3://snakemake-testing-llnl --preemptible-rules hello
Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 1906, in args_to_api
    if not preemptible_rules:
           ^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'preemptible_rules' where it is not associated with a value

Fix here: snakemake/snakemake#2616

we need to use the snakemake provided preemtible args
and not write custom ones here.

Signed-off-by: vsoch <[email protected]>
echo "I am batch index ${BATCH_TASK_INDEX}"
export PATH=/opt/conda/bin:${PATH}
if [ $BATCH_TASK_INDEX = 0 ] && [ ! -d "/opt/conda" ] ; then
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about best practice here - if we should install snakemake just to the main executor (running the workflow) or as I've done, allow installing to all. For the other jobs (that ran OK) I was just installing to index 0. We can run a test again after this and see if the previous are still working OK. Intuitively unless there is an obvious shared filesystem (there isn't at /opt) I would want snakemake installed across nodes.

policy = self.get_allocation_policy(job)

# If we have preemption for the job and retry, update task retries with it
retries = self.workflow.remote_execution_settings.preemptible_retries
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task has retries, and I figure that parameter corresponds to "preemtible retries" when that is set. Note that without it, it's just the number of retries regardless of preemptible.

vsoch added 2 commits January 14, 2024 19:59
the logic for checking the accuracy of the preemptible rules
(that if one is set but not the others, and raising a workflow
error if this is off) should belong upstream as this is no
longer a parameter owned by googlebatch.

Signed-off-by: vsoch <[email protected]>
@vsoch
Copy link
Collaborator Author

vsoch commented Jan 15, 2024

Also with the fix for the gs, the jobs are (finally) green! I won't show you how many red / failed there are, let along that it takes 7 minutes per one step run for a hello world... 😬
image

@vsoch vsoch requested a review from johanneskoester January 15, 2024 03:34
@johanneskoester johanneskoester merged commit d6913a1 into main Jan 19, 2024
4 checks passed
@johanneskoester johanneskoester deleted the add-preemtible branch January 19, 2024 17:23
vsoch pushed a commit that referenced this pull request Feb 15, 2024
🤖 I have created a release *beep* *boop*
---

##
[0.3.0](v0.2.0...v0.3.0)
(2024-02-15)


### Features

* add simple example with cos
([#22](#22))
([2951454](2951454))
* add support for Google Batch GPUs
([#26](#26))
([f2af21c](f2af21c))
* expose network policy interfaces
([#28](#28))
([41c8d44](41c8d44))
* preemption
([#25](#25))
([d6913a1](d6913a1))
* support for boot disk type, size, and image
([#27](#27))
([d5de5a1](d5de5a1))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants