feat: Added GPU node pool support for GKE, better logging, error and exception handling #31
Conversation
📝 Walkthrough
The changes in this pull request focus on enhancing the functionality of the Kubernetes executor plugin.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
snakemake_executor_plugin_kubernetes/__init__.py (6)
189-189: Simplify dictionary key existence check

You can simplify the condition by removing the `.keys()` call when checking if a key exists in a dictionary. Apply this diff to simplify the key check:

```diff
-if "machine_type" in job.resources.keys():
+if "machine_type" in job.resources:
```
260-260: Simplify dictionary key existence check

You can simplify the condition by removing the `.keys()` call when checking if a key exists in a dictionary. Apply this diff to simplify the key check:

```diff
-if "mem_mb" in job.resources.keys():
+if "mem_mb" in job.resources:
```

🧰 Tools: 🪛 Ruff 260-260: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
267-267: Simplify dictionary key existence check

You can simplify the condition by removing the `.keys()` call when checking if a key exists in a dictionary. Apply this diff to simplify the key check:

```diff
-if "disk_mb" in job.resources.keys():
+if "disk_mb" in job.resources:
```

🧰 Tools: 🪛 Ruff 267-267: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
273-273: Simplify dictionary key existence check

You can simplify the condition by removing the `.keys()` call when checking if a key exists in a dictionary. Apply this diff to simplify the key check:

```diff
-if "gpu" in job.resources.keys():
+if "gpu" in job.resources:
```

🧰 Tools: 🪛 Ruff 273-273: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
302-304: Preserve exception context when re-raising

When re-raising an exception in an `except` block, it is best practice to preserve the original exception context by using `from e`. Apply this diff to maintain the exception context:

```diff
 except kubernetes.client.rest.ApiException as e:
     self.logger.error(f"Failed to create pod: {e}")
-    raise WorkflowError(f"Failed to create pod: {e}")
+    raise WorkflowError(f"Failed to create pod: {e}") from e
```

🧰 Tools: 🪛 Ruff 304-304: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
193-196: Consider making node selector labels configurable

Hardcoding the node selector label `"accelerator": "nvidia"` may not be flexible for different cluster configurations. Consider making the node selector labels for GPU nodes configurable to accommodate various environments. This will enhance the portability of your code across different Kubernetes clusters.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📒 Files selected for processing (1)

- snakemake_executor_plugin_kubernetes/__init__.py (9 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

snakemake_executor_plugin_kubernetes/__init__.py (1)

Pattern `**/*.py`:
- Do not try to improve formatting.
- Do not suggest type annotations for functions that are defined inside of functions or methods.
- Do not suggest type annotation of the `self` argument of methods.
- Do not suggest type annotation of the `cls` argument of classmethods.
- Do not suggest return type annotation if a function or method does not contain a `return` statement.
🪛 Ruff

snakemake_executor_plugin_kubernetes/__init__.py

- 187-187: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 260-260: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 267-267: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 273-273: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 304-304: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
205-217: Confirm toleration for GPU nodes is appropriate

The toleration is added for key `"nvidia.com/gpu"` with value `"present"`. Please confirm that this toleration aligns with the taints applied to GPU nodes in your Kubernetes cluster.

Run the following script to list node taints and verify GPU node taints:

```bash
#!/bin/bash
# Description: List taints on nodes to verify the correct taint for GPU nodes

# List all nodes with their taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
```
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
snakemake_executor_plugin_kubernetes/__init__.py (4)
179-183: Consider adding mount path validation

The persistent volume mount implementation could benefit from path validation to ensure the paths are absolute and don't contain problematic characters.

```diff
 for pvc in self.persistent_volumes:
+    path_str = str(pvc.path)
+    if not path_str.startswith('/'):
+        raise ValueError(f"Mount path must be absolute: {path_str}")
     container.volume_mounts.append(
-        kubernetes.client.V1VolumeMount(name=pvc.name, mount_path=str(pvc.path))
+        kubernetes.client.V1VolumeMount(name=pvc.name, mount_path=path_str)
     )
```
247-278: Add validation for resource values

While the resource allocation logic is correct, consider adding validation to ensure resource values are within reasonable bounds.

```diff
+def validate_resource_value(value: int, resource_type: str, min_value: int = 1):
+    if value < min_value:
+        raise ValueError(f"{resource_type} request must be >= {min_value}, got {value}")

 # CPU and memory requests
+validate_resource_value(int(job.resources["_cores"]), "CPU cores")
 container.resources.requests["cpu"] = "{}m".format(
     int(job.resources["_cores"] * self.k8s_cpu_scalar * 1000)
 )
 if "mem_mb" in job.resources.keys():
+    validate_resource_value(job.resources["mem_mb"], "Memory (MB)")
     container.resources.requests["memory"] = "{}M".format(
         job.resources["mem_mb"]
     )
 if "gpu" in job.resources.keys():
+    validate_resource_value(int(job.resources["gpu"]), "GPU count")
     gpu_count = str(job.resources["gpu"])
```

🧰 Tools: 🪛 Ruff 260-260, 267-267, 273-273: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Line range hint 349-391: Consider using structured logging for better observability

The error handling and status reporting look good, but consider using structured logging for better parsing and monitoring.

```diff
-self.logger.error(f"ApiException when checking pod status: {e}")
+self.logger.error("Pod status check failed", extra={
+    "error_type": "ApiException",
+    "pod_id": j.external_jobid,
+    "error_message": str(e),
+    "status_code": e.status
+})

-self.logger.info(f"Job {j.external_jobid} succeeded.")
+self.logger.info("Job completed successfully", extra={
+    "pod_id": j.external_jobid,
+    "duration": res.status.container_statuses[0].state.terminated.finished_at -
+                res.status.container_statuses[0].state.terminated.started_at
+})
```
187-187: Optimize dictionary key lookups

Replace `key in dict.keys()` with `key in dict` for better performance.

```diff
-if "machine_type" in job.resources.keys():
+if "machine_type" in job.resources:
-if "mem_mb" in job.resources.keys():
+if "mem_mb" in job.resources:
-if "disk_mb" in job.resources.keys():
+if "disk_mb" in job.resources:
-if "gpu" in job.resources.keys():
+if "gpu" in job.resources:
```

Also applies to: 260-260, 267-267, 273-273

🧰 Tools: 🪛 Ruff 187-187: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
192-216: GPU node selection implementation looks good!

The implementation correctly:

- Sets the node selector using the standard `nvidia.com/gpu` label
- Adds appropriate toleration for GPU nodes
- Includes helpful debug logging
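For orientation, here is a minimal, hedged sketch of how such a node selector and toleration can be attached to a pod spec with the Kubernetes Python client. The label `accelerator: nvidia` and the `nvidia.com/gpu=present` toleration are the values discussed in this review; the surrounding object construction is illustrative, not the plugin's actual code:

```python
import kubernetes.client

# Illustrative fragment only; the plugin assembles its pod spec elsewhere.
spec = kubernetes.client.V1PodSpec(
    containers=[],  # container definition omitted for brevity
    node_selector={"accelerator": "nvidia"},  # GPU node pool label from this discussion
    tolerations=[
        kubernetes.client.V1Toleration(
            key="nvidia.com/gpu",
            operator="Equal",
            value="present",
            effect="NoSchedule",
        )
    ],
)
print(spec.node_selector, spec.tolerations[0].key)
```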
Fantastic!
```python
container.resources.requests["nvidia.com/gpu"] = gpu_count
container.resources.limits["nvidia.com/gpu"] = gpu_count
self.logger.debug(f"Requested GPU resources: {gpu_count}")
```
What about non-NVIDIA GPUs? Maybe infer this from a `gpu_model: str` resource, which could allow types like `nvidia` or, more specifically, `tesla` etc.?
Well, I specify the machine type already using `machine_type="g2-standard-8"`. But also I believe with GKE the node pool inherently specifies `"nvidia.com/gpu"`; without it the node scaling doesn't get triggered.
ok, but this plugin is meant to be generic, so we should try to ensure that added functionality works with any k8s cluster
I've tested this method with every available GPU type offered via GKE. It's agnostic to GPU type and only looks for machine type.
See the below K8 pod description. I'm selecting for machine type and not accelerator type. I can't think of any circumstance where you would select for an accelerator. You should be doing that via your Terraform script or however else you build the cluster outside of Snakemake.
```yaml
autoscaling:
  enabled: true
  locationPolicy: BALANCED
  maxNodeCount: 5
config:
  accelerators:
  - acceleratorCount: '1'
    acceleratorType: nvidia-tesla-t4
    gpuDriverInstallationConfig:
      gpuDriverVersion: LATEST
  labels:
    accelerator: nvidia
    role: kubernetes
  loggingConfig: {}
  machineType: n1-standard-16
```
It's fine if you don't want to include this, but at least change the documentation to accurately reflect the fact that there is no GPU support offered via this plugin. It took me a lot of dev time to get to the point where I came to this realization, and I'm sure I'm not the only one; it sort of defeats the purpose of using K8 in the first place. Also, BTW, the Google Life Sciences API is deprecated, so Snakemake really has no native GPU resource support as it is.
Sorry, I did not mean to put this down in any way. I just try to keep the k8s plugin independent of GKE. And I thought that this `nvidia.com/gpu` is very GKE-specific and that we might be able to offer a more generic implementation. As you are the expert here and not me, I might get it completely wrong though. Maybe we could schedule a short video meeting to discuss the details and help me understand the problem? Just send me an email to [email protected].
I have thought further about this, maybe I understand a bit better now. First, there seems to be an AMD plugin for k8s that handles gpu resources similarly: https://github.com/ROCm/k8s-device-plugin, also see https://github.com/ROCm/k8s-device-plugin/blob/master/example/pod/alexnet-gpu.yaml.
Second, I agree that the gpu model is maybe usually something that is controlled by the machine type. Also, it does not seem necessary to control the model itself via the workflow. What might make sense though is to control the manufacturer via the workflow, e.g. if you have a rule that uses cuda, you will want to annotate in your workflow that it requires an nvidia gpu.
So, maybe the following would work:

We suggest two standard resources, `gpu=...` and `gpu_manufacturer=...`, where the latter is optional and might usually just be set via `--default-resources gpu_manufacturer=nvidia` or `gpu_manufacturer=amd`, depending on what kind of k8s cluster one has.

Then, in this plugin, we use `nvidia.com/gpu` in case the job has a resource `gpu_manufacturer=nvidia`, and `amd.com/gpu` in the AMD case. If the requested gpu_manufacturer is unsupported, the plugin can throw an error.
Does that make sense, or do I miss something?
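A minimal sketch of what that mapping could look like inside the plugin. The names `GPU_RESOURCE_KEYS` and `gpu_resource_key`, and the default manufacturer, are illustrative assumptions rather than part of this PR:

```python
# Hypothetical sketch of the proposed gpu_manufacturer handling; not part of this PR.
GPU_RESOURCE_KEYS = {
    "nvidia": "nvidia.com/gpu",  # NVIDIA device plugin resource name
    "amd": "amd.com/gpu",        # AMD (ROCm) device plugin resource name
}

def gpu_resource_key(resources: dict) -> str:
    # The default shown here is an assumption; it could also come from
    # --default-resources as suggested above.
    manufacturer = str(resources.get("gpu_manufacturer", "nvidia")).lower()
    try:
        return GPU_RESOURCE_KEYS[manufacturer]
    except KeyError:
        # The plugin would likely raise WorkflowError here instead.
        raise ValueError(
            f"Unsupported gpu_manufacturer {manufacturer!r}; "
            f"expected one of {sorted(GPU_RESOURCE_KEYS)}"
        ) from None

# Roughly how the executor could use it:
# key = gpu_resource_key(job.resources)
# container.resources.requests[key] = gpu_count
# container.resources.limits[key] = gpu_count
```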
Yeah I can shoot you an email.
And yes, that makes sense to me. Although it was only 6 months ago that a toolkit was established to allow AMD GPUs to execute CUDA code, which is what every scientific/ML tool uses for GPU acceleration. So the use cases for specifying GPU manufacturer might be pretty rare. It doesn't even appear that GKE offers AMD GPUs.
The other way to look at this is that I'd imagine anyone getting this far is going to have a fairly advanced understanding of K8 and Terraform. So we could just establish enough resource specifications for them to construct the node selector themselves. I believe the current behavior is that the scheduler just chooses the node with the minimum required computational resources, which is fine for most scenarios. But you could imagine a scenario where you want to choose a very specific node in a large cluster, regardless of whether it has a GPU or not.
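As a hedged illustration of that last idea, a per-rule `node_selector` resource (a hypothetical name, not implemented in this PR) could be parsed into pod-level nodeSelector labels roughly like this:

```python
def node_selector_from_resource(spec: str) -> dict:
    """Turn 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'} for pod.spec.node_selector."""
    return dict(pair.split("=", 1) for pair in spec.split(","))

# Example with a hypothetical resource value a rule might declare:
labels = node_selector_from_resource(
    "cloud.google.com/gke-nodepool=gpu-pool,accelerator=nvidia"
)
print(labels)  # {'cloud.google.com/gke-nodepool': 'gpu-pool', 'accelerator': 'nvidia'}
```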
- Automatic handling of GPU resources
- Better logging
- Better exception handling
- Resource requests and limit adjustments (`cpu_scalar`)
- Resource definition now becomes (see the sketch below)
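A hypothetical Snakefile fragment showing what such a resource definition could look like; the rule name, values, and shell command are illustrative, while the resource keys (`machine_type`, `gpu`, `mem_mb`, `disk_mb`) are the ones discussed in this PR:

```python
rule train_model:
    output:
        "model.pt"
    resources:
        machine_type="g2-standard-8",  # GKE machine type for the GPU node pool
        gpu=1,                         # requested as nvidia.com/gpu on the pod
        mem_mb=16000,
        disk_mb=50000
    shell:
        "python train.py --out {output}"
```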