
feat: Added GPU node pool support for GKE, better logging, error and exception handling #31

Open · mrburke00 wants to merge 5 commits into main

Conversation

@mrburke00 commented Nov 14, 2024

  • Automatic handling of GPU resources

    • Added node selectors and tolerations for when a job requests a GPU
    • Ensured GPU resource requests and limits are set appropriately in the pod specification
  • Better logging

    • Logged the complete pod specification before pod creation for easier debugging
  • Better exception handling

    • Added exception handling around pod creation and job status checks to catch and log API exceptions
  • Resource requests and limit adjustments

    • Set both resource requests and limits for CPU, memory, and ephemeral storage to ensure correct scheduling
    • Adjusted CPU requests to account for the cpu_scalar
  • The resource definition now becomes:

resources:
    gpu=1,
    machine_type="g2-standard-8",
    mem_mb=8000,
    threads=8
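
For orientation, the mapping from such a request onto the Kubernetes pod spec looks roughly like the following sketch using the kubernetes Python client. The helper name apply_gpu_resources and the toleration operator/effect values are illustrative assumptions, not taken verbatim from the PR:

import kubernetes.client

def apply_gpu_resources(container, pod_spec, job_resources, logger):
    # Sketch of how a "gpu" resource is translated into the pod specification.
    if "gpu" in job_resources:
        gpu_count = str(job_resources["gpu"])
        # Schedule onto GPU nodes (GKE labels GPU node pools accordingly).
        pod_spec.node_selector = {"accelerator": "nvidia"}
        # Tolerate the taint that GKE places on GPU node pools.
        pod_spec.tolerations = [
            kubernetes.client.V1Toleration(
                key="nvidia.com/gpu",
                operator="Equal",
                value="present",
                effect="NoSchedule",
            )
        ]
        # Request and limit the extended GPU resource so the device plugin allocates it.
        container.resources.requests["nvidia.com/gpu"] = gpu_count
        container.resources.limits["nvidia.com/gpu"] = gpu_count
        logger.debug(f"Requested GPU resources: {gpu_count}")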

Summary by CodeRabbit

  • New Features
    • Introduced GPU resource requests for job specifications.
    • Added support for persistent volume mounts in job containers.
  • Improvements
    • Enhanced logging for job execution and status checks.
    • Improved error handling for Kubernetes API calls.
    • Refined job cancellation and shutdown processes for better resource management.

coderabbitai bot commented Nov 14, 2024

📝 Walkthrough

The changes in this pull request focus on enhancing the functionality of the Executor class within the snakemake_executor_plugin_kubernetes module. Key updates include the addition of GPU resource requests, improved error handling in job status checks, and enhanced logging for job execution. Furthermore, the ExecutorSettings dataclass has been updated to include a new field for persistent volumes, allowing for better resource management in Kubernetes job containers.

Changes

File: snakemake_executor_plugin_kubernetes/__init__.py

  • Enhanced Executor class with GPU resource requests and tolerations for GPU nodes.
  • Updated run_job method with detailed logging and node selector setup.
  • Improved error handling in check_active_jobs method with additional logging for API exceptions.
  • Modified ExecutorSettings to include persistent_volumes field and defined parsing functions.
  • Added PersistentVolume dataclass for handling persistent volume specifications.

Suggested reviewers

  • johanneskoester


@mrburke00 changed the title from "Added GPU node pool support for GKE, better logging, error and exception handling" to "feat:Added GPU node pool support for GKE, better logging, error and exception handling" on Nov 14, 2024
@mrburke00 changed the title from "feat:Added GPU node pool support for GKE, better logging, error and exception handling" to "feat: Added GPU node pool support for GKE, better logging, error and exception handling" on Nov 14, 2024

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
snakemake_executor_plugin_kubernetes/__init__.py (6)

189-189: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "machine_type" in job.resources.keys():
+if "machine_type" in job.resources:

260-260: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "mem_mb" in job.resources.keys():
+if "mem_mb" in job.resources:

267-267: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "disk_mb" in job.resources.keys():
+if "disk_mb" in job.resources:

273-273: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "gpu" in job.resources.keys():
+if "gpu" in job.resources:

302-304: Preserve exception context when re-raising

When re-raising an exception in an except block, it's best practice to preserve the original exception context by using from e.

Apply this diff to maintain the exception context:

 except kubernetes.client.rest.ApiException as e:
     self.logger.error(f"Failed to create pod: {e}")
-    raise WorkflowError(f"Failed to create pod: {e}")
+    raise WorkflowError(f"Failed to create pod: {e}") from e

193-196: Consider making node selector labels configurable

Hardcoding the node selector label "accelerator": "nvidia" may not be flexible for different cluster configurations. Consider making the node selector labels for GPU nodes configurable to accommodate various environments.

This will enhance the portability of your code across different Kubernetes clusters.
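
As a hedged sketch, the label could be exposed through a new ExecutorSettings field. The gpu_node_selector name, its default value, and the import path are assumptions, not part of the current code:

from dataclasses import dataclass, field
from typing import Optional

from snakemake_interface_executor_plugins.settings import ExecutorSettingsBase


@dataclass
class ExecutorSettings(ExecutorSettingsBase):
    # Hypothetical field: "key=value" node label used to schedule GPU jobs.
    gpu_node_selector: Optional[str] = field(
        default="accelerator=nvidia",
        metadata={
            "help": "Node label (key=value) used to select GPU nodes for jobs "
            "that request the gpu resource.",
            "env_var": False,
            "required": False,
        },
    )

In run_job, the node selector dict could then be built by splitting this setting on "=" instead of hardcoding the "accelerator": "nvidia" pair.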

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c15750a and 91e6a32.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
snakemake_executor_plugin_kubernetes/__init__.py

187-187: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


260-260: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


267-267: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


273-273: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


304-304: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

205-217: Confirm toleration for GPU nodes is appropriate

The toleration is added for key "nvidia.com/gpu" with value "present". Please confirm that this toleration aligns with the taints applied to GPU nodes in your Kubernetes cluster.

Run the following script to list node taints and verify GPU node taints:

#!/bin/bash
# Description: List taints on nodes to verify the correct taint for GPU nodes

# List all nodes with their taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'


@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
snakemake_executor_plugin_kubernetes/__init__.py (4)

179-183: Consider adding mount path validation

The persistent volume mount implementation could benefit from path validation to ensure the paths are absolute and don't contain problematic characters.

 for pvc in self.persistent_volumes:
+    path_str = str(pvc.path)
+    if not path_str.startswith('/'):
+        raise ValueError(f"Mount path must be absolute: {path_str}")
     container.volume_mounts.append(
-        kubernetes.client.V1VolumeMount(name=pvc.name, mount_path=str(pvc.path))
+        kubernetes.client.V1VolumeMount(name=pvc.name, mount_path=path_str)
     )

247-278: Add validation for resource values

While the resource allocation logic is correct, consider adding validation to ensure resource values are within reasonable bounds.

+def validate_resource_value(value: int, resource_type: str, min_value: int = 1):
+    if value < min_value:
+        raise ValueError(f"{resource_type} request must be >= {min_value}, got {value}")

 # CPU and memory requests
+validate_resource_value(int(job.resources["_cores"]), "CPU cores")
 container.resources.requests["cpu"] = "{}m".format(
     int(job.resources["_cores"] * self.k8s_cpu_scalar * 1000)
 )

 if "mem_mb" in job.resources.keys():
+    validate_resource_value(job.resources["mem_mb"], "Memory (MB)")
     container.resources.requests["memory"] = "{}M".format(
         job.resources["mem_mb"]
     )

 if "gpu" in job.resources.keys():
+    validate_resource_value(int(job.resources["gpu"]), "GPU count")
     gpu_count = str(job.resources["gpu"])

Line range hint 349-391: Consider using structured logging for better observability

The error handling and status reporting look good, but consider using structured logging for better parsing and monitoring.

-self.logger.error(f"ApiException when checking pod status: {e}")
+self.logger.error("Pod status check failed", extra={
+    "error_type": "ApiException",
+    "pod_id": j.external_jobid,
+    "error_message": str(e),
+    "status_code": e.status
+})

-self.logger.info(f"Job {j.external_jobid} succeeded.")
+self.logger.info("Job completed successfully", extra={
+    "pod_id": j.external_jobid,
+    "duration": res.status.container_statuses[0].state.terminated.finished_at - 
+               res.status.container_statuses[0].state.terminated.started_at
+})

187-187: Optimize dictionary key lookups

Replace key in dict.keys() with key in dict for better performance.

-if "machine_type" in job.resources.keys():
+if "machine_type" in job.resources:

-if "mem_mb" in job.resources.keys():
+if "mem_mb" in job.resources:

-if "disk_mb" in job.resources.keys():
+if "disk_mb" in job.resources:

-if "gpu" in job.resources.keys():
+if "gpu" in job.resources:

Also applies to: 260-260, 267-267, 273-273


📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 91e6a32 and efa69ee.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
snakemake_executor_plugin_kubernetes/__init__.py

187-187: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


260-260: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


267-267: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


273-273: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


304-304: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

192-216: GPU node selection implementation looks good!

The implementation correctly:

  1. Sets the node selector using the standard nvidia.com/gpu label
  2. Adds appropriate toleration for GPU nodes
  3. Includes helpful debug logging

@johanneskoester (Contributor) left a comment

Fantastic!

Comment on lines +275 to +277
container.resources.requests["nvidia.com/gpu"] = gpu_count
container.resources.limits["nvidia.com/gpu"] = gpu_count
self.logger.debug(f"Requested GPU resources: {gpu_count}")
@johanneskoester (Contributor):

What about non-NVIDIA GPUs? Maybe infer this from a gpu_model: str resource, which could allow values like nvidia or, more specifically, tesla, etc.?

@mrburke00 (Author):

Well, I already specify the machine type using machine_type="g2-standard-8". But I also believe that with GKE the node pool inherently specifies "nvidia.com/gpu"; without it, node scaling doesn't get triggered.

@johanneskoester (Contributor):

Ok, but this plugin is meant to be generic, so we should try to ensure that added functionality works with any k8s cluster.

@mrburke00 (Author) commented Jan 9, 2025:

I've tested this method with every available GPU type offered via GKE. It's agnostic to GPU type and only looks at the machine type.

See the GKE node pool description below. I'm selecting on machine type, not accelerator type. I can't think of any circumstance where you would select for an accelerator; you should be doing that via your Terraform script or however else you build the cluster outside of Snakemake.

autoscaling:
  enabled: true
  locationPolicy: BALANCED
  maxNodeCount: 5
config:
  accelerators:
  - acceleratorCount: '1'
    acceleratorType: nvidia-tesla-t4
    gpuDriverInstallationConfig:
      gpuDriverVersion: LATEST
  labels:
    accelerator: nvidia
    role: kubernetes
  loggingConfig: {}
  machineType: n1-standard-16

It's fine if you don't want to include this, but at least change the documentation to accurately reflect the fact that there is no GPU support offered via this plugin. It took me a lot of dev time to reach that realization, and I'm sure I'm not the only one; it sort of defeats the purpose of using K8s in the first place. Also, BTW, the Google Life Sciences API is deprecated, so Snakemake really has no native GPU resource support as it is.

@johanneskoester (Contributor):

Sorry, I did not mean to put this down in any way. I just try to keep the k8s plugin independent of GKE, and I thought that nvidia.com/gpu was very GKE-specific and that we might be able to offer a more generic implementation. As you are the expert here and not me, I might be getting it completely wrong though. Maybe we could schedule a short video meeting to discuss the details and help me understand the problem? Just send me an email to [email protected].

@johanneskoester (Contributor) commented Jan 15, 2025:

I have thought further about this; maybe I understand it a bit better now. First, there seems to be an AMD device plugin for k8s that handles GPU resources similarly: https://github.com/ROCm/k8s-device-plugin; also see https://github.com/ROCm/k8s-device-plugin/blob/master/example/pod/alexnet-gpu.yaml.
Second, I agree that the GPU model is usually something that is controlled by the machine type, and it does not seem necessary to control the model itself via the workflow. What might make sense, though, is to control the manufacturer via the workflow: e.g., if you have a rule that uses CUDA, you will want to annotate in your workflow that it requires an NVIDIA GPU.

So, maybe the following would work:
We suggest two standard resources, gpu=... and gpu_manufacturer=..., where the latter is optional and might usually just be set via --default-resources gpu_manufacturer=nvidia or gpu_manufacturer=amd, depending on what kind of k8s cluster one has.
Then, in this plugin, we use nvidia.com/gpu when the job has the resource gpu_manufacturer=nvidia and amd.com/gpu in the AMD case. If the requested gpu_manufacturer is unsupported, the plugin can throw an error.

Does that make sense, or do I miss something?
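
A minimal sketch of that mapping inside run_job could look like this. It assumes the container object, the job.resources access pattern, and the WorkflowError already used in this PR; gpu_manufacturer support itself is only a proposal at this point, so the names below are illustrative:

def gpu_resource_key(job_resources):
    # Map the proposed gpu_manufacturer resource to a k8s extended resource key.
    keys = {"nvidia": "nvidia.com/gpu", "amd": "amd.com/gpu"}
    manufacturer = "nvidia"
    if "gpu_manufacturer" in job_resources:
        manufacturer = str(job_resources["gpu_manufacturer"]).lower()
    if manufacturer not in keys:
        raise WorkflowError(
            f"Unsupported gpu_manufacturer {manufacturer!r}; "
            f"supported values are: {', '.join(keys)}"
        )
    return keys[manufacturer]

# In run_job, where the GPU request is built:
if "gpu" in job.resources:
    gpu_count = str(job.resources["gpu"])
    resource_key = gpu_resource_key(job.resources)
    container.resources.requests[resource_key] = gpu_count
    container.resources.limits[resource_key] = gpu_count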

@mrburke00 (Author):

Yeah I can shoot you an email.

And yes, that makes sense to me. Although it was only six months ago that a toolkit was established to allow AMD GPUs to execute CUDA code, which is what every scientific/ML tool uses for GPU acceleration. So the use cases for specifying a GPU manufacturer might be pretty rare. It doesn't even appear that GKE offers AMD GPUs.

The other way to look at this: I'd imagine anyone getting this far is going to have a fairly advanced understanding of K8s and Terraform, so we could just establish enough resource specifications to let users construct the node selector themselves. I believe the current behavior is that the scheduler just chooses the node with the minimum required computational resources, which is fine for most scenarios, but you could imagine a scenario where you want to choose a very specific node in a large cluster regardless of whether it has a GPU or not.
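
For illustration, letting users build the node selector from a workflow resource could look like this sketch. The node_selector resource name and the pod_spec variable are hypothetical, not something this PR implements:

# Hypothetical resource, e.g. in a rule:
#   resources: node_selector="cloud.google.com/gke-nodepool=gpu-pool,disktype=ssd"
if "node_selector" in job.resources:
    selectors = {}
    for pair in str(job.resources["node_selector"]).split(","):
        key, value = pair.split("=", 1)
        selectors[key.strip()] = value.strip()
    # pod_spec stands for the V1PodSpec being assembled in run_job.
    pod_spec.node_selector = selectors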
