
feat: Added GPU node pool support for GKE, better logging, error and exception handling #31

Open · mrburke00 wants to merge 5 commits into main

Conversation

@mrburke00 commented Nov 14, 2024

  • Automatic handling of GPU resources

    • Added node selectors and tolerations for when a job requests a GPU
    • Ensured GPU resource requests and limits are set appropriately in the pod specification
  • Better logging

    • Logged the complete pod specification before pod creation for easier debugging
  • Better exception handling

    • Added exception handling around pod creation and job status checks to catch and log API exceptions
  • Resource requests and limit adjustments

    • Set both resource requests and limits for CPU, memory, and ephemeral storage to ensure correct scheduling
    • Adjusted CPU requests to account for the cpu_scalar
  • The resource definition now becomes:

resources:
    gpu=1,
    machine_type="g2-standard-8",
    mem_mb=8000,
    threads=8
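
For orientation, the mapping from such a request onto the Kubernetes pod spec looks roughly like the following sketch using the kubernetes Python client. The helper name apply_gpu_resources and the toleration operator/effect values are illustrative assumptions, not taken verbatim from the PR:

import kubernetes.client

def apply_gpu_resources(container, pod_spec, job_resources, logger):
    # Sketch of how a "gpu" resource is translated into the pod specification.
    if "gpu" in job_resources:
        gpu_count = str(job_resources["gpu"])
        # Schedule onto GPU nodes (GKE labels GPU node pools accordingly).
        pod_spec.node_selector = {"accelerator": "nvidia"}
        # Tolerate the taint that GKE places on GPU node pools.
        pod_spec.tolerations = [
            kubernetes.client.V1Toleration(
                key="nvidia.com/gpu",
                operator="Equal",
                value="present",
                effect="NoSchedule",
            )
        ]
        # Request and limit the extended GPU resource so the device plugin allocates it.
        container.resources.requests["nvidia.com/gpu"] = gpu_count
        container.resources.limits["nvidia.com/gpu"] = gpu_count
        logger.debug(f"Requested GPU resources: {gpu_count}")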

Summary by CodeRabbit

  • New Features
    • Introduced GPU resource requests for job specifications.
    • Added support for persistent volume mounts in job containers.
  • Improvements
    • Enhanced logging for job execution and status checks.
    • Improved error handling for Kubernetes API calls.
    • Refined job cancellation and shutdown processes for better resource management.

coderabbitai bot commented Nov 14, 2024

📝 Walkthrough

The changes in this pull request focus on enhancing the functionality of the Executor class within the snakemake_executor_plugin_kubernetes module. Key updates include the addition of GPU resource requests, improved error handling in job status checks, and enhanced logging for job execution. Furthermore, the ExecutorSettings dataclass has been updated to include a new field for persistent volumes, allowing for better resource management in Kubernetes job containers.

Changes

File: snakemake_executor_plugin_kubernetes/__init__.py

  • Enhanced Executor class with GPU resource requests and tolerations for GPU nodes.
  • Updated run_job method with detailed logging and node selector setup.
  • Improved error handling in check_active_jobs method with additional logging for API exceptions.
  • Modified ExecutorSettings to include persistent_volumes field and defined parsing functions.
  • Added PersistentVolume dataclass for handling persistent volume specifications.

Suggested reviewers

  • johanneskoester


@mrburke00 changed the title from "Added GPU node pool support for GKE, better logging, error and exception handling" to "feat:Added GPU node pool support for GKE, better logging, error and exception handling" on Nov 14, 2024
@mrburke00 changed the title from "feat:Added GPU node pool support for GKE, better logging, error and exception handling" to "feat: Added GPU node pool support for GKE, better logging, error and exception handling" on Nov 14, 2024

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
snakemake_executor_plugin_kubernetes/__init__.py (6)

189-189: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "machine_type" in job.resources.keys():
+if "machine_type" in job.resources:

260-260: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "mem_mb" in job.resources.keys():
+if "mem_mb" in job.resources:

267-267: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "disk_mb" in job.resources.keys():
+if "disk_mb" in job.resources:

273-273: Simplify dictionary key existence check

You can simplify the condition by removing the .keys() call when checking if a key exists in a dictionary.

Apply this diff to simplify the key check:

-if "gpu" in job.resources.keys():
+if "gpu" in job.resources:

302-304: Preserve exception context when re-raising

When re-raising an exception in an except block, it's best practice to preserve the original exception context by using from e.

Apply this diff to maintain the exception context:

 except kubernetes.client.rest.ApiException as e:
     self.logger.error(f"Failed to create pod: {e}")
-    raise WorkflowError(f"Failed to create pod: {e}")
+    raise WorkflowError(f"Failed to create pod: {e}") from e

193-196: Consider making node selector labels configurable

Hardcoding the node selector label "accelerator": "nvidia" may not be flexible for different cluster configurations. Consider making the node selector labels for GPU nodes configurable to accommodate various environments.

This will enhance the portability of your code across different Kubernetes clusters.
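
As a hedged sketch, the label could be exposed through a new ExecutorSettings field. The gpu_node_selector name, its default value, and the import path are assumptions, not part of the current code:

from dataclasses import dataclass, field
from typing import Optional

from snakemake_interface_executor_plugins.settings import ExecutorSettingsBase


@dataclass
class ExecutorSettings(ExecutorSettingsBase):
    # Hypothetical field: "key=value" node label used to schedule GPU jobs.
    gpu_node_selector: Optional[str] = field(
        default="accelerator=nvidia",
        metadata={
            "help": "Node label (key=value) used to select GPU nodes for jobs "
            "that request the gpu resource.",
            "env_var": False,
            "required": False,
        },
    )

In run_job, the node selector dict could then be built by splitting this setting on "=" instead of hardcoding the "accelerator": "nvidia" pair.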

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c15750a and 91e6a32.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
snakemake_executor_plugin_kubernetes/__init__.py

187-187: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


260-260: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


267-267: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


273-273: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


304-304: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

205-217: Confirm toleration for GPU nodes is appropriate

The toleration is added for key "nvidia.com/gpu" with value "present". Please confirm that this toleration aligns with the taints applied to GPU nodes in your Kubernetes cluster.

Run the following script to list node taints and verify GPU node taints:

#!/bin/bash
# Description: List taints on nodes to verify the correct taint for GPU nodes

# List all nodes with their taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'


@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
snakemake_executor_plugin_kubernetes/__init__.py (4)

179-183: Consider adding mount path validation

The persistent volume mount implementation could benefit from path validation to ensure the paths are absolute and don't contain problematic characters.

 for pvc in self.persistent_volumes:
+    path_str = str(pvc.path)
+    if not path_str.startswith('/'):
+        raise ValueError(f"Mount path must be absolute: {path_str}")
     container.volume_mounts.append(
-        kubernetes.client.V1VolumeMount(name=pvc.name, mount_path=str(pvc.path))
+        kubernetes.client.V1VolumeMount(name=pvc.name, mount_path=path_str)
     )

247-278: Add validation for resource values

While the resource allocation logic is correct, consider adding validation to ensure resource values are within reasonable bounds.

+def validate_resource_value(value: int, resource_type: str, min_value: int = 1):
+    if value < min_value:
+        raise ValueError(f"{resource_type} request must be >= {min_value}, got {value}")

 # CPU and memory requests
+validate_resource_value(int(job.resources["_cores"]), "CPU cores")
 container.resources.requests["cpu"] = "{}m".format(
     int(job.resources["_cores"] * self.k8s_cpu_scalar * 1000)
 )

 if "mem_mb" in job.resources.keys():
+    validate_resource_value(job.resources["mem_mb"], "Memory (MB)")
     container.resources.requests["memory"] = "{}M".format(
         job.resources["mem_mb"]
     )

 if "gpu" in job.resources.keys():
+    validate_resource_value(int(job.resources["gpu"]), "GPU count")
     gpu_count = str(job.resources["gpu"])

Line range hint 349-391: Consider using structured logging for better observability

The error handling and status reporting look good, but consider using structured logging for better parsing and monitoring.

-self.logger.error(f"ApiException when checking pod status: {e}")
+self.logger.error("Pod status check failed", extra={
+    "error_type": "ApiException",
+    "pod_id": j.external_jobid,
+    "error_message": str(e),
+    "status_code": e.status
+})

-self.logger.info(f"Job {j.external_jobid} succeeded.")
+self.logger.info("Job completed successfully", extra={
+    "pod_id": j.external_jobid,
+    "duration": res.status.container_statuses[0].state.terminated.finished_at - 
+               res.status.container_statuses[0].state.terminated.started_at
+})

187-187: Optimize dictionary key lookups

Replace key in dict.keys() with key in dict for better performance.

-if "machine_type" in job.resources.keys():
+if "machine_type" in job.resources:

-if "mem_mb" in job.resources.keys():
+if "mem_mb" in job.resources:

-if "disk_mb" in job.resources.keys():
+if "disk_mb" in job.resources:

-if "gpu" in job.resources.keys():
+if "gpu" in job.resources:

Also applies to: 260-260, 267-267, 273-273


📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 91e6a32 and efa69ee.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
snakemake_executor_plugin_kubernetes/__init__.py

187-187: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


260-260: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


267-267: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


273-273: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


304-304: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

192-216: GPU node selection implementation looks good!

The implementation correctly:

  1. Sets the node selector using the standard nvidia.com/gpu label
  2. Adds appropriate toleration for GPU nodes
  3. Includes helpful debug logging

@johanneskoester (Contributor) left a comment

Fantastic!

Comment on lines +275 to +277
container.resources.requests["nvidia.com/gpu"] = gpu_count
container.resources.limits["nvidia.com/gpu"] = gpu_count
self.logger.debug(f"Requested GPU resources: {gpu_count}")
@johanneskoester (Contributor):

What about non-NVIDIA GPUs? Maybe infer this from a gpu_model: str resource, which could allow values like nvidia or, more specifically, tesla, etc.?

@mrburke00 (Author):

Well, I already specify the machine type using machine_type="g2-standard-8". But I also believe that with GKE the node pool inherently specifies "nvidia.com/gpu"; without it, node scaling doesn't get triggered.

@johanneskoester (Contributor):

Ok, but this plugin is meant to be generic, so we should try to ensure that added functionality works with any k8s cluster.

@mrburke00 (Author) commented Jan 9, 2025:

I've tested this method with every available GPU type offered via GKE. It's agnostic to GPU type and only looks at the machine type.

See the GKE node pool description below. I'm selecting on machine type, not accelerator type. I can't think of any circumstance where you would select for an accelerator; you should be doing that via your Terraform script or however else you build the cluster outside of Snakemake.

autoscaling:
  enabled: true
  locationPolicy: BALANCED
  maxNodeCount: 5
config:
  accelerators:
  - acceleratorCount: '1'
    acceleratorType: nvidia-tesla-t4
    gpuDriverInstallationConfig:
      gpuDriverVersion: LATEST
  labels:
    accelerator: nvidia
    role: kubernetes
  loggingConfig: {}
  machineType: n1-standard-16

It's fine if you don't want to include this, but at least change the documentation to accurately reflect the fact that there is no GPU support offered via this plugin. It took me a lot of dev time to reach that realization, and I'm sure I'm not the only one; it sort of defeats the purpose of using K8s in the first place. Also, BTW, the Google Life Sciences API is deprecated, so Snakemake really has no native GPU resource support as it is.

@johanneskoester (Contributor):

Sorry, I did not mean to put this down in any way. I just try to keep the k8s plugin independent of GKE, and I thought that nvidia.com/gpu was very GKE-specific and that we might be able to offer a more generic implementation. As you are the expert here and not me, I might be getting it completely wrong though. Maybe we could schedule a short video meeting to discuss the details and help me understand the problem? Just send me an email to [email protected].

@johanneskoester (Contributor) commented Jan 15, 2025:

I have thought further about this; maybe I understand it a bit better now. First, there seems to be an AMD device plugin for k8s that handles GPU resources similarly: https://github.com/ROCm/k8s-device-plugin; also see https://github.com/ROCm/k8s-device-plugin/blob/master/example/pod/alexnet-gpu.yaml.
Second, I agree that the GPU model is usually something that is controlled by the machine type, and it does not seem necessary to control the model itself via the workflow. What might make sense, though, is to control the manufacturer via the workflow: e.g., if you have a rule that uses CUDA, you will want to annotate in your workflow that it requires an NVIDIA GPU.

So, maybe the following would work:
We suggest two standard resources, gpu=... and gpu_manufacturer=..., where the latter is optional and might usually just be set via --default-resources gpu_manufacturer=nvidia or gpu_manufacturer=amd, depending on what kind of k8s cluster one has.
Then, in this plugin, we use nvidia.com/gpu when the job has the resource gpu_manufacturer=nvidia and amd.com/gpu in the AMD case. If the requested gpu_manufacturer is unsupported, the plugin can throw an error.

Does that make sense, or do I miss something?
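
A minimal sketch of that mapping inside run_job could look like this. It assumes the container object, the job.resources access pattern, and the WorkflowError already used in this PR; gpu_manufacturer support itself is only a proposal at this point, so the names below are illustrative:

def gpu_resource_key(job_resources):
    # Map the proposed gpu_manufacturer resource to a k8s extended resource key.
    keys = {"nvidia": "nvidia.com/gpu", "amd": "amd.com/gpu"}
    manufacturer = "nvidia"
    if "gpu_manufacturer" in job_resources:
        manufacturer = str(job_resources["gpu_manufacturer"]).lower()
    if manufacturer not in keys:
        raise WorkflowError(
            f"Unsupported gpu_manufacturer {manufacturer!r}; "
            f"supported values are: {', '.join(keys)}"
        )
    return keys[manufacturer]

# In run_job, where the GPU request is built:
if "gpu" in job.resources:
    gpu_count = str(job.resources["gpu"])
    resource_key = gpu_resource_key(job.resources)
    container.resources.requests[resource_key] = gpu_count
    container.resources.limits[resource_key] = gpu_count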

@mrburke00 (Author):

Yeah I can shoot you an email.

And yes, that makes sense to me. Although it was only six months ago that a toolkit was established to allow AMD GPUs to execute CUDA code, which is what every scientific/ML tool uses for GPU acceleration. So the use cases for specifying a GPU manufacturer might be pretty rare. It doesn't even appear that GKE offers AMD GPUs.

The other way to look at this: I'd imagine anyone getting this far is going to have a fairly advanced understanding of K8s and Terraform, so we could just establish enough resource specifications to let users construct the node selector themselves. I believe the current behavior is that the scheduler just chooses the node with the minimum required computational resources, which is fine for most scenarios, but you could imagine a scenario where you want to choose a very specific node in a large cluster regardless of whether it has a GPU or not.
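
For illustration, letting users build the node selector from a workflow resource could look like this sketch. The node_selector resource name and the pod_spec variable are hypothetical, not something this PR implements:

# Hypothetical resource, e.g. in a rule:
#   resources: node_selector="cloud.google.com/gke-nodepool=gpu-pool,disktype=ssd"
if "node_selector" in job.resources:
    selectors = {}
    for pair in str(job.resources["node_selector"]).split(","):
        key, value = pair.split("=", 1)
        selectors[key.strip()] = value.strip()
    # pod_spec stands for the V1PodSpec being assembled in run_job.
    pod_spec.node_selector = selectors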
