Feature: Support for Multiple NodeSelectors and Tolerations in TorchX for Kubernetes #753

Open
vara-bonthu opened this issue Aug 15, 2023 · 0 comments

vara-bonthu commented Aug 15, 2023

Description

I’m currently using TorchX together with Volcano scheduling for my training jobs on an Amazon EKS cluster. I’ve also integrated the Karpenter autoscaler for node scaling. Additionally, I’m using managed node groups whose nodes are labeled and have specific taints applied.

Our internal data and machine learning teams need to specify NodeSelectors and Tolerations so that jobs land on particular nodes or managed node groups. According to the TorchX Specifications documentation, capabilities={"node.kubernetes.io/instance-type": ""} is used as a NodeSelector when the job is created through Volcano. However, this approach doesn’t seem to allow passing a list of labels, which our use case demands.

Furthermore, I’m also interested in incorporating tolerations into these jobs to ensure proper scheduling and execution in our environment. If any of you have experience in implementing NodeSelectors and Tolerations in TorchX within an Amazon EKS setup, I would highly appreciate your insights and advice.

If there’s no previous experience with this scenario, I’m considering raising a feature request to address these needs. Your guidance and input would be greatly valued.

NOTE TO MAINTAINERS
I'm eager to contribute by creating a pull request for this exciting new feature, even though I'm still getting familiar with the repository and the whole PyTorch environment. Since I'm new to the process, I'd really appreciate some guidance on how to set up and run TorchX locally, as well as how to carry out unit and integration tests. This knowledge will be invaluable in making sure my contributions align well with the existing code and testing procedures. Thanks a lot for your support!

Motivation/Background

In our current setup, we are utilizing TorchX, Volcano scheduling, and Karpenter autoscaling to manage training jobs on our Amazon EKS cluster. We have specific requirements to target jobs on nodes with certain labels and taints due to the nature of our workloads. However, the existing TorchX functionality only allows for specifying a single NodeSelector label, which is limiting for our use case. Additionally, we need the ability to incorporate tolerations into our job specifications for effective scheduling.

Detailed Proposal

I propose enhancing the TorchX functionality to allow users to provide multiple NodeSelector labels as a Dict[str, str] and tolerations as a list of V1Toleration in the pod definition. This will enable users to precisely target nodes and managed node groups based on a wider range of labels and handle scheduling constraints effectively.

The changes will involve modifying the role_to_pod method to accept two new parameters:

node_selectors: Dict[str, str]: This parameter will allow users to provide multiple node selector labels for their jobs. It extends the existing single-label behavior to accept more than one label.
tolerations: List[V1Toleration]: This parameter will allow users to provide tolerations so that jobs can be scheduled onto tainted nodes.

These parameters will be included in the pod specification when creating a new pod using TorchX and Volcano.
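
To make the proposal concrete, here is a minimal sketch of how the two parameters could be merged into a pod specification. It is not TorchX's actual role_to_pod implementation: the helper names make_toleration and apply_scheduling are hypothetical, the label values are made-up examples, and plain dictionaries stand in for the V1Toleration objects so the snippet runs without the Kubernetes client installed. Only the pod-spec field names (nodeSelector, tolerations, key, operator, value, effect) come from Kubernetes itself.

```python
from typing import Any, Dict, List, Optional


def make_toleration(
    key: str, value: Optional[str] = None, effect: str = "NoSchedule"
) -> Dict[str, str]:
    """Build one toleration entry in Kubernetes pod-spec form (hypothetical helper)."""
    # "Exists" tolerates any value for the key; "Equal" requires a matching value.
    toleration = {
        "key": key,
        "operator": "Exists" if value is None else "Equal",
        "effect": effect,
    }
    if value is not None:
        toleration["value"] = value
    return toleration


def apply_scheduling(
    pod_spec: Dict[str, Any],
    node_selectors: Optional[Dict[str, str]] = None,
    tolerations: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Return a copy of pod_spec with node selectors and tolerations merged in."""
    spec = dict(pod_spec)  # shallow copy so the caller's dict is left untouched
    if node_selectors:
        spec["nodeSelector"] = {**spec.get("nodeSelector", {}), **node_selectors}
    if tolerations:
        spec["tolerations"] = list(spec.get("tolerations", [])) + list(tolerations)
    return spec


# Example values only: the image name, labels, and taint key are assumptions.
pod_spec = {"containers": [{"name": "trainer", "image": "example/trainer:latest"}]}
scheduled = apply_scheduling(
    pod_spec,
    node_selectors={
        "node.kubernetes.io/instance-type": "p4d.24xlarge",
        "team": "ml",
    },
    tolerations=[make_toleration("nvidia.com/gpu")],
)
```

Because all entries in nodeSelector must match for a node to be eligible, accepting a full dictionary lets a job target, for example, both an instance type and a team-specific node group at once.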

Alternatives

An alternative approach would be to manually modify the generated pod specification after it's created using TorchX. However, this approach would require additional steps and could lead to inconsistencies between the job definition and the actual pod specification.
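
For comparison, the manual workaround might look roughly like the sketch below. It assumes the generated Volcano Job manifest is available as a plain dictionary with pod templates under spec.tasks[].template.spec; how that manifest is obtained from TorchX is left out, and patch_volcano_job is a hypothetical helper, not an existing API.

```python
import copy
from typing import Any, Dict, List


def patch_volcano_job(
    job: Dict[str, Any],
    node_selectors: Dict[str, str],
    tolerations: List[Dict[str, str]],
) -> Dict[str, Any]:
    """Return a copy of a Volcano Job manifest with scheduling fields added to every task."""
    patched = copy.deepcopy(job)  # never mutate the manifest that was generated
    for task in patched.get("spec", {}).get("tasks", []):
        # Assumed layout: each task carries a pod template at template.spec.
        pod_spec = task.setdefault("template", {}).setdefault("spec", {})
        pod_spec.setdefault("nodeSelector", {}).update(node_selectors)
        pod_spec.setdefault("tolerations", []).extend(tolerations)
    return patched


# Example manifest fragment; names and values are illustrative only.
job = {
    "spec": {
        "tasks": [
            {"name": "trainer-0", "template": {"spec": {"containers": []}}},
            {"name": "trainer-1", "template": {"spec": {"containers": []}}},
        ]
    }
}
patched_job = patch_volcano_job(
    job,
    node_selectors={"team": "ml"},
    tolerations=[{"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}],
)
```

Every task has to be patched separately, and the patched manifest no longer matches what the TorchX job definition describes, which is the inconsistency risk mentioned above.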

Additional context/links
