Feature: Support for Multiple NodeSelectors and Tolerations in TorchX for Kubernetes #753

vara-bonthu · 2023-08-15T21:55:30Z

Description

I’m currently working with TorchX in conjunction with Volcano scheduling for my training jobs on an Amazon EKS cluster. I’ve also integrated Karpenter autoscaler for effective node scaling. Additionally, I’m using managed node groups with labeled nodes that have specific taints applied.

Our internal data and machine learning teams need to specify NodeSelectors and Tolerations so that jobs land on particular nodes or managed node groups. While referring to the documentation provided here: TorchX Specifications, I observed that capabilities={"node.kubernetes.io/instance-type": ""} is used as the NodeSelector when the job is created through Volcano. However, this approach doesn't seem to allow passing more than one label, which our use case demands.

Furthermore, I’m also interested in incorporating tolerations into these jobs to ensure proper scheduling and execution in our environment. If any of you have experience in implementing NodeSelectors and Tolerations in TorchX within an Amazon EKS setup, I would highly appreciate your insights and advice.

If there’s no previous experience with this scenario, I’m considering raising a feature request to address these needs. Your guidance and input would be greatly valued.

NOTE TO MAINTAINERS
I'm eager to contribute by creating a pull request for this exciting new feature, even though I'm still getting familiar with the repository and the whole PyTorch environment. Since I'm new to the process, I'd really appreciate some guidance on how to set up and run TorchX locally, as well as how to carry out unit and integration tests. This knowledge will be invaluable in making sure my contributions align well with the existing code and testing procedures. Thanks a lot for your support!

Motivation/Background

In our current setup, we are utilizing TorchX, Volcano scheduling, and Karpenter autoscaling to manage training jobs on our Amazon EKS cluster. We have specific requirements to target jobs on nodes with certain labels and taints due to the nature of our workloads. However, the existing TorchX functionality only allows for specifying a single NodeSelector label, which is limiting for our use case. Additionally, we need the ability to incorporate tolerations into our job specifications for effective scheduling.

Detailed Proposal

I propose enhancing the TorchX functionality to allow users to provide multiple NodeSelector labels as a Dict[str, str] and tolerations as a list of V1Toleration in the pod definition. This will enable users to precisely target nodes and managed node groups based on a wider range of labels and handle scheduling constraints effectively.

The changes will involve modifying the role_to_pod method to accept two new parameters:

node_selectors: Dict[str, str]: This parameter will allow users to provide multiple node selector labels for their jobs. It extends the existing single-label behavior to accept more than one label.
tolerations: List[V1Toleration]: This parameter will allow users to provide tolerations to handle node taints effectively.

These parameters will be included in the pod specification when creating a new pod using TorchX and Volcano.
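To make the proposal concrete, here is a minimal sketch of how the two parameters could end up in a pod specification. It is not the TorchX implementation: `Toleration` is a hypothetical stand-in for `V1Toleration`, `build_pod_spec` is a hypothetical helper, and the label, taint, and image values are illustrative only. The output uses the standard Kubernetes pod spec fields `nodeSelector` and `tolerations`.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Toleration:
    """Hypothetical stand-in for V1Toleration (not the real client class)."""

    key: Optional[str] = None
    operator: str = "Equal"  # Kubernetes accepts "Equal" or "Exists"
    value: Optional[str] = None
    effect: Optional[str] = None  # e.g. "NoSchedule"

    def to_manifest(self) -> Dict[str, Any]:
        # Drop unset fields so the manifest only carries what the user provided.
        return {k: v for k, v in asdict(self).items() if v is not None}


def build_pod_spec(
    containers: List[Dict[str, Any]],
    node_selectors: Optional[Dict[str, str]] = None,
    tolerations: Optional[List[Toleration]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a pod spec dict with optional constraints."""
    spec: Dict[str, Any] = {"containers": containers}
    if node_selectors:
        # All labels must match for a node to be eligible.
        spec["nodeSelector"] = dict(node_selectors)
    if tolerations:
        spec["tolerations"] = [t.to_manifest() for t in tolerations]
    return spec


# Illustrative values only; none of these come from a real cluster.
example_spec = build_pod_spec(
    containers=[{"name": "trainer", "image": "example.com/trainer:latest"}],
    node_selectors={
        "node.kubernetes.io/instance-type": "p4d.24xlarge",
        "team": "ml-training",
    },
    tolerations=[Toleration(key="dedicated", value="training", effect="NoSchedule")],
)
```

Keeping both parameters optional means existing jobs that pass neither would produce the same pod specification as before.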

Alternatives

An alternative approach would be to manually modify the generated pod specification after it's created using TorchX. However, this approach would require additional steps and could lead to inconsistencies between the job definition and the actual pod specification.
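For comparison, the manual workaround might look roughly like the sketch below. It assumes the generated Volcano Job is available as a plain dictionary in manifest form, with pod templates under `spec.tasks[].template.spec`; `patch_volcano_job` is a hypothetical helper, not part of TorchX or Volcano.

```python
import copy
from typing import Any, Dict, List


def patch_volcano_job(
    job: Dict[str, Any],
    node_selectors: Dict[str, str],
    tolerations: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the job with constraints added
    to every task's pod template. The input dictionary is left unchanged."""
    patched = copy.deepcopy(job)
    for task in patched["spec"]["tasks"]:
        pod_spec = task["template"]["spec"]
        # Merge with any selector that is already present on the task.
        pod_spec.setdefault("nodeSelector", {}).update(node_selectors)
        # Append to any tolerations that are already present on the task.
        pod_spec.setdefault("tolerations", []).extend(copy.deepcopy(tolerations))
    return patched
```

Because the patch is applied after the job definition is generated, the TorchX component no longer fully describes what runs on the cluster, which is the inconsistency mentioned above.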

Additional context/links

vara-bonthu mentioned this issue Aug 16, 2023

feat: Trainium on EKS architecture, karpenter support and more examples awslabs/data-on-eks#295

Merged

5 tasks
