I’m currently working with TorchX in conjunction with Volcano scheduling for my training jobs on an Amazon EKS cluster. I’ve also integrated Karpenter autoscaler for effective node scaling. Additionally, I’m using managed node groups with labeled nodes that have specific taints applied.
Our internal data and machine learning teams need to specify NodeSelectors and Tolerations to target jobs on particular nodes or managed node groups. According to the TorchX Specifications documentation, capabilities={"node.kubernetes.io/instance-type": ""} is used as the NodeSelector when the job is created through Volcano. However, this approach doesn’t seem to allow for sending a list of labels, which our use case demands.
Furthermore, I’m also interested in incorporating tolerations into these jobs to ensure proper scheduling and execution in our environment. If any of you have experience in implementing NodeSelectors and Tolerations in TorchX within an Amazon EKS setup, I would highly appreciate your insights and advice.
If there’s no previous experience with this scenario, I’m considering raising a feature request to address these needs. Your guidance and input would be greatly valued.
NOTE TO MAINTAINERS
I'm eager to contribute by creating a pull request for this exciting new feature, even though I'm still getting familiar with the repository and the whole PyTorch environment. Since I'm new to the process, I'd really appreciate some guidance on how to set up and run TorchX locally, as well as how to carry out unit and integration tests. This knowledge will be invaluable in making sure my contributions align well with the existing code and testing procedures. Thanks a lot for your support!
Motivation/Background
In our current setup, we are utilizing TorchX, Volcano scheduling, and Karpenter autoscaling to manage training jobs on our Amazon EKS cluster. We have specific requirements to target jobs on nodes with certain labels and taints due to the nature of our workloads. However, the existing TorchX functionality only allows for specifying a single NodeSelector label, which is limiting for our use case. Additionally, we need the ability to incorporate tolerations into our job specifications for effective scheduling.
Detailed Proposal
I propose enhancing the TorchX functionality to allow users to provide multiple NodeSelector labels as a Dict[str, str] and tolerations as a list of V1Toleration in the pod definition. This will enable users to precisely target nodes and managed node groups based on a wider range of labels and handle scheduling constraints effectively.
The changes will involve modifying the role_to_pod method to accept two new parameters:
node_selectors: Dict[str, str]: This parameter will allow users to provide multiple node selector labels for their jobs. It extends the existing single-label node selector to accept more than one label.
tolerations: List[V1Toleration]: This parameter will allow users to provide tolerations to handle node taints effectively.
These parameters will be included in the pod specification when creating a new pod using TorchX and Volcano.
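To make the proposal concrete, here is a minimal sketch of how the two parameters could end up in a pod specification. It is not TorchX's actual role_to_pod: build_pod_spec and the Toleration dataclass are hypothetical stand-ins (the latter for the Kubernetes client's V1Toleration), the pod spec is a plain dict, and the image name, the "team" label, and the "dedicated" taint are made-up example values.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Toleration:
    """Hypothetical plain stand-in for the Kubernetes client's V1Toleration."""

    key: str
    operator: str = "Equal"
    value: Optional[str] = None
    effect: Optional[str] = None

    def to_dict(self) -> Dict[str, str]:
        # Drop unset fields so the resulting manifest stays minimal.
        raw = {
            "key": self.key,
            "operator": self.operator,
            "value": self.value,
            "effect": self.effect,
        }
        return {k: v for k, v in raw.items() if v is not None}


def build_pod_spec(
    image: str,
    node_selectors: Optional[Dict[str, str]] = None,
    tolerations: Optional[List[Toleration]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a pod spec dict with optional scheduling constraints."""
    spec: Dict[str, Any] = {"containers": [{"name": "trainer", "image": image}]}
    if node_selectors:
        # Every label must match for a node to be eligible.
        spec["nodeSelector"] = dict(node_selectors)
    if tolerations:
        spec["tolerations"] = [t.to_dict() for t in tolerations]
    return spec


# Example values only: the image, the "team" label, and the taint are assumptions.
pod_spec = build_pod_spec(
    image="example.com/trainer:latest",
    node_selectors={
        "node.kubernetes.io/instance-type": "p3.2xlarge",
        "team": "ml",
    },
    tolerations=[Toleration(key="dedicated", value="training", effect="NoSchedule")],
)
```

Both fields are only set when provided, so in this sketch a job submitted without them would produce the same pod spec as before.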
Alternatives
An alternative approach would be to manually modify the generated pod specification after it's created using TorchX. However, this approach would require additional steps and could lead to inconsistencies between the job definition and the actual pod specification.
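For comparison, the manual alternative might look roughly like the sketch below. It assumes the generated Volcano Job is available as a plain dict with pod templates under spec.tasks[].template.spec; patch_job_manifest is a hypothetical helper, not part of TorchX, and the label and taint values are examples.

```python
import copy
from typing import Any, Dict, List


def patch_job_manifest(
    job: Dict[str, Any],
    node_selectors: Dict[str, str],
    tolerations: List[Dict[str, str]],
) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the job with extra scheduling constraints."""
    patched = copy.deepcopy(job)  # leave the generated manifest untouched
    for task in patched.get("spec", {}).get("tasks", []):
        pod_spec = task.setdefault("template", {}).setdefault("spec", {})
        # Merge with any selector that is already present.
        pod_spec.setdefault("nodeSelector", {}).update(node_selectors)
        pod_spec.setdefault("tolerations", []).extend(tolerations)
    return patched


# A trimmed-down manifest with example values, not real TorchX output.
job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "spec": {
        "tasks": [
            {
                "name": "trainer-0",
                "template": {
                    "spec": {
                        "nodeSelector": {
                            "node.kubernetes.io/instance-type": "p3.2xlarge"
                        },
                        "containers": [],
                    }
                },
            }
        ]
    },
}

patched = patch_job_manifest(
    job,
    node_selectors={"team": "ml"},
    tolerations=[{"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}],
)
```

The patch has to be repeated for every submission and lives outside the job definition, which is where the inconsistency risk comes from.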
Additional context/links