Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade kueue default version to v0.9.0 to support TAS #3260

Open
wants to merge 2 commits into
base: experimental
Choose a base branch
from

Conversation

ighosh98
Copy link
Contributor

@ighosh98 ighosh98 commented Nov 14, 2024

Added support for kueue v0.9.0 which was released recently with TAS enabled for all clusters. As aligned offline, Kueue-v0.9.0 would now be the default Kueue version choice for customers.

kueue-v0.9.0 has TAS enabled in alpha phase. Have added necessary tolerations to kueue-v0.9.0 file to make the manifest compatible with cluster-toolkit.

Testing Information

Testing was done on a2-high clusters.

Ran the command:
kubectl get pods -o custom-columns="Name:.metadata.name,Host:.spec.nodeSelector.cloud\.google\.com/gce-topology-host" | sort -k2
Job Annotation:

        kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-host"

Output:
Was testing host topology compliance required job. Observe that all pods are on the same host:

tas-sample-small-required-host-gqm74-0-sksh6   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-gqm74-1-gls2s   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-gqm74-2-7vchk   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-gqm74-3-rnhqv   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-x72sr-0-5bh4v   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-x72sr-1-h7b9k   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-x72sr-2-c4prq   2131055a3f84f1c773b6cddca39d1a24
tas-sample-small-required-host-x72sr-3-gwkb9   2131055a3f84f1c773b6cddca39d1a24
Name                                           Host 

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

@ighosh98 ighosh98 added the enhancement New feature or request label Nov 14, 2024
@@ -41,15 +41,15 @@ variable "kueue" {
description = "Install and configure [Kueue](https://kueue.sigs.k8s.io/docs/overview/) workload scheduler. A configuration yaml/template file can be provided with config_path to be applied right after kueue installation. If a template file provided, its variables can be set to config_template_vars."
type = object({
install = optional(bool, false)
version = optional(string, "v0.8.1")
version = optional(string, "v0.9.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not update the default , but use the version we want from the blueprint

Copy link
Contributor Author

@ighosh98 ighosh98 Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why we should not update the default? My rationale for updating was that the changes should be backward compatible across versions and TAS support is independent of the node type we choose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on all versions being backward compatible and updating the default eventually @ighosh98. We should wait till we know for a fact that 0.9.0 is stable though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check with the Kueue team once on their confidence wrt the stability question. My understanding is that if they release a version, it should be stable.
If the concern is about TAS being in alpha phase and we want to reduce the blast radius associated with it, then I am aligned with changing the default to the older version until we are out oh alpha.
Let me know which one it is.

condition = !var.kueue.install || contains(["v0.8.1"], var.kueue.version)
error_message = "Supported version of Kueue is v0.8.1"
condition = !var.kueue.install || contains(["v0.9.0", "v0.8.1"], var.kueue.version)
error_message = "Default Supported version of Kueue is v0.9.0. Cluster toolkit also supports v0.8.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Update error message to just output supported versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating this

config_path = optional(string, null)
config_template_vars = optional(map(any), null)
})
default = {}

validation {
condition = !var.kueue.install || contains(["v0.8.1"], var.kueue.version)
error_message = "Supported version of Kueue is v0.8.1"
condition = !var.kueue.install || contains(["v0.9.0", "v0.8.1"], var.kueue.version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store this in a list, and use it in the error message as well as the condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting clarification: Isn't ["v0.9.0", "v0.8.1"] already a list? Are you recommending that we should store it in a new variable and then use it elsewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, use a new variable and avoid the hardcoded versions in the error message. Use the variable in the error message

@ighosh98 ighosh98 self-assigned this Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants