Requires node topology labels to be set for known supported instance … #1100

Open
wants to merge 1 commit into
base: master

@mmerkes mmerkes commented Feb 6, 2025

…types

What type of PR is this?
/kind bug

What this PR does / why we need it:
When support for node topology labels was added to the cloud provider, it assumed that any missing labels would eventually be applied, but that's not the case. As a result, if the node controller is throttled, the instance metadata is returned without the labels and the node goes into service without them.

This PR keeps the same behavior if we don't expect the instance type to have topology results. If the instance type matches one of the allowlisted patterns, the node won't be marked as good to go until we've received either a successful response or an unrecoverable one (for example, topology isn't supported in the region or the cluster role doesn't have permissions).
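The decision described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: `additionalLabels`, `topologyFetcher`, and `errThrottled` are hypothetical names, and the fetcher is assumed to return `(nil, nil)` for definitive "no topology" answers and a non-nil error only for retryable failures such as throttling.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// errThrottled is a hypothetical stand-in for a retryable API failure.
var errThrottled = errors.New("request throttled")

// requireResponse mirrors the allowlist pattern proposed in this PR.
var requireResponse = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)

// topologyFetcher is a hypothetical lookup. It is assumed to return
// (nil, nil) for definitive "no topology" answers (unsupported region,
// missing permissions) and a non-nil error for retryable failures.
type topologyFetcher func(instanceType string) (map[string]string, error)

// additionalLabels returns the labels to apply, or an error when the node
// should not be marked ready yet and the caller should retry.
func additionalLabels(instanceType string, fetch topologyFetcher) (map[string]string, error) {
	labels, err := fetch(instanceType)
	if err == nil {
		return labels, nil
	}
	if requireResponse.MatchString(instanceType) {
		// Allowlisted type: surface the error so the node is retried
		// instead of going into service without topology labels.
		return nil, fmt.Errorf("topology required for %s: %w", instanceType, err)
	}
	// No topology expected for this type: move on without labels.
	return nil, nil
}

func main() {
	throttled := func(string) (map[string]string, error) { return nil, errThrottled }
	for _, it := range []string{"p5.48xlarge", "m5.large"} {
		labels, err := additionalLabels(it, throttled)
		fmt.Printf("%s: labels=%v err=%v\n", it, labels, err)
	}
}
```

In this sketch a throttled lookup blocks only the allowlisted type; every other type keeps the previous "warn and continue" behavior.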

Which issue(s) this PR fixes:
N/A

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from mmerkes. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2025
@mmerkes
Contributor Author

mmerkes commented Feb 6, 2025

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 6, 2025
@@ -83,17 +83,24 @@ func (c *Cloud) getAdditionalLabels(ctx context.Context, zoneName string, instan
// If topology labels are already set, skip.
if _, ok := existingLabels[LabelNetworkNodePrefix+"1"]; !ok {

I think this is fine for now, but we may need to revisit this if the topology of an instance can change over time with a node stop + start.

Contributor Author

Agreed. The problem is that we may not have a hook to actually update that.


// We don't expect that there will be a response for these instance types anyway,
// so we're going to move on without setting the labels.
klog.Warningf("Failed to get node topology. Moving on without setting labels. Ignoring %v", err)

nit: Could you match the previous log statement on L89 and include the instanceType that we didn't get a topology for?

Contributor Author

Sure

// hpc6a.48xlarge | hpc6id.32xlarge | hpc7a.12xlarge | hpc7a.24xlarge | hpc7a.48xlarge | hpc7a.96xlarge | hpc7g.4xlarge | hpc7g.8xlarge | hpc7g.16xlarge
regexp.MustCompile("hpc[0-9]+[a-z]+\\.*"),
// p3dn.24xlarge | p4d.24xlarge | p4de.24xlarge | p5.48xlarge | p5e.48xlarge | p5en.48xlarge
regexp.MustCompile("p[0-9]+([a-z]+)?\\.*"),

I wonder if this ends up matching older hosts that don't support the topology API (say anyone still able to launch a p2 / p3) and those hosts can never join a cluster post this change?

Contributor Author

> those hosts can never join a cluster post this change?

This doesn't require topologies to be returned. It just requires a definitive response, so if there are no topologies available, we'll be happy to move on without setting them.

> I wonder if this ends up matching older hosts that don't support the topology API (say anyone still able to launch a p2 / p3)

Yes. AFAIK, the availability of those instance types is diminishing as new instance types become available, so I didn't think it was worth adding the additional complexity to differentiate. p4 is the oldest instance type listed in pricing: https://aws.amazon.com/ec2/instance-types/
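To make the concern concrete, here is a small check of the `p` pattern quoted in the diff above against a few instance type names. The input list is chosen for illustration only; the pattern string is copied from the diff as-is.

```go
package main

import (
	"fmt"
	"regexp"
)

// pPattern is the "p" family pattern quoted from the diff under review.
var pPattern = regexp.MustCompile("p[0-9]+([a-z]+)?\\.*")

func main() {
	// Illustrative inputs: two older generations, two newer ones, and a
	// general-purpose type that should not match.
	for _, it := range []string{"p2.xlarge", "p3.2xlarge", "p4d.24xlarge", "p5.48xlarge", "m5.large"} {
		fmt.Printf("%-14s matches=%v\n", it, pPattern.MatchString(it))
	}
}
```

The pattern does accept `p2` and `p3` names, which matches the reply above: those types are treated as requiring a definitive response, not as requiring topologies to exist.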

@@ -106,6 +123,17 @@ func (t *instanceTopologyManager) GetNodeTopology(ctx context.Context, instanceT
return nil, nil
}

// DoesInstanceTypeRequireResponse verifies whether or not we expect an instance to have an instance topology response.
func (t *instanceTopologyManager) DoesInstanceTypeRequireResponse(instanceType string) bool {

Should we include this check on L117 (i.e., when len(topologies) == 0) to protect against scenarios where a deleting instance has its topology wiped? That way we don't incorrectly mark that type as unsupported for 24h.

Contributor Author

I think that would be undesirable in the event that our regex is too greedy or a type is added that matches but is not supported. That being said, I do think it probably requires some refactoring. Either:

  1. Create an instanceTopologyFound set to track which instance types we've seen topologies for in the past.
  2. Don't add the instance type to the unsupported list if it matches the regex.

Neither is ideal. It would be nice if we had a definitive response on whether the instance type is supported.
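As a rough illustration of option 2, the sketch below skips the unsupported-type cache for anything that matches the allowlist pattern. Everything here is hypothetical (`unsupportedCache`, `recordEmptyResponse`, `isUnsupported`, and the fixed `epoch` used for the demo); only the pattern and the 24h window come from this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// requireResponse mirrors the allowlist pattern proposed in this PR.
var requireResponse = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)

// unsupportedTTL is the 24h window mentioned in the review comment.
const unsupportedTTL = 24 * time.Hour

// epoch is an arbitrary fixed time used only to keep the demo deterministic.
var epoch = time.Date(2025, 2, 6, 0, 0, 0, 0, time.UTC)

// unsupportedCache is a hypothetical stand-in for the manager's record of
// instance types that returned no topology.
type unsupportedCache struct {
	expires map[string]time.Time
}

func newUnsupportedCache() *unsupportedCache {
	return &unsupportedCache{expires: make(map[string]time.Time)}
}

// recordEmptyResponse is called when the lookup returned zero topologies.
// Types matching the allowlist are never cached as unsupported, so one
// empty response cannot suppress lookups for the next 24h.
func (c *unsupportedCache) recordEmptyResponse(instanceType string, now time.Time) {
	if requireResponse.MatchString(instanceType) {
		return
	}
	c.expires[instanceType] = now.Add(unsupportedTTL)
}

// isUnsupported reports whether the type is currently cached as unsupported.
func (c *unsupportedCache) isUnsupported(instanceType string, now time.Time) bool {
	exp, ok := c.expires[instanceType]
	return ok && now.Before(exp)
}

func main() {
	c := newUnsupportedCache()
	for _, it := range []string{"p5.48xlarge", "m5.large"} {
		c.recordEmptyResponse(it, epoch)
		fmt.Printf("%s unsupported=%v\n", it, c.isUnsupported(it, epoch.Add(time.Hour)))
	}
}
```

The trade-off raised above still applies: if the pattern is too greedy, a matching but genuinely unsupported type would be looked up on every call rather than cached.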

p3dn.24xlarge | p4d.24xlarge | p4de.24xlarge | p5.48xlarge | p5e.48xlarge | p5en.48xlarge
trn1.2xlarge | trn1.32xlarge | trn1n.32xlarge | trn2.48xlarge | trn2u.48xlarge
*/
var defaultInstanceTypePatternRequireResponse = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)
Contributor Author

I guess A-Z is unnecessary...
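A quick way to check that claim is to compare the pattern from the diff with a lower-case-only variant. The `lowerOnly` pattern and the `differing` helper below are assumptions made for this comparison, not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// withUpper is the pattern from the diff.
var withUpper = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)

// lowerOnly is a hypothetical simplification that drops A-Z.
var lowerOnly = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-z]*)$`)

// differing returns the inputs on which the two patterns disagree.
func differing(types []string) []string {
	var out []string
	for _, t := range types {
		if withUpper.MatchString(t) != lowerOnly.MatchString(t) {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Instance types taken from the comments in the diff, plus one
	// general-purpose type as a negative case.
	listed := []string{
		"hpc6a.48xlarge", "hpc7g.16xlarge",
		"p3dn.24xlarge", "p5en.48xlarge",
		"trn1.2xlarge", "trn2u.48xlarge",
		"m5.large",
	}
	fmt.Println("disagreements on listed types:", differing(listed))
	// The patterns only diverge when the size suffix contains upper case.
	fmt.Println("disagreements on upper-case input:", differing([]string{"p5.48XLARGE"}))
}
```

On the lower-case names listed in the diff comments the two patterns agree, so dropping `A-Z` would only change behavior for inputs with an upper-case suffix.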

3 participants