Requires node topology labels to be set for known supported instance … #1100
base: master
[APPROVALNOTIFIER] This PR is NOT APPROVED
/triage accepted
@@ -83,17 +83,24 @@ func (c *Cloud) getAdditionalLabels(ctx context.Context, zoneName string, instan
// If topology labels are already set, skip.
if _, ok := existingLabels[LabelNetworkNodePrefix+"1"]; !ok {
I think this is fine for now, but we may need to revisit this if the topology of an instance can change over time with a node stop + start.
Agreed. The problem is that we may not have a hook to actually update that.
pkg/providers/v1/instances_v2.go
Outdated
// We don't expect that there will be a response for these instance types anyway,
// so we're going to move on without setting the labels.
klog.Warningf("Failed to get node topology. Moving on without setting labels. Ignoring %v", err)
nit: Could you match the previous log statement on L89 and include the instanceType that we didn't get a topology for?
Sure
pkg/resourcemanagers/topology.go
Outdated
// hpc6a.48xlarge | hpc6id.32xlarge | hpc7a.12xlarge | hpc7a.24xlarge | hpc7a.48xlarge | hpc7a.96xlarge | hpc7g.4xlarge | hpc7g.8xlarge | hpc7g.16xlarge
regexp.MustCompile("hpc[0-9]+[a-z]+\\.*"),
// p3dn.24xlarge | p4d.24xlarge | p4de.24xlarge | p5.48xlarge | p5e.48xlarge | p5en.48xlarge
regexp.MustCompile("p[0-9]+([a-z]+)?\\.*"),
I wonder if this ends up matching older hosts that don't support the topology API (say anyone still able to launch a p2 / p3) and those hosts can never join a cluster post this change?
> those hosts can never join a cluster post this change?
This doesn't require topologies to be returned. It just requires a definitive response, so if there are no topologies available, we'll be happy to move on without setting them.
> I wonder if this ends up matching older hosts that don't support the topology API (say anyone still able to launch a p2 / p3)
Yes. AFAIK, the availability of those instance types is diminishing as new instance types become available, so I didn't think it was worth adding the additional complexity to differentiate. p4 is the oldest instance type listed in pricing: https://aws.amazon.com/ec2/instance-types/
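To make the concern above concrete, here is a small sketch that runs the two patterns quoted from the diff against a few instance type names. The helper `matchesAny` and the sample list are illustrative assumptions, not code from the PR. Because the patterns are unanchored and `\.*` means "zero or more literal dots", only the family prefix is really checked, so older `p2` / `p3` names match as well.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the diff under review. They are unanchored, and `\.*`
// matches zero or more literal dots, so the size suffix is never validated.
var reviewedPatterns = []*regexp.Regexp{
	regexp.MustCompile("hpc[0-9]+[a-z]+\\.*"),
	regexp.MustCompile("p[0-9]+([a-z]+)?\\.*"),
}

// matchesAny is a hypothetical helper: it reports whether any of the
// reviewed patterns matches the given instance type.
func matchesAny(instanceType string) bool {
	for _, p := range reviewedPatterns {
		if p.MatchString(instanceType) {
			return true
		}
	}
	return false
}

func main() {
	samples := []string{"p2.xlarge", "p3.2xlarge", "p5.48xlarge", "hpc7a.12xlarge", "m5.large"}
	for _, it := range samples {
		fmt.Printf("%-16s matches=%v\n", it, matchesAny(it))
	}
}
```

As the reply notes, a match only means a definitive response is required, not that a topology must be returned.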
@@ -106,6 +123,17 @@ func (t *instanceTopologyManager) GetNodeTopology(ctx context.Context, instanceT
return nil, nil
}

// DoesInstanceTypeRequireResponse verifies whether or not we expect an instance to have an instance topology response.
func (t *instanceTopologyManager) DoesInstanceTypeRequireResponse(instanceType string) bool {
Should we include this check on L117 (i.e. when len(topologies) == 0) to protect against scenarios where a deleting instance has its topology wiped? That way we don't incorrectly mark that type as unsupported for 24h.
I think that would be undesirable in the event that our regex is too greedy or a type is added that matches and is not supported. That being said, I do think it probably requires some refactoring. Either:
- Create an instanceTopologyFound set to track which instance types we've seen topologies for in the past.
- Don't add an instance type to the unsupported list if it matches the regex.
Neither is ideal. It would be nice if we had a definitive response on whether the instance type is supported.
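The two options could be sketched roughly as follows. Everything here is an assumption for illustration: `topologyTracker`, `recordResult`, and `isUnsupported` are hypothetical names, the 24h expiry of the unsupported list is not modeled, and the pattern is the one from a later revision in this review. Both guards are shown side by side, although the comment proposes choosing one.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern taken from a later revision in this review.
var requireResponse = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)

// topologyTracker is a hypothetical stand-in for the manager's caches.
type topologyTracker struct {
	found       map[string]struct{} // option 1: types that returned a topology before
	unsupported map[string]struct{} // types we would stop querying for
}

func newTopologyTracker() *topologyTracker {
	return &topologyTracker{
		found:       map[string]struct{}{},
		unsupported: map[string]struct{}{},
	}
}

// recordResult updates the caches after a lookup that returned
// topologyCount topologies for instanceType.
func (t *topologyTracker) recordResult(instanceType string, topologyCount int) {
	if topologyCount > 0 {
		t.found[instanceType] = struct{}{}
		return
	}
	// Option 1: never mark a type unsupported once it has returned a topology.
	if _, seen := t.found[instanceType]; seen {
		return
	}
	// Option 2: never mark a type unsupported if it matches the pattern.
	if requireResponse.MatchString(instanceType) {
		return
	}
	t.unsupported[instanceType] = struct{}{}
}

// isUnsupported reports whether instanceType is on the unsupported list.
func (t *topologyTracker) isUnsupported(instanceType string) bool {
	_, ok := t.unsupported[instanceType]
	return ok
}

func main() {
	tr := newTopologyTracker()
	tr.recordResult("m5.large", 0)    // no topology, no guard applies
	tr.recordResult("p5.48xlarge", 0) // protected by the pattern (option 2)
	tr.recordResult("g6.xlarge", 1)   // a topology was seen once
	tr.recordResult("g6.xlarge", 0)   // protected by the found set (option 1)
	for _, it := range []string{"m5.large", "p5.48xlarge", "g6.xlarge"} {
		fmt.Printf("%-12s unsupported=%v\n", it, tr.isUnsupported(it))
	}
}
```

Option 1 only helps after a type has been observed at least once, while option 2 inherits any over-matching in the regex, which is the trade-off the comment describes.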
p3dn.24xlarge | p4d.24xlarge | p4de.24xlarge | p5.48xlarge | p5e.48xlarge | p5en.48xlarge
trn1.2xlarge | trn1.32xlarge | trn1n.32xlarge | trn2.48xlarge | trn2u.48xlarge
*/
var defaultInstanceTypePatternRequireResponse = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)
I guess A-Z is unnecessary...
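A quick way to check that remark is to compare the anchored pattern from the diff with a lowercase-only variant on the instance types listed in the comment block. The variant `lowercaseOnly` and the sample list are assumptions for illustration; this sketch only shows that the two agree on these samples, not on every possible name.

```go
package main

import (
	"fmt"
	"regexp"
)

// current is the pattern as it appears in the diff.
var current = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*(\.[0-9a-zA-Z]*)$`)

// lowercaseOnly is a hypothetical variant with A-Z dropped from the size suffix.
var lowercaseOnly = regexp.MustCompile(`^(hpc|trn|p|inf)[0-9]+[a-z]*\.[0-9a-z]*$`)

// samples mixes types from the comment block with two names that should not match.
var samples = []string{
	"p3dn.24xlarge", "p5en.48xlarge", "trn1.2xlarge", "trn2u.48xlarge",
	"hpc7g.16xlarge", "m5.large", "p5",
}

func main() {
	for _, it := range samples {
		fmt.Printf("%-16s current=%v lowercaseOnly=%v\n",
			it, current.MatchString(it), lowercaseOnly.MatchString(it))
	}
}
```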
…types
What type of PR is this?
/kind bug
What this PR does / why we need it:
When support for node topology labels was added to the cloud provider, it assumed that any missing labels would eventually be applied, but that's not the case. As a result, if the node controller gets throttled, the instance metadata is returned without the labels and the node goes into service without them set.
This PR keeps the same behavior if we don't expect the instance topology API to return results. If the instance type matches one of the allowlisted patterns, however, the node won't be marked as good to go until the lookup has either succeeded or failed with an unrecoverable error (for example, the API isn't supported in the region or the cluster role doesn't have permissions).
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
Does this PR introduce a user-facing change?: