[CPM] Restoration of cluster fails if it's Infrastructure resource on the source Seed was annotated with migration.azure.provider.extensions.gardener.cloud/zone #827

Open
plkokanov opened this issue Apr 14, 2024 · 1 comment · May be fixed by #907
Assignees
Labels
area/control-plane-migration Control plane migration related kind/bug Bug platform/azure Microsoft Azure platform/infrastructure

Comments

plkokanov commented Apr 14, 2024

How to categorize this issue?

/area control-plane-migration
/kind bug
/platform azure

What happened:
During control plane migration of an HA shoot cluster (using zones z1, z2, and z3), for which the infrastructure resource is annotated with migration.azure.provider.extensions.gardener.cloud/zone, the infrastructure resource is not successfully restored with the following error:

* creating Subnet: (Name "<vnet-name>-nodes-z3" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="NetcfgSubnetRangesOverlap" Message="Subnet '<vnet-name>-nodes-z3' is not valid because its IP address range overlaps with that of an existing subnet in virtual network '<vnet-name>'." Details=[]
  with azurerm_subnet.workers-z3,
  on main.tf line 167, in resource "azurerm_subnet" "workers-z3":
 167: resource "azurerm_subnet" "workers-z3" {
* deleting Subnet: (Name "<vnet-name>-nodes" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#Delete: Failure sending request: StatusCode=400 -- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet<vnet-name>-nodes is in use by /subscriptions/<omitted>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/networkInterfaces/<nic-id>-NIC/ipConfigurations/<nic-id>-NIC and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]]

Basically, during the restore phase of control plane migration for the infrastructure resource, the provider-azure extension tried to delete the <vnet-name>-nodes subnet and create <vnet-name>-nodes-z3. This seems to have happened because the infrastructure resource in the destination seed did not have a migration.azure.provider.extensions.gardener.cloud/zone: "3" annotation.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The migration.azure.provider.extensions.gardener.cloud/zone annotation is put on the infrastructure resource via a mutating webhook here:

for _, z := range newProviderCfg.Networks.Zones {
    if z.CIDR == *oldProviderCfg.Networks.Workers {
        extensionswebhook.LogMutation(logger, newInfra.Kind, newInfra.Namespace, newInfra.Name)
        if newInfra.Annotations == nil {
            newInfra.Annotations = make(map[string]string)
        }
        newInfra.Annotations[azuretypes.NetworkLayoutZoneMigrationAnnotation] = helper.InfrastructureZoneToString(z.Name)
        return nil
    }
}
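
To see the effect of this loop in isolation, here is a minimal, self-contained sketch. The `zone` type, the `annotateMigratedZone` function, and the CIDR values are hypothetical stand-ins for illustration and are not the extension's actual types; only the annotation key is taken from this issue. It assumes zone names are integers, as the call to `helper.InfrastructureZoneToString` suggests.

```go
package main

import (
	"fmt"
	"strconv"
)

// zoneMigrationAnnotation is the annotation key quoted in this issue.
const zoneMigrationAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// zone is a hypothetical, simplified stand-in for one entry of the
// provider config's networks.zones list.
type zone struct {
	Name int
	CIDR string
}

// annotateMigratedZone mirrors the loop above: the first zone whose CIDR
// equals the old single-subnet workers CIDR is recorded in the annotation.
// If no zone matches, the annotations are returned unchanged.
func annotateMigratedZone(annotations map[string]string, oldWorkersCIDR string, zones []zone) map[string]string {
	for _, z := range zones {
		if z.CIDR == oldWorkersCIDR {
			if annotations == nil {
				annotations = make(map[string]string)
			}
			annotations[zoneMigrationAnnotation] = strconv.Itoa(z.Name)
			return annotations
		}
	}
	return annotations
}

func main() {
	// Illustrative CIDRs only: zone 3 reuses the old workers range.
	zones := []zone{
		{Name: 1, CIDR: "10.250.32.0/19"},
		{Name: 2, CIDR: "10.250.64.0/19"},
		{Name: 3, CIDR: "10.250.0.0/19"},
	}
	annotations := annotateMigratedZone(nil, "10.250.0.0/19", zones)
	fmt.Println(annotations[zoneMigrationAnnotation])
}
```

In this sketch the annotation is only ever written when the old and new provider configs are compared, which is why a freshly created resource on the destination seed would start without it.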

In this case, this mutating code did not get executed because of the following:

  1. As part of normal reconciliation of the infrastructure resource, its .status.providerStatus field is saved in .status.state.savedProviderStatus.
  2. During the migrate phase of CPM, gardenlet takes this .status.state.savedProviderStatus and saves it in the ShootState.
  3. During the restore phase of CPM, gardenlet creates an infrastructure resource in the destination seed, then copies the .status.state.savedProviderStatus from the ShootState into the infrastructure's .status.state.savedProviderStatus field.
  4. Afterwards, gardenlet annotates the infrastructure resource with gardener.cloud/operation: restore to trigger restoration.
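
The four steps above can be modelled roughly as follows. This is a simplified sketch, not gardenlet's code: the `infrastructure` and `shootState` types and the `reconcile`, `migrate`, and `restore` functions are hypothetical, and the statuses are plain strings instead of raw extensions. It assumes, as the observed behaviour suggests, that only the saved state travels through the ShootState and that annotations on the source resource do not.

```go
package main

import "fmt"

// infrastructure is a hypothetical, heavily simplified stand-in for the
// Infrastructure resource.
type infrastructure struct {
	Annotations         map[string]string
	ProviderStatus      string // stands in for .status.providerStatus
	SavedProviderStatus string // stands in for .status.state.savedProviderStatus
}

// shootState is a hypothetical stand-in for the infrastructure entry that
// gardenlet keeps in the ShootState.
type shootState struct {
	InfrastructureState string
}

// reconcile models step 1: the provider status is saved into the state.
func reconcile(infra *infrastructure) {
	infra.SavedProviderStatus = infra.ProviderStatus
}

// migrate models step 2: the saved state is copied into the ShootState.
func migrate(infra infrastructure) shootState {
	return shootState{InfrastructureState: infra.SavedProviderStatus}
}

// restore models steps 3 and 4: a new resource is created on the destination
// seed, the saved state is copied back, and the restore operation annotation
// is set. Assumption: nothing else is carried over from the source resource.
func restore(s shootState) infrastructure {
	return infrastructure{
		Annotations:         map[string]string{"gardener.cloud/operation": "restore"},
		SavedProviderStatus: s.InfrastructureState,
	}
}

func main() {
	const zoneAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"
	source := infrastructure{
		Annotations:    map[string]string{zoneAnnotation: "3"},
		ProviderStatus: `{"networks":{"layout":"MultipleSubnet"}}`,
	}
	reconcile(&source)
	restored := restore(migrate(source))

	_, hasZone := restored.Annotations[zoneAnnotation]
	fmt.Println("saved status restored:", restored.SavedProviderStatus == source.ProviderStatus)
	fmt.Println("zone annotation present:", hasZone)
}
```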

During the updates to the infrastructure resource in steps 3 and 4, the mutating webhook does not make any changes, as it exits early due to these checks:

if oldInfra.Status.ProviderStatus != nil {
    oldProviderStatus, err = helper.InfrastructureStatusFromRaw(oldInfra.Status.ProviderStatus)
    if err != nil {
        return fmt.Errorf("could not mutate object: %v", err)
    }
}
// take care of clusters that have not been reconciliated for a long time (hibernated etc). In this case they may
// not have the Layout field populated.
if oldProviderStatus != nil &&
    oldProviderStatus.Networks.Layout != "" &&
    oldProviderStatus.Networks.Layout != azure.NetworkLayoutSingleSubnet {
    return nil
}

Even if .status.providerStatus were patched with the value from .status.state.savedProviderStatus, the mutating webhook would still not make any changes, because .status.providerStatus would contain the following:

  "providerStatus": {
    "apiVersion": "azure.provider.extensions.gardener.cloud/v1alpha1",
    "availabilitySets": [],
    "kind": "InfrastructureStatus",
    "networks": {
      "layout": "MultipleSubnet",

Hence nil is returned here:

    oldProviderStatus.Networks.Layout != azure.NetworkLayoutSingleSubnet {
    return nil
}
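
The guard can be exercised on its own with a small sketch. The types and the `skipsMutation` function below are hypothetical simplifications of the quoted check; the value "MultipleSubnet" comes from the provider status shown above, while "SingleSubnet" is an assumed value for `azure.NetworkLayoutSingleSubnet`.

```go
package main

import "fmt"

const (
	// layoutSingleSubnet is an assumed value for azure.NetworkLayoutSingleSubnet.
	layoutSingleSubnet = "SingleSubnet"
	// layoutMultipleSubnet is the layout shown in the providerStatus above.
	layoutMultipleSubnet = "MultipleSubnet"
)

// networkStatus and infrastructureStatus are hypothetical, simplified
// stand-ins for the parsed provider status.
type networkStatus struct {
	Layout string
}

type infrastructureStatus struct {
	Networks networkStatus
}

// skipsMutation reproduces only the second check quoted above: the webhook
// returns early when a status exists and its layout is set to anything other
// than the single-subnet layout.
func skipsMutation(old *infrastructureStatus) bool {
	return old != nil &&
		old.Networks.Layout != "" &&
		old.Networks.Layout != layoutSingleSubnet
}

func main() {
	cases := []struct {
		name   string
		status *infrastructureStatus
	}{
		{"no status", nil},
		{"empty layout", &infrastructureStatus{}},
		{"single subnet", &infrastructureStatus{Networks: networkStatus{Layout: layoutSingleSubnet}}},
		{"multiple subnets", &infrastructureStatus{Networks: networkStatus{Layout: layoutMultipleSubnet}}},
	}
	for _, c := range cases {
		fmt.Printf("%-16s skips mutation: %v\n", c.name, skipsMutation(c.status))
	}
}
```

Under these assumptions, a restored status with layout "MultipleSubnet" always takes the early return, so the annotation loop is never reached.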

What you expected to happen:
Cluster to be restored successfully.

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/control-plane-migration Control plane migration related kind/bug Bug platform/azure Microsoft Azure platform/infrastructure labels Apr 14, 2024
@plkokanov plkokanov changed the title [CPM] Restoration of cluster fails if it's infrastructure resource on the source seed is annotated with migration.azure.provider.extensions.gardener.cloud/zone [CPM] Restoration of cluster fails if it's Infrastructure resource on the source Seed is annotated with migration.azure.provider.extensions.gardener.cloud/zone Apr 14, 2024
@plkokanov plkokanov changed the title [CPM] Restoration of cluster fails if it's Infrastructure resource on the source Seed is annotated with migration.azure.provider.extensions.gardener.cloud/zone [CPM] Restoration of cluster fails if it's Infrastructure resource on the source Seed was annotated with migration.azure.provider.extensions.gardener.cloud/zone Apr 14, 2024
@plkokanov
Contributor Author

/assign
