
CRD upgrade race condition #10032

Closed
@g-gaston

Description


What steps did you take and what happened?

Upgrade capi components with clusterctl from v1.5.2 to v1.6.0

clusterctl upgrade apply completes successfully. However, on the next change I make to the cluster, reconciliation for Machines gets stuck due to this error when reading the provider machine object:

"Reconciler error" err="failed to retrieve DockerMachine external object "my-ns"/"m-docker-etcd-1705429991949-cpfqz": failed to get restmapping: failed to get API group resources: unable to retrieve the complete list of server APIs: infrastructure.cluster.x-k8s.io/v1alpha4: the server could not find the requested resource" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="my-ns/m-docker-etcd-fr7bh" namespace="my-ns" name="m-docker-etcd-fr7bh"

This error repeats in a loop and is not resolved until the container is restarted. Once the container is restarted, reconciliation continues without any further issue.

This error also appears (sometimes) in the MachineSet and MachineDeployment controllers.

Note: I haven't been able to replicate this using the official capi artifacts, only using EKS-A's fork. I believe this is just due to the nature of the race condition and not because of a difference in code (I haven't found a single patch in the fork touching the code path involved in this issue).

What did you expect to happen?

All controllers should be able to continue reconciliation after upgrading with clusterctl.

Cluster API version

v1.5.2 to v1.6.0

Kubernetes version

1.27

Anything else you would like to add?

TLDR: capi v1.6.0 marks the v1alpha4 APIs as not served. The problem comes from controller-runtime caching the result of the call to list APIGroups. If this call returns v1alpha4 as one of the available versions for infrastructure.cluster.x-k8s.io, the client will try to get the APIResource definitions for v1alpha4. If that call is made after the API is marked as not served, the client receives a not-found error. Since v1alpha4 has already been cached as available, the client keeps making the same call and receiving the same error forever.

This is the stack for the previously mentioned error:

sigs.k8s.io/controller-runtime/pkg/client/apiutil.(*mapper).fetchGroupVersionResources
  sigs.k8s.io/[email protected]/pkg/client/apiutil/restmapper.go:294
sigs.k8s.io/controller-runtime/pkg/client/apiutil.(*mapper).addKnownGroupAndReload
  sigs.k8s.io/[email protected]/pkg/client/apiutil/restmapper.go:191
sigs.k8s.io/controller-runtime/pkg/client/apiutil.(*mapper).RESTMapping
  sigs.k8s.io/[email protected]/pkg/client/apiutil/restmapper.go:122
sigs.k8s.io/controller-runtime/pkg/client/apiutil.IsGVKNamespaced
  sigs.k8s.io/[email protected]/pkg/client/apiutil/apimachinery.go:96
sigs.k8s.io/controller-runtime/pkg/client/apiutil.IsObjectNamespaced
  sigs.k8s.io/[email protected]/pkg/client/apiutil/apimachinery.go:90
sigs.k8s.io/controller-runtime/pkg/cache.(*multiNamespaceCache).Get
  sigs.k8s.io/[email protected]/pkg/cache/multi_namespace_cache.go:202
sigs.k8s.io/controller-runtime/pkg/cache.(*delegatingByGVKCache).Get
  sigs.k8s.io/[email protected]/pkg/cache/delegating_by_gvk_cache.go:44
sigs.k8s.io/controller-runtime/pkg/client.(*client).Get
  sigs.k8s.io/[email protected]/pkg/client/client.go:348
sigs.k8s.io/cluster-api/controllers/external.Get
  sigs.k8s.io/cluster-api/controllers/external/util.go:43
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).reconcileExternal
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller_phases.go:106
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).reconcileInfrastructure
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller_phases.go:256
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).reconcile
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller.go:297
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).Reconcile
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller.go:222
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
  sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
runtime.goexit
  runtime/asm_amd64.s:1598

The first time a resource from a particular group is requested, the client's restmapper detects that the group hasn't been seen before and fetches the APIResource definitions for all available versions of that group. The restmapper caches both the group and its versions.

If there is an error trying to get one of the available versions, the restmapper aborts and returns an error. The restmapper doesn't distinguish between different errors, so a 404 results in this call failing until either the group is re-created or the cache is invalidated (and the only way to do that is to restart the program).

The issue here is that if the first call to get the available versions returns a version that has been (or will soon be) deleted, the client becomes unable to make requests for any resource of that group, regardless of the version.

My hypothesis is that the first call to get the APIGroup is made either just before the CRDs are updated, or immediately after but before the kube-apiserver's discovery cache is refreshed. That call still returns v1alpha4. Immediately afterwards, the restmapper's call to get the APIResources for infrastructure.cluster.x-k8s.io/v1alpha4 returns a not-found error, since that version has already been marked as not served in the CRD.
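
To make the failure mode concrete, here is a deliberately simplified model of that caching behavior. This is not the actual controller-runtime code, just an illustration of why a single 404 during version discovery poisons every later lookup for the group:

```go
// Simplified model of the lazy restmapper behavior described above. This is
// NOT the actual controller-runtime code, just an illustration of why a
// single 404 during version discovery poisons every later lookup for the group.
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the apiserver's "could not find the requested
// resource" response once a version is no longer served.
var errNotFound = errors.New("the server could not find the requested resource")

type fakeMapper struct {
	// knownVersions caches the versions returned by the initial APIGroup
	// discovery call; it is never refreshed after a failure.
	knownVersions map[string][]string
	// served is what the apiserver would actually answer per version.
	served map[string]bool
}

func (m *fakeMapper) restMapping(group string) error {
	// First lookup for the group: discover its versions and cache them.
	if _, ok := m.knownVersions[group]; !ok {
		// The race: discovery still lists v1alpha4 because the CRD update
		// has not propagated to the apiserver's discovery endpoint yet.
		m.knownVersions[group] = []string{"v1beta1", "v1alpha4"}
	}
	// Fetch the APIResources for every cached version; any error aborts.
	for _, v := range m.knownVersions[group] {
		if !m.served[v] {
			return fmt.Errorf("failed to get API group resources: %s/%s: %w", group, v, errNotFound)
		}
	}
	return nil
}

func main() {
	m := &fakeMapper{
		knownVersions: map[string][]string{},
		// By the time the per-version call happens, v1alpha4 is not served.
		served: map[string]bool{"v1beta1": true, "v1alpha4": false},
	}
	// Every reconcile attempt hits the same cached, stale version and fails.
	for i := 0; i < 3; i++ {
		fmt.Println(m.restMapping("infrastructure.cluster.x-k8s.io"))
	}
}
```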

IMO the long-term solution is to not fail when a cached group version is not found. This has actually already been implemented in controller-runtime and released as part of v0.17.0, but since it's marked as a breaking change, it doesn't seem like it will be backported to v0.16. We have already bumped controller-runtime to v0.17.0, but that bump won't be backported to our v1.6 branch, so we can't leverage this fix there.
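
For reference, the gist of that change, applied to the toy fakeMapper above, is to drop a no-longer-served version from the cache instead of failing the whole group. Again a sketch of the idea, not the controller-runtime implementation:

```go
// Sketch of the tolerant behavior on the toy fakeMapper above: forget a
// cached version that is no longer served instead of erroring out forever.
// This mirrors the idea of the controller-runtime v0.17 change, not its code.
func (m *fakeMapper) restMappingTolerant(group string) error {
	if _, ok := m.knownVersions[group]; !ok {
		m.knownVersions[group] = []string{"v1beta1", "v1alpha4"}
	}
	var kept []string
	for _, v := range m.knownVersions[group] {
		if !m.served[v] {
			// Stale version: drop it from the cache and keep going.
			continue
		}
		kept = append(kept, v)
	}
	m.knownVersions[group] = kept
	if len(kept) == 0 {
		return fmt.Errorf("no served versions found for group %q", group)
	}
	return nil
}
```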

I propose updating external.Get to identify this situation and restart the controller when it happens. This restart should happen at most once (immediately after a CRD upgrade), and I suspect it won't be frequent, since so far I seem to be the only one who has hit this race condition.
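
A rough sketch of what that mitigation could look like. The helper names are hypothetical and the string matching is purely illustrative; a real implementation would want a more precise check on the wrapped discovery error:

```go
// Hypothetical helpers for the proposed mitigation, not existing cluster-api
// code: if external.Get hits the "cached version no longer served" failure,
// exit the process so the container restarts with a fresh restmapper cache.
package external

import (
	"os"
	"strings"

	"k8s.io/klog/v2"
)

// isStaleDiscoveryError guesses, via string matching, whether err is the
// failure described in this issue. Illustrative only; a real check should
// inspect the wrapped discovery error instead of the message text.
func isStaleDiscoveryError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "failed to get API group resources") &&
		strings.Contains(msg, "the server could not find the requested resource")
}

// restartOnStaleDiscovery would be called from external.Get's error path.
func restartOnStaleDiscovery(err error) {
	if isStaleDiscoveryError(err) {
		klog.ErrorS(err, "restmapper cache contains a no-longer-served API version, restarting to rebuild it")
		os.Exit(1)
	}
}
```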

In addition, we could change the upgrade logic in clusterctl to:

  1. Update all the CRDs (for core and all providers)
  2. Wait for the API server to return only the served versions for all groups.
  3. Only then scale the controller deployments back up.

This wouldn't be enough to guarantee the issue never happens, so I don't think it can be an alternative to restarting the controller when the issue is detected. Given that this change is more involved and a fix is only required as a short-term solution for the v1.6 releases, I would vote to implement only the first change for now.
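
For completeness, step 2 could be implemented with a discovery poll along these lines. The function name is hypothetical and this is not existing clusterctl code:

```go
// Hypothetical sketch of "wait until the apiserver stops advertising a stale
// version" using a client-go discovery client; not existing clusterctl code.
package upgrade

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

// waitForStaleVersionGone polls discovery until group no longer advertises
// staleVersion (e.g. infrastructure.cluster.x-k8s.io / v1alpha4).
func waitForStaleVersionGone(ctx context.Context, dc discovery.DiscoveryInterface, group, staleVersion string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			groups, err := dc.ServerGroups()
			if err != nil {
				// Discovery can be briefly unavailable mid-upgrade; retry.
				return false, nil
			}
			for _, g := range groups.Groups {
				if g.Name != group {
					continue
				}
				for _, v := range g.Versions {
					if v.Version == staleVersion {
						// Still advertised; keep waiting.
						return false, nil
					}
				}
			}
			return true, nil
		})
}
```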

Label(s) to be applied

/kind bug
/area clusterctl

Metadata

Labels

area/clusterctl, kind/bug, priority/important-soon, triage/accepted
