
Live Control Plane Migration (CPM) or CPM with zero downtime #10686

Open
acumino opened this issue Oct 21, 2024 · 4 comments
Labels
area/open-source: Open Source (community, enablement, contributions, conferences, CNCF, etc.) related
kind/enhancement: Enhancement, improvement, extension
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@acumino
Member

acumino commented Oct 21, 2024

How to categorize this issue?

/area open-source
/kind enhancement

What would you like to be added:
Currently, shoot control plane migrations cause temporary downtime for the shoot cluster because ETCD needs to be backed up and deleted before being restored in the new seed cluster. During this time, the API server, along with all other control plane components, is also taken down. Although the workload within the shoot cluster continues running, it cannot be reconciled, scaled, or updated, leading to downtime since the control plane is unavailable to users.

We would like to support live Control Plane Migration (CPM), allowing migrations to happen without causing downtime for the API server, thereby preventing downtime for users. This ensures that the shoot cluster remains fully operational, with continuous availability of the control plane for its users.

We (@acumino, @shafeeqes, and @ary1992) conducted a POC on this, and it is feasible to implement. More details can be found here.

Why is this needed:
Prevent downtime during control plane migration, enabling support for more use cases and scenarios, such as seed draining/shoot evictions or streamlined seed cluster deletions.

gardener-prow bot added the area/open-source and kind/enhancement labels on Oct 21, 2024
@acumino
Member Author

acumino commented Oct 21, 2024

Live CPM POC/Status

Main problems

  • ETCD migration
  • DNS switching
  • VPN migration

ETCD Migration -

Approaches evaluated -

DNS switching -

  • Since there is only one underlying etcd cluster, there is no issue with either of the kube-apiservers serving the requests.

VPN migration -

The VPN should be available during the whole migration so that webhooks in the Shoot remain reachable.

Actual Flow

As of now, two gardenlets work in parallel on one shoot, coordinated by a set of annotations such as the following (see the sketch after this list):

  • shoot.gardener.cloud/source-second-stage-ready
  • shoot.gardener.cloud/target-second-stage-ready
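
The annotation keys are quoted from the POC above; the gating semantics (which gardenlet waits on which annotation) and all helper names below are assumptions for illustration, not the actual gardenlet code. A minimal sketch in Go:

```go
// Minimal sketch: gate second-stage migration steps on the coordination
// annotations listed above. Which side waits on which annotation is an
// assumption for illustration; helper names are hypothetical.
package migration

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
)

const (
	annotationSourceSecondStageReady = "shoot.gardener.cloud/source-second-stage-ready"
	annotationTargetSecondStageReady = "shoot.gardener.cloud/target-second-stage-ready"
)

// sourceMayProceed reports whether the source gardenlet may continue with its
// part of the flow, i.e. the target side has signalled second-stage readiness.
func sourceMayProceed(shoot *gardencorev1beta1.Shoot) bool {
	_, ok := shoot.Annotations[annotationTargetSecondStageReady]
	return ok
}

// targetMayProceed reports whether the target gardenlet may continue, i.e. the
// source side has signalled second-stage readiness.
func targetMayProceed(shoot *gardencorev1beta1.Shoot) bool {
	_, ok := shoot.Annotations[annotationSourceSecondStageReady]
	return ok
}
```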

The following diagram explains the flow in more detail:

Things that will/can change during implementation

  • Instead of a LoadBalancer service for each etcd pod, we can use Istio (ServiceEntries and other resources); see the sketch below.
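
To make the Istio idea concrete, here is a minimal sketch of a ServiceEntry for a remote etcd peer endpoint, built as an unstructured object; the name, namespace, host, and address are made-up placeholders, and the STATIC resolution is an assumption rather than the POC's actual choice:

```go
// Minimal sketch: an Istio ServiceEntry exposing a remote etcd peer endpoint
// inside the mesh, instead of a LoadBalancer service per etcd pod. All names,
// the host, and the address are made-up placeholders.
package istiosketch

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func etcdPeerServiceEntry() *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "networking.istio.io/v1beta1",
		"kind":       "ServiceEntry",
		"metadata": map[string]interface{}{
			"name":      "etcd-main-peer-remote", // hypothetical name
			"namespace": "shoot--foo--bar",       // hypothetical namespace
		},
		"spec": map[string]interface{}{
			"hosts": []interface{}{"etcd-main-2.example.internal"}, // made-up host
			"ports": []interface{}{
				map[string]interface{}{"number": int64(2380), "name": "tcp-peer", "protocol": "TCP"},
			},
			"resolution": "STATIC",
			"location":   "MESH_INTERNAL",
			"endpoints": []interface{}{
				map[string]interface{}{"address": "10.0.0.12"}, // made-up endpoint address
			},
		},
	}}
}
```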

Points for Discussion

  • ETCD main name change
  • Should we redirect source Istio traffic to the destination Istio before DNS migration (and is there a way to do so)?
  • Disabling reverse lookup in etcd by using the --experimental-peer-skip-client-san-verification flag.
  • What does the shoot status look like during migration?

Manual Work (Not implemented)

  • Removal of the member from etcd (see the sketch after this list)
  • Adapting seed-authorizer and seed-restriction
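
For the member-removal step, a minimal sketch of what the manual operation could look like with the official etcd Go client; the endpoints and member name are placeholders, and in the POC this step was performed by hand:

```go
// Minimal sketch: remove the stale source-seed member from the etcd cluster
// using go.etcd.io/etcd/client/v3. Endpoints and member name are placeholders.
package etcdcleanup

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func removeMemberByName(ctx context.Context, endpoints []string, memberName string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Look up the member ID by name, then remove it from the cluster.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Members {
		if m.Name == memberName {
			_, err := cli.MemberRemove(ctx, m.ID)
			return err
		}
	}
	return fmt.Errorf("member %q not found", memberName)
}
```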

Changes required in repos -

  • etcd-wrapper [Changes]
    • disabling reverse lookup
  • etcd-backup-restore [Changes]
    • allow adding multiple URLs for peers and clients
    • allow working with given URLs instead of parsing a string
  • etcd-druid [Changes] (an illustrative sketch of such fields follows this list)
    • adapt the Etcd object to
      • allow peer, initial, and client URLs
      • allow a service endpoint
      • allow disabling reverse lookup
  • ext-authz-server [Changes]
    • allow temporary vpn-seed-server services
  • gardener-extension-provider-aws [Changes]
    • shallow deletion of DNS records
  • gardener [Changes]
    • adapting the flow
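
To make the etcd-druid item more concrete, here is a purely illustrative sketch of how such additional knobs could be shaped; these fields are hypothetical and do not exist in the etcd-druid API today:

```go
// Purely illustrative: hypothetical additions to the druid Etcd spec that the
// POC items above call for (multiple advertised URLs, an external service
// endpoint, and a switch for reverse lookups). These fields do NOT exist in
// etcd-druid; see the [Changes] links above for the actual adaptations.
package druidsketch

type EtcdMigrationSettings struct {
	// Additional peer/initial/client URLs advertised to remote members.
	PeerURLs    []string `json:"peerURLs,omitempty"`
	InitialURLs []string `json:"initialURLs,omitempty"`
	ClientURLs  []string `json:"clientURLs,omitempty"`

	// Externally reachable endpoint (e.g. via Istio) for cross-seed traffic.
	ServiceEndpoint *string `json:"serviceEndpoint,omitempty"`

	// Disables reverse lookup, e.g. by passing
	// --experimental-peer-skip-client-san-verification to etcd.
	DisableReverseLookup bool `json:"disableReverseLookup,omitempty"`
}
```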

Limitations

  • We are seeing close to 5-15 seconds of unavailability of KAPI during DNS migration.
  • Latency (distantly located regions)
    • ETCD
      • The new ETCD member catches up as a learner only over a very long period of time (not consistent)
      • ETCD throws errors -
        • Not in sync with leader
          • 2024-05-07 09:23:59.320872 W | etcdserver: waiting for ReadIndex response took too long, retrying (sent-request-id: 5398284687236073758, retry-timeout: 500ms)
          • 2024-05-07 09:23:59.748703 W | etcdserver: ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader (request ID want 5398284687236073758, got 5398284687236073757)
        • Timeouts for range requests
          • 2024-05-07 10:11:43.229551 W | etcdserver: read-only range request "key:\"foo\" " with result "range_response_count:0 size:5" took too long (137.889387ms) to execute
    • KAPI (after ETCD becomes a 6-member cluster)
      • Already running replicas are fine
      • If a pod restarts, KAPI fails to become ready
  • Old logs and metrics will be lost

Data

  • ETCD
    • (ETCD metrics screenshots, including "Screenshot 2024-07-11 at 1.06.10 PM", not reproduced here)

  • KAPI downtime -
    • kube-apiserver is healthy again after 11 seconds of downtime. (eu-west-1 and eu-west-1)
    • kube-apiserver is healthy again after 6 seconds of downtime. (eu-west-1 and eu-west-1)
    • kube-apiserver is healthy again after 4 seconds of downtime. (eu-west-1 and eu-west-1)
    • kube-apiserver is healthy again after 1 seconds of downtime. (eu-west-1 and eu-north-1)
    • kube-apiserver is healthy again after 0 seconds of downtime. (eu-west-1 and eu-west-1)
    • kube-apiserver is healthy again after 13 seconds of downtime. (eu-west-1 and eu-north-1)
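
These numbers were presumably collected by probing the kube-apiserver; below is a minimal sketch of such a probe, assuming a plain /healthz poller with a one-second interval (not necessarily the tooling used for the measurements above):

```go
// Minimal sketch: measure kube-apiserver unavailability by polling /healthz
// once per second and printing the outage duration once it recovers. The
// endpoint, interval, and skipped TLS verification are assumptions.
package downtime

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func measureDowntime(healthzURL string) {
	client := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}, // probe only
	}

	var downSince time.Time
	for range time.Tick(time.Second) {
		resp, err := client.Get(healthzURL)
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}

		switch {
		case !healthy && downSince.IsZero():
			downSince = time.Now()
		case healthy && !downSince.IsZero():
			fmt.Printf("kube-apiserver is healthy again after %d seconds of downtime.\n",
				int(time.Since(downSince).Seconds()))
			downSince = time.Time{}
		}
	}
}
```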

@acumino
Member Author

acumino commented Oct 21, 2024

KAPI downtime was eliminated (achieving zero downtime) by introducing a delay in deleting the source KAPI after migrating the DNS record.
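
A minimal sketch of that ordering, assuming the DNS record has already been switched to the target seed and the source kube-apiserver runs as a Deployment named kube-apiserver; the function name and the propagation delay are illustrative, not the actual gardener flow:

```go
// Minimal sketch of the zero-downtime ordering described above: after the DNS
// record points to the target seed, keep the source kube-apiserver around for
// a propagation delay, then delete it. Names and the delay are illustrative.
package migration

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deleteSourceKAPIAfterDNSSwitch(ctx context.Context, sourceSeed kubernetes.Interface, namespace string, propagationDelay time.Duration) error {
	// 1. The kube-apiserver DNS record already points to the target seed
	//    (handled earlier in the flow).

	// 2. Keep the source kube-apiserver serving until clients and resolvers
	//    have picked up the new record.
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(propagationDelay):
	}

	// 3. Only now remove the source kube-apiserver deployment.
	return sourceSeed.AppsV1().Deployments(namespace).Delete(ctx, "kube-apiserver", metav1.DeleteOptions{})
}
```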

acumino added the lifecycle/frozen label on Oct 21, 2024
@adenitiu
Contributor

@acumino would you be able to provide an approximate timeline for when this would be available and productive?

@acumino
Member Author

acumino commented Oct 21, 2024

@acumino would you be able to provide an approximate timeline for when this would be available and productive?

@adenitiu The work on this is currently paused to prioritize InPlace. The work will probably resume in Q2 2025; we don't have an exact timeline.
