Unexpected failover behavior in karmada #5309
-
Karmada deleted part of the resources in a member cluster when that cluster was down for about 2 minutes, and then recreated all of the resources after the cluster recovered. I tried to explain this with failover and to reproduce it, but failover and graceful eviction in Karmada require that:
Karmada wait at least failoverTimeout (5 min) + tolerationSeconds (5 min) before failover actually happens, which doesn't match the actual recovery time of about 2 minutes. After searching for quite a while I still cannot find a reasonable cause for this. Any suggestions?
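For reference, a minimal sketch of the timing argument above, using the values quoted in the report (assumed defaults for this illustration, not values read from a live Karmada install):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Values quoted in the report above (assumed, not read from a running cluster).
	failoverTimeout := 5 * time.Minute   // window before the cluster is considered failed
	tolerationSeconds := 5 * time.Minute // NoExecute toleration before eviction kicks in

	expectedEvictionDelay := failoverTimeout + tolerationSeconds // 10m0s
	observedOutage := 2 * time.Minute

	fmt.Printf("expected delay before eviction: %v\n", expectedEvictionDelay)
	fmt.Printf("observed outage: %v -> the normal failover path should not have fired\n", observedOutage)
}
```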
-
Found a possible explanation for this, related to karmada/pkg/controllers/status/cluster_status_controller.go Lines 440 to 471 in b4b6d69. In my case the cluster doesn't shut down right away but in a graceful way: it first rejects incoming connections and then closes existing ones. During that window it's possible for getAPIEnablements to get partial results, containing only some of the GVKs, and set them into the cluster status. This is equivalent to removing GVKs from the cluster manually. After the NoSchedule taint is added to the cluster, the scheduler immediately reschedules all resource bindings bound to it. Now the issue happens: because the unhealthy cluster no longer appears in the resource bindings, all the existing works are deleted, and so all workloads are gone once the cluster comes back alive. This also explains why only part of the resources were deleted in my case. Maybe worth a look @XiShanYongYe-Chang
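To illustrate the "partial results" part, here is a minimal sketch (not the actual Karmada code, and the kubeconfig path is purely illustrative) of the client-go discovery behavior that can produce an incomplete GVK list: ServerGroupsAndResources returns partial results together with an ErrGroupDiscoveryFailed error when some group versions fail, so a caller that ignores the error can persist an incomplete API enablement list into the cluster status.

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client against a member cluster's kubeconfig (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/member-kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// ServerGroupsAndResources can return *partial* results together with an error.
	// During a graceful API server shutdown some group versions may still answer
	// while others fail, which matches the situation described above.
	_, resources, err := client.ServerGroupsAndResources()
	if err != nil {
		if discovery.IsGroupDiscoveryFailedError(err) {
			// Partial discovery: do NOT treat `resources` as the complete set of
			// enabled GVKs, otherwise the cluster status would look as if GVKs
			// had been removed from the member cluster.
			log.Printf("partial discovery, keeping previous API enablements: %v", err)
			return
		}
		log.Fatal(err)
	}

	fmt.Printf("discovered %d resource lists\n", len(resources))
}
```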