[BUG] Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout #717
Comments
At the moment I have the same issue in my implementation. |
@d4rkhunt33r Thank you for reporting this. Can you share a bit more about the environment setup? Did you use our helm chart? How much cpu/memory did the hubagent use when you created "a lot of objects"? What is the cpu/memory limit on the EKS api-server? @Haladinoq I wonder what you mean by "in your implementation"? |
@d4rkhunt33r You can increase the hubagent's performance by adding the following parameter to your helm chart with a value like the one below. --set concurrentClusterPlacementSyncs=20 |
@ryanzhang-oss thank you so much for your assistance. I will try to increase those values. You can see in the log message that the error is related to the ClusterResourceSnapshot list failing to return a response.
|
This is another message that I think can be useful:
|
Thanks. I wonder if you can share the entire log and cluster settings? 30 seconds is a very long time, so I am not sure we should relax that. It seems that the API server on the hub cluster is very overloaded? Can you share its usage? |
Hi, I am sorry for the delay, but I was finishing some tasks for the quarter. The reason the API takes so long to retrieve the ClusterResourceSnapshots is the amount of objects and the load on the server (a very busy one). I think it would be very helpful to have a parameter to control the timeout before the controller fails due to not getting a response from the server. Next week I will try to look at the code and see if I can find the place where this could be implemented, and study the project a little more to see if this is already possible and I am missing something. Sorry, English is not my native language. |
@ryanzhang-oss Good morning, I hope you are having a great day. I am still looking at the issue, and I was not able to find a way to increase the Kubernetes client timeout nor the CacheSyncTimeout. I was hoping this could be added to the project in case someone already has a large cluster and the cache sync could take more than two minutes.
|
I was able to change the CacheSyncTimeout by adding it, but I am still seeing the error related to the timeout when listing resources.
Is there a place in the code where I can change the 60-second timeout for requests to the Kubernetes API server? |
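For context, one place a client-side request limit can be raised is the rest.Config used to build the controller-runtime manager. The sketch below is a minimal illustration and an assumption about where such a limit could live in the hubagent, not the actual Fleet code; note that a 60-second limit can also come from the API server's own default request timeout, which a client-side change would not affect.

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Load the kubeconfig / in-cluster config the usual controller-runtime way.
	cfg := ctrl.GetConfigOrDie()

	// Hypothetical tweak: rest.Config.Timeout is the client-side cap applied to
	// requests made by clients built from this config (zero means no timeout).
	cfg.Timeout = 5 * time.Minute

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = mgr // the manager would normally be started with mgr.Start(...)
}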
Hi, I was able to find the issue. If you create a ClusterResourcePlacement selecting a lot of namespaces to propagate, the resulting ClusterResourceSnapshots are too large for my Kubernetes cluster to parse. That means the timeout was server-side, as you can see in the following log from my Kubernetes cluster.
It is basically a timeout parsing the JSON output for the request to ClusterResourceSnapshot. To fix this, I changed the pagination from 500 ClusterResourceSnapshots to 100 (a sketch of how a list can be paged this way is included after the code below), but then I found another issue: the ClusterResourcePlacement was so large that updating the object failed due to the byte limit on the post operation. I still think it could be a good idea to change the CacheSyncTimeout for the manager in order to give it more time to sync caches. My code is this:

mgr, err := ctrl.NewManager(config, ctrl.Options{
Scheme: scheme,
Controller: ctrloption.Controller{
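        // Extend the per-controller cache sync timeout beyond the
        // controller-runtime default (2 minutes) so large lists can finish.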
CacheSyncTimeout: 300 * time.Second,
},
Cache: cache.Options{
SyncPeriod: &opts.ResyncPeriod.Duration,
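        // cacheHttpClient is assumed to be an *http.Client defined elsewhere
        // with a longer timeout for the cache's list/watch requests.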
HTTPClient: cacheHttpClient,
},
LeaderElection: opts.LeaderElection.LeaderElect,
LeaderElectionID: opts.LeaderElection.ResourceName,
LeaderElectionNamespace: opts.LeaderElection.ResourceNamespace,
LeaderElectionResourceLock: opts.LeaderElection.ResourceLock,
HealthProbeBindAddress: opts.HealthProbeAddress,
Metrics: metricsserver.Options{
BindAddress: opts.MetricsBindAddress,
},
WebhookServer: ctrlwebhook.NewServer(ctrlwebhook.Options{
Port: FleetWebhookPort,
CertDir: FleetWebhookCertDir,
}),
})

Adding the CacheSyncTimeout setting allowed the controllers to have more time to finish syncing caches, and no more restarts happened. |
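As referenced above, here is a minimal sketch of paging a ClusterResourceSnapshot list with a controller-runtime client. The API group/version/kind, the 100-item page size, and the overall structure are assumptions for illustration only and are not the actual Fleet list code path.

package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	continueToken := ""
	for {
		// Unstructured list so the sketch does not depend on the Fleet API
		// packages; the group/version/kind below is an assumption.
		page := &unstructured.UnstructuredList{}
		page.SetGroupVersionKind(schema.GroupVersionKind{
			Group:   "placement.kubernetes-fleet.io",
			Version: "v1beta1",
			Kind:    "ClusterResourceSnapshotList",
		})

		// Fetch at most 100 snapshots per request instead of one huge list call.
		opts := &client.ListOptions{Limit: 100, Continue: continueToken}
		if err := c.List(context.Background(), page, opts); err != nil {
			panic(err)
		}
		fmt.Printf("fetched %d snapshots\n", len(page.Items))

		continueToken = page.GetContinue()
		if continueToken == "" {
			break
		}
	}
}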
So my next step is to create a ClusterResourcePlacement for each namespace, in the hope of not crashing the server due to the size of the snapshots. |
@d4rkhunt33r thank you for the updates. Just curious how you're using CRP. Are you selecting multiple applications using one CRP? It's better to put one application's configuration into one CRP, so that the size of the selected resources stays under control.
I think it's a fair request. |
CRP is not really designed for this type of scenario unless you are placing the resources on a different cluster each time. I wonder if you might have looked at etcd backup? |
Hi, @d4rkhunt33r, I wonder if you will place each namespace on a different cluster or just copy all things on cluster A to cluster B? |
@ryanzhang-oss yes, my objective is to maintain two clusters with pretty much all namespaces and their objects synchronized, in order to have an active/active backup of my Istio control plane. I was able to create the 9000 ClusterResourcePlacements, and the process to sync the objects between clusters took 5 hours for the first run. Pretty much all objects were synchronized correctly, and it takes about a minute to reflect a change made in one cluster on the replica. Now I am curious about the observability of the project. Is there a way to know, using metrics, whether both clusters are in sync, or which objects were not synced? |
Yes, initially I was selecting 9000 namespaces and all their objects in one ClusterResourcePlacement, and it broke the app :). Now I am creating one ClusterResourcePlacement for each namespace, and the application was able to handle the process. About etcd backups: I am using another open source project called Velero. The thing is that a backup gets old very quickly, and Fleet is something I am hoping can give me something more real-time or semi-real-time. |
Describe the bug
Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout
Environment
Please provide the following:
The hub cluster is an AWS EKS cluster.
The hubagent was installed with the following values:
To Reproduce
Steps to reproduce the behavior:
You should see the hubagent container rebooting, showing the following errors in the logs: