
All GKs are re-synced in cluster cache even if only one fails #520

Open
crenshaw-dev opened this issue May 10, 2023 · 0 comments
The full cluster resource cache is built 1) on startup, 2) every 24 hours (by default, configurable) and 3) every 10 seconds (by default, configurable) if there's an error while building the cache.
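
For illustration, here's a minimal Go sketch of that schedule. The names and structure are hypothetical, not the actual gitops-engine code; it just models the three triggers:

```go
// Minimal sketch of the resync schedule described above. Illustrative
// only -- not the actual gitops-engine implementation.
package sketch

import "time"

const (
	fullResyncInterval = 24 * time.Hour   // trigger (2): periodic full resync, configurable
	errorRetryInterval = 10 * time.Second // trigger (3): retry after any error, configurable
)

// runCacheLoop models the behavior: the first iteration is trigger (1),
// startup; every later iteration waits the long or short interval.
func runCacheLoop(buildCache func() error) {
	for {
		err := buildCache()
		wait := fullResyncInterval
		if err != nil {
			// A failure in ANY one GK marks the whole build failed, so
			// the ENTIRE cache is rebuilt again after only 10 seconds.
			wait = errorRetryInterval
		}
		time.Sleep(wait)
	}
}
```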

The third trigger can be a problem.

Suppose we have a cluster with 100 API group/kinds (GKs). And suppose, for some weird reason, the cluster has a LOT of some particular GK (let's say RoleBindings). When gitops-engine syncs the cluster cache, it will list all the RoleBindings. If the "continue token" (used for pagination) expires while listing all those RoleBindings, gitops-engine will note that the sync is "failed," and 10 seconds later, it will attempt to rebuild the whole cache. Rebuilding the whole cache is incredibly wasteful, especially if all 99 other GKs were successfully cached.
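
For context, here's roughly what that paged LIST looks like with client-go's dynamic client. This is a hedged sketch, not the actual gitops-engine code, but the continue-token mechanics are the same:

```go
// Sketch of a paged LIST for a single GK (RoleBindings here). The
// structure is illustrative; gitops-engine's code differs.
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func listAllRoleBindings(ctx context.Context, dyn dynamic.Interface) error {
	gvr := schema.GroupVersionResource{
		Group:    "rbac.authorization.k8s.io",
		Version:  "v1",
		Resource: "rolebindings",
	}
	opts := metav1.ListOptions{Limit: 500} // fetch in pages of 500
	for {
		page, err := dyn.Resource(gvr).List(ctx, opts)
		if apierrors.IsResourceExpired(err) {
			// HTTP 410 Gone: the continue token expired mid-pagination.
			// Today this failure marks the WHOLE cluster sync failed,
			// even though only this one GK needs to be re-listed.
			return err
		}
		if err != nil {
			return err
		}
		// ... store page.Items in the cluster cache ...
		opts.Continue = page.GetContinue()
		if opts.Continue == "" {
			return nil // all pages consumed
		}
	}
}
```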

The problem can compound itself: by hammering the k8s API with requests for every resource type every 10 seconds, we likely increase k8s response times and make further errors more likely.

To see if you're affected by this problem, search your logs for "Start syncing cluster". It should only happen every 24 hours by default. If you're seeing it more often than that, you're impacted.

I recommend two mitigations (sketched together below):

  1. Only retry the GKs that experienced errors. If we successfully cached the other 99 GKs, don't attempt to reload those items.
  2. Back off retries instead of using a static 10s interval. If the problem is caused by cluster load, retrying less often may alleviate it.
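
Here's a rough sketch of both mitigations combined. `syncGK` is a hypothetical per-GK sync function; gitops-engine doesn't expose anything with this shape today, and the backoff cap is an assumption:

```go
// Sketch: retry only the failed GKs, with exponential backoff between
// rounds. syncGK is hypothetical; intervals mirror the current default.
package sketch

import "time"

func resyncFailedGKs(failed []string, syncGK func(gk string) error) {
	backoff := 10 * time.Second        // start at the current default
	const maxBackoff = 5 * time.Minute // assumed cap, for illustration
	for len(failed) > 0 {
		var stillFailing []string
		for _, gk := range failed {
			// Mitigation 1: re-list only the GKs that actually failed;
			// the successfully cached GKs keep their state.
			if err := syncGK(gk); err != nil {
				stillFailing = append(stillFailing, gk)
			}
		}
		failed = stillFailing
		if len(failed) == 0 {
			return
		}
		// Mitigation 2: back off instead of retrying every 10s, so a
		// struggling API server isn't hammered even harder.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```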