
All GKs are re-synced in cluster cache even if only one fails #520

Open
crenshaw-dev opened this issue May 10, 2023 · 0 comments
The full cluster resource cache is built 1) on startup, 2) every 24 hours (by default, configurable) and 3) every 10 seconds (by default, configurable) if there's an error while building the cache.
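
For illustration, here's a minimal Go sketch of that schedule. The names and structure are hypothetical, not the actual gitops-engine code; it just models the three triggers:

```go
// Minimal sketch of the resync schedule described above. Illustrative
// only -- not the actual gitops-engine implementation.
package sketch

import "time"

const (
	fullResyncInterval = 24 * time.Hour   // trigger (2): periodic full resync, configurable
	errorRetryInterval = 10 * time.Second // trigger (3): retry after any error, configurable
)

// runCacheLoop models the behavior: the first iteration is trigger (1),
// startup; every later iteration waits the long or short interval.
func runCacheLoop(buildCache func() error) {
	for {
		err := buildCache()
		wait := fullResyncInterval
		if err != nil {
			// A failure in ANY one GK marks the whole build failed, so
			// the ENTIRE cache is rebuilt again after only 10 seconds.
			wait = errorRetryInterval
		}
		time.Sleep(wait)
	}
}
```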

The third trigger can be a problem.

Suppose we have a cluster with 100 API group/kinds (GKs). And suppose, for some weird reason, the cluster has a LOT of some particular GK (let's say RoleBindings). When gitops-engine syncs the cluster cache, it will list all the RoleBindings. If the "continue token" (used for pagination) expires while listing all those RoleBindings, gitops-engine will note that the sync is "failed," and 10 seconds later, it will attempt to rebuild the whole cache. Rebuilding the whole cache is incredibly wasteful, especially if all 99 other GKs were successfully cached.
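
For context, here's roughly what that paged LIST looks like with client-go's dynamic client. This is a hedged sketch, not the actual gitops-engine code, but the continue-token mechanics are the same:

```go
// Sketch of a paged LIST for a single GK (RoleBindings here). The
// structure is illustrative; gitops-engine's code differs.
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func listAllRoleBindings(ctx context.Context, dyn dynamic.Interface) error {
	gvr := schema.GroupVersionResource{
		Group:    "rbac.authorization.k8s.io",
		Version:  "v1",
		Resource: "rolebindings",
	}
	opts := metav1.ListOptions{Limit: 500} // fetch in pages of 500
	for {
		page, err := dyn.Resource(gvr).List(ctx, opts)
		if apierrors.IsResourceExpired(err) {
			// HTTP 410 Gone: the continue token expired mid-pagination.
			// Today this failure marks the WHOLE cluster sync failed,
			// even though only this one GK needs to be re-listed.
			return err
		}
		if err != nil {
			return err
		}
		// ... store page.Items in the cluster cache ...
		opts.Continue = page.GetContinue()
		if opts.Continue == "" {
			return nil // all pages consumed
		}
	}
}
```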

The problem can compound itself: by hammering the k8s API with requests for every resource type every 10 seconds, we likely increase k8s response times and make further errors more likely.

To see if you're affected by this problem, search your logs for "Start syncing cluster". It should only happen every 24 hours by default. If you're seeing it more often than that, you're impacted.

I recommend two mitigations (sketched together below):

  1. Only retry the GKs that experienced errors. If we successfully cached the other 99 GKs, don't attempt to reload those items.
  2. Back off retries instead of using a static 10s interval. If the problem is caused by cluster load, retrying less often may alleviate it.
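
Here's a rough sketch of both mitigations combined. `syncGK` is a hypothetical per-GK sync function; gitops-engine doesn't expose anything with this shape today, and the backoff cap is an assumption:

```go
// Sketch: retry only the failed GKs, with exponential backoff between
// rounds. syncGK is hypothetical; intervals mirror the current default.
package sketch

import "time"

func resyncFailedGKs(failed []string, syncGK func(gk string) error) {
	backoff := 10 * time.Second        // start at the current default
	const maxBackoff = 5 * time.Minute // assumed cap, for illustration
	for len(failed) > 0 {
		var stillFailing []string
		for _, gk := range failed {
			// Mitigation 1: re-list only the GKs that actually failed;
			// the successfully cached GKs keep their state.
			if err := syncGK(gk); err != nil {
				stillFailing = append(stillFailing, gk)
			}
		}
		failed = stillFailing
		if len(failed) == 0 {
			return
		}
		// Mitigation 2: back off instead of retrying every 10s, so a
		// struggling API server isn't hammered even harder.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```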