I have a Kargo instance running in a large cluster; when the kargo-api pod starts, the k8s client fails #3149

Open
tal-hason opened this issue Dec 17, 2024 · 6 comments

Comments

@tal-hason
Contributor

tal-hason commented Dec 17, 2024

Checklist

  • [x] I've searched the issue queue to verify this is not a duplicate bug report.
  • [x] I've pasted the output of kargo version.
  • [x] I've pasted logs, if applicable.

Description

I have a large cluster (24 nodes) that generates a lot of events.

When the kargo-api pod starts, it tries to list all events cluster-wide. This can take some time; it then hits the timeout and crashes.

Version

1.0.3, 1.0.4, 1.1.1

Logs

time="2024-12-17T10:30:33Z" level=info msg="Starting Kargo API Server" GOMAXPROCS=160 GOMEMLIMIT=1620368842752 commit=d9932c7379444b0cc885c05fbc735f4495c65463 version=v1.1.1
time="2024-12-17T10:30:33Z" level=debug msg="loading in-cluster REST config"
W1217 10:31:12.969504       1 reflector.go:561] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1.Event: Get "https://172.30.0.1:443/api/v1/events?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6NDg5MDk2MjM3Mywic3RhcnQiOiJyaHYtY252LWRlbW8tcGV5dS9rdWJldmlydC1kaXNydXB0aW9uLWJ1ZGdldC05emNiYy4xODExZTkzYmMyOThiYzY2XHUwMDAwIn0&limit=500": context canceled
W1217 10:31:12.969577       1 reflector.go:484] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.Freight ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W1217 10:31:12.969699       1 reflector.go:484] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1.ServiceAccount ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
time="2024-12-17T10:31:12Z" level=error error="error creating Kubernetes client for Kargo API server: error building internal client: error waiting for cache sync"
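
A note for readers following the log: the `continue` parameter on the failed request shows the reflector was partway through a paginated list (`limit=500`) when the context was canceled. The Go sketch below inspects that token. It assumes the token is unpadded URL-safe base64 wrapping a small JSON document, which matches what this particular token contains; the format is an internal detail of the API server and is officially opaque, so treat this as a debugging aid only. The names `continueToken`, `decodeContinue`, and `logToken` are made up for illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// continueToken mirrors the JSON fields found inside the token.
// Assumption: this layout is an implementation detail, not a public API.
type continueToken struct {
	APIVersion      string `json:"v"`
	ResourceVersion int64  `json:"rv"`
	StartKey        string `json:"start"`
}

// decodeContinue base64-decodes the token and parses the embedded JSON.
func decodeContinue(token string) (continueToken, error) {
	var tok continueToken
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return tok, err
	}
	err = json.Unmarshal(raw, &tok)
	return tok, err
}

// logToken is copied verbatim from the failed request in the log above.
const logToken = "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6NDg5MDk2MjM3Mywic3RhcnQiOiJyaHYtY252LWRlbW8tcGV5dS9rdWJldmlydC1kaXNydXB0aW9uLWJ1ZGdldC05emNiYy4xODExZTkzYmMyOThiYzY2XHUwMDAwIn0"

func main() {
	tok, err := decodeContinue(logToken)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// The start key ends with a NUL byte; trim it for display.
	fmt.Printf("rv=%d next key=%q\n",
		tok.ResourceVersion, strings.TrimRight(tok.StartKey, "\x00"))
}
```

For the token in this log, the next key falls in the rhv-cnv-demo-peyu namespace, so the list was still paging through events from unrelated workloads when it was canceled.
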
@krancour
Member

@hiddeco @gdsoumya we only write events, right?

I'm thinking we probably should be excluding them from the cache.

@gdsoumya
Contributor

gdsoumya commented Dec 19, 2024

> @hiddeco @gdsoumya we only write events, right?
>
> I'm thinking we probably should be excluding them from the cache.

As far as I remember, we only write events, so we can exclude them from the cache.

@hiddeco
Contributor

hiddeco commented Dec 19, 2024

I think we cannot disable them in this case, because the logs are from the API server, where we actively list events in order to show the Events for a Project in the UI.

@krancour
Member

I forgot the UI could show them.

Could we conceivably use the field selector involvedObject.apiVersion to filter what is cached?

This field selector seems to be supported right out of the box:

https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/#list-of-supported-fields
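
To make the idea concrete, here is a minimal Go sketch of what a filtered list request would look like at the REST level. It only builds the URL; it does not show how (or whether) the cache in the Kargo client can be configured this way. The function name `eventsListURL` is hypothetical, and the value `kargo.akuity.io/v1alpha1` is an assumption about the apiVersion of the objects the events refer to.

```go
package main

import (
	"fmt"
	"net/url"
)

// eventsListURL builds a cluster-wide list request for core/v1 Events,
// restricted by a field selector on involvedObject.apiVersion.
// Hypothetical helper for illustration only.
func eventsListURL(apiServer, involvedAPIVersion string, limit int) string {
	q := url.Values{}
	q.Set("fieldSelector", "involvedObject.apiVersion="+involvedAPIVersion)
	q.Set("limit", fmt.Sprint(limit))
	// Encode sorts keys and percent-encodes "=" and "/" inside the values.
	return apiServer + "/api/v1/events?" + q.Encode()
}

func main() {
	// Server address and page size taken from the log in the issue description;
	// the apiVersion value is an assumption.
	fmt.Println(eventsListURL("https://172.30.0.1:443", "kargo.akuity.io/v1alpha1", 500))
}
```

With a selector like this, the server would return only events whose involved object has that apiVersion, rather than every event in the cluster.
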

Tangent: It makes me wonder if the involvedObject.apiGroup index we build might actually be unnecessary.

@hiddeco
Contributor

hiddeco commented Dec 19, 2024

With the way the client offers configuration options, I do not think this is possible.

@krancour
Member

Ok. We may have to put a pin in this one for now.

As we get to working on more operator-focused docs, we may want to recommend not running the Kargo control plane in such a large/busy cluster. It's not at all uncommon to have dedicated "management clusters" loaded up with things like Kargo, Argo CD, etc.

We could potentially add an option to disable the event list endpoint and UI tab. If opted out of that, we would never need to list/cache events. We'd still create events, but wouldn't make them accessible via API/UI.
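
A rough Go sketch of that opt-out is below, purely to illustrate the shape of the idea. Everything in it is hypothetical: `apiServerConfig`, `EventsAPIEnabled`, and `cachedKinds` are not existing Kargo names, and the list of kinds is just the ones that appear in the logs above.

```go
package main

import "fmt"

// apiServerConfig is a hypothetical config struct; no such option exists yet.
type apiServerConfig struct {
	// EventsAPIEnabled controls whether the event list endpoint and UI tab are served.
	EventsAPIEnabled bool
}

// cachedKinds returns the kinds the API server's client would list and watch.
// Events are only included when the event list endpoint is enabled;
// creating events does not require them to be cached.
func cachedKinds(cfg apiServerConfig) []string {
	kinds := []string{"Freight", "ServiceAccount"}
	if cfg.EventsAPIEnabled {
		kinds = append(kinds, "Event")
	}
	return kinds
}

func main() {
	fmt.Println(cachedKinds(apiServerConfig{EventsAPIEnabled: false}))
	fmt.Println(cachedKinds(apiServerConfig{EventsAPIEnabled: true}))
}
```

With the option off, startup would no longer wait on a cluster-wide event list, at the cost of events not being visible through the API or UI.
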


4 participants