Design for polling-first architecture #191
The reconciler will set the labels if those change. Relying on labels is just common practice: you can break a Service by changing labels, for instance. So I'm not sure that's a real flaw in the Kubernetes world.
> To avoid this, it only watches for

Could you link that? I'd like to take a look.
This looks like "watch every `HelmRelease`" to me:

It's kind of moot, though, because using a cache for each cluster, as I suggest, would result in at least namespace-scoped watches too. I think it's a bigger problem that it unconditionally starts watches for every cluster it can see, rather than just those involved in Pipelines.
The pipeline reconciler sets them, and (it looks to me like) it'll only do so every interval. If you update the labels on the HelmRelease yourself, on the other hand, I suspect it'll fire the watch and the informer will act on it straight away (with bogus labels). Or if not that, there's a window of however many minutes during which an update could happen, with consequences. (Side note -- I'm not sure the controller removes labels when a pipeline changes.) More generally, if we can avoid dispatching on information from a downstream cluster that can be changed by users, we should -- because it represents a possible exploit.
To this wider point: labels are certainly used to link dependent objects in a decoupled way, under the control of the user (Service falls into this category). Here and there they are used by controllers when there's no other way, e.g., if an ownerRef would be inappropriate. But here they are an unreliable mechanism, and I don't think we need them -- we can use indexing instead.
Here it is: https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/remote/cluster_cache_tracker.go

It is a public API, but I don't think it's a reliably public API, since it's deep down in a controller package. Also, it doesn't export a method for removing a downstream cluster cache. So it's not ideal; on the other hand, it'd be pretty simple to adapt.
Indeed, but my intention with that
I think we can create a set of all used targets by the pipelines and only listen to them.
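A minimal sketch of building that set of targets, assuming simplified `Pipeline` and `Target` shapes (the real Pipeline CRD fields differ -- these names are illustrative only):

```go
package main

import "fmt"

// Hypothetical, simplified shapes; the real Pipeline CRD fields differ.
type Target struct{ Cluster, Namespace string }
type Pipeline struct{ Targets []Target }

// targetClusters returns the deduplicated set of clusters referenced by any
// Pipeline, so that watches (or caches) are started only for clusters that
// are actually in use, rather than every cluster the controller can see.
func targetClusters(pipelines []Pipeline) map[string]struct{} {
	set := map[string]struct{}{}
	for _, p := range pipelines {
		for _, t := range p.Targets {
			set[t.Cluster] = struct{}{}
		}
	}
	return set
}

func main() {
	ps := []Pipeline{
		{Targets: []Target{{"dev", "apps"}, {"prod", "apps"}}},
		{Targets: []Target{{"dev", "other"}}},
	}
	fmt.Println(len(targetClusters(ps))) // dev and prod: prints 2
}
```

The controller would recompute this set whenever a Pipeline changes, and start or stop cluster watches accordingly.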
The goal of the labels is to help the controller identify the related pipeline. I'm not sure how that index would work, but it would indeed be more reliable.
What I suggest we do is write out two designs: the design as embodied by #180, and the design starting from the naive approach I outlined at the top. In both cases, we should address problems we've discussed above, and can otherwise anticipate, and incorporate those into the design. Considering viable alternatives will give us more confidence in whatever we choose to do. This can be an RFC in weaveworks/weave-gitops-private, since that seems to be where they go. Would you be willing to start us off with a draft @luizbafilho?
As a sketch: when examining a Pipeline object, make an entry in the index for the mapping

This differs a little from the standard pattern of indexing dependent objects in controller-runtime, because the objects are in general in another cluster, and each of those clusters has its own watcher. So it needs a bit of careful arranging so that the individual cluster watchers can read from the index.
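As a rough, hedged sketch of such a shared index -- the types (`objectKey`, `pipelineIndex`) are invented for illustration and are not from the actual codebase -- the Pipeline reconciler writes entries and the per-cluster watchers read them, hence the lock:

```go
package main

import (
	"fmt"
	"sync"
)

// objectKey identifies an object in a downstream cluster.
// Illustrative type; not from the actual codebase.
type objectKey struct{ Cluster, Namespace, Name string }

// pipelineIndex maps downstream objects to the Pipelines that reference
// them. It is written by the Pipeline reconciler and read concurrently by
// per-cluster watchers, so access is guarded by a read-write lock.
type pipelineIndex struct {
	mu sync.RWMutex
	m  map[objectKey][]string // value: namespaced names of Pipelines
}

func newPipelineIndex() *pipelineIndex {
	return &pipelineIndex{m: map[objectKey][]string{}}
}

// Set records, for a downstream object, which Pipelines reference it.
func (i *pipelineIndex) Set(k objectKey, pipelines []string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.m[k] = pipelines
}

// Lookup is what a cluster watcher would call on an event for object k:
// any Pipelines returned get enqueued for reconciliation.
func (i *pipelineIndex) Lookup(k objectKey) []string {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.m[k]
}

func main() {
	idx := newPipelineIndex()
	idx.Set(objectKey{"prod", "apps", "podinfo"},
		[]string{"flux-system/release-pipeline"})
	fmt.Println(idx.Lookup(objectKey{"prod", "apps", "podinfo"}))
}
```

Because the index is keyed by cluster as well as namespace and name, a watcher for one cluster never dispatches on events from another.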
doc link: weaveworks/weave-gitops-private#128
The initiative https://www.notion.so/weaveworks/Simplify-Pipelines-to-improve-the-user-experience-and-enable-scalability-28eef20db2ea4e9bb72a596b8a99a899#c9a35a2c55114d0e802a9fb49e8215cd describes several problems (problems 1 and 5) that relying on webhook notifications gives rise to:
All this adds up to: we need to implement polling, and treat webhook invocations as a trigger to poll the resource in question immediately. I think this is complicated enough that it's worth writing out a design.
There's previous work in this direction: #179 and PR #180. To recap here, this implementation
I think this approach is flawed on these counts: it watches `HelmRelease` objects in every cluster, which seems like it could be a lot of unnecessary work.

Instead, I suggest we should start with the pseudo-algorithm:
... then consider optimisations from there. For example, cluster-api has a client cache which could be used to make `HelmRelease` (and other "app" object) lookups less costly.

Part of the design should be a mechanism for webhooks to trigger polls, so that it's still possible to make the system more responsive with notifications.
It may also be possible to address problem 7 from the initiative, since polling will have more scope to calculate success from the whole status, and not just what it's told in a notification.
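For illustration, deriving success from the whole status might look like the following simplified sketch. `Condition` here only mirrors the general shape of Kubernetes status conditions; it is not the actual Pipelines status schema.

```go
package main

import "fmt"

// Condition mirrors the general shape of Kubernetes status conditions;
// this is a simplified illustration, not the actual Pipelines schema.
type Condition struct{ Type, Status string }

// isReady derives success from the full set of conditions, rather than
// from whatever a single webhook notification happened to report.
func isReady(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition yet: not ready
}

func main() {
	fmt.Println(isReady([]Condition{
		{"Reconciling", "True"},
		{"Ready", "True"},
	})) // prints true
}
```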