Description
The initiative https://www.notion.so/weaveworks/Simplify-Pipelines-to-improve-the-user-experience-and-enable-scalability-28eef20db2ea4e9bb72a596b8a99a899#c9a35a2c55114d0e802a9fb49e8215cd describes several problems (problems 1 and 5) that arise from relying on webhook notifications:
- you have to create notification resources in all downstream clusters (that you gate promotions on), which means larger templates, more room for mistakes, more permissions needed, etc.;
- the implementation uses information passed in the webhook URL and payload to make decisions, which makes it vulnerable to injection attacks;
- if the server misses a notification, it is left with the wrong state.
All this adds up to: we need to implement polling, and treat webhook invocations as a trigger to poll the resource in question immediately. I think this is complicated enough that it's worth writing out a design.
There's previous work in this direction: #179 and PR #180. To recap, that implementation:
- creates a HelmRelease watcher for every cluster
- writes labels to each HelmRelease used in a pipeline
- dispatches HelmRelease updates from the watcher by examining the labels
I think this approach is flawed on these counts:
- it needs to be able to write to every downstream cluster
- it relies on labels on the downstream objects for dispatch, and these could be changed
- (it looks to me like) it unconditionally watches every HelmRelease in every cluster, which seems like it could be a lot of unnecessary work
Instead, I suggest we should start with the pseudo-algorithm:

    for each Pipeline
        for each Environment
            for each Target
                get a downstream cluster client if necessary
                retrieve the app status
        for each Environment[1:]
            calculate whether a promotion is indicated, and if so, invoke it

... then consider optimisations from there. For example, cluster-api has a client cache which could be used to make HelmRelease (and other "app" object) lookups less costly.
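
To make the pseudo-algorithm a little more concrete, here is a minimal Python sketch of one polling pass. Everything in it is an assumption for illustration: the `Pipeline`, `Environment` and `Target` shapes only mirror the terms used above, `ClientCache` is a plain memoising stand-in (not the cluster-api cache), the client is any object with a hypothetical `get_status(target)` method, and the promotion rule (every target in the previous environment is ready at the same revision, and some target in this environment is not yet at that revision) is just one plausible reading of "a promotion is indicated".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Target:
    cluster: Optional[str]  # assumption: None means the management cluster
    namespace: str
    name: str


@dataclass
class Environment:
    name: str
    targets: List[Target]


@dataclass
class Pipeline:
    name: str
    environments: List[Environment]


@dataclass
class AppStatus:
    ready: bool
    revision: str


class ClientCache:
    """Builds at most one client per cluster and reuses it afterwards."""

    def __init__(self, factory):
        self._factory = factory  # hypothetical: cluster name -> client
        self._clients = {}

    def get(self, cluster):
        if cluster not in self._clients:
            self._clients[cluster] = self._factory(cluster)
        return self._clients[cluster]


def poll_pipeline(pipeline, clients, promote):
    """Run one polling pass over a pipeline, read-only on every cluster."""
    # Pass 1: retrieve the app status of every target in every environment.
    statuses = {}
    for env in pipeline.environments:
        for target in env.targets:
            client = clients.get(target.cluster)
            statuses[target] = client.get_status(target)

    # Pass 2: for each environment after the first, compare it with the
    # environment before it and decide whether to promote.
    for prev, env in zip(pipeline.environments, pipeline.environments[1:]):
        upstream = [statuses[t] for t in prev.targets]
        revisions = {s.revision for s in upstream}
        # Assumed rule: all upstream targets are ready at a single revision.
        if not upstream or len(revisions) != 1:
            continue
        if not all(s.ready for s in upstream):
            continue
        revision = revisions.pop()
        if any(statuses[t].revision != revision for t in env.targets):
            promote(pipeline, env, revision)
```

Note that nothing here writes to a downstream cluster or depends on labels: the set of objects to read comes entirely from the Pipeline definition, and the only per-cluster requirement is read access to the targets.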
Part of the design should be a mechanism for webhooks to trigger polls, so that it's still possible to make the system more responsive with notifications.
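
One possible shape for that mechanism, sketched in Python under stated assumptions: a scheduler keeps a next-poll time per known resource, and a webhook does nothing except bring that time forward. The `PollScheduler` class, its method names, and the `(cluster, namespace, name)` key are all hypothetical. The point of the sketch is that the webhook contributes only "look at this now"; state always comes from the subsequent poll, and a trigger for a resource that no pipeline refers to is rejected.

```python
class PollScheduler:
    """Tracks when each known resource is next due to be polled."""

    def __init__(self, known, interval):
        self._interval = interval
        # Every known resource is due immediately at start-up.
        self._due = {key: 0.0 for key in known}

    def trigger(self, key):
        """Called from a webhook handler. Returns False for unknown keys,
        so a crafted URL or payload cannot add new things to poll."""
        if key not in self._due:
            return False
        self._due[key] = 0.0  # due as soon as the poller next looks
        return True

    def pop_due(self, now):
        """Return the keys due at `now` (sorted) and reschedule each of
        them one interval later."""
        due = sorted(key for key, at in self._due.items() if at <= now)
        for key in due:
            self._due[key] = now + self._interval
        return due
```

With this arrangement a missed notification costs at most one polling interval of latency, rather than leaving the server with stale state.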
It may also be possible to address problem 7 from the initiative, since polling will have more scope to calculate success from the whole status, and not just what it's told in a notification.
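
As a hedged illustration of what "calculate success from the whole status" might look like, the Python sketch below inspects a HelmRelease-shaped object as a plain dict. It assumes the status fields `observedGeneration`, `conditions`, `lastAppliedRevision` and `lastAttemptedRevision` as exposed by Flux's HelmRelease; the function name `app_succeeded` and the specific rule it applies are assumptions, not something the initiative prescribes.

```python
def app_succeeded(obj):
    """Return (ok, revision) for a HelmRelease-shaped dict.

    Assumed rule: the controller has observed the latest spec, the Ready
    condition is True, and the last applied revision is the last one
    attempted.
    """
    status = obj.get("status", {})
    generation = obj.get("metadata", {}).get("generation")
    if generation is None or status.get("observedGeneration") != generation:
        return False, None  # status may describe an older spec

    ready = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Ready"),
        None,
    )
    if ready is None or ready.get("status") != "True":
        return False, None

    applied = status.get("lastAppliedRevision")
    if not applied or applied != status.get("lastAttemptedRevision"):
        return False, None  # a newer revision was attempted but not applied
    return True, applied
```

A notification typically carries a single event, whereas a check like this can take the generation, conditions and revisions into account together.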