Design for polling-first architecture

The initiative https://www.notion.so/weaveworks/Simplify-Pipelines-to-improve-the-user-experience-and-enable-scalability-28eef20db2ea4e9bb72a596b8a99a899#c9a35a2c55114d0e802a9fb49e8215cd describes several problems (problems 1 and 5) that arise from relying on webhook notifications:

 - you have to create notification resources in every downstream cluster that you gate promotions on, which means larger templates, more room for mistakes, more permissions needed, etc.;
 - the implementation uses information passed in the webhook URL and payload to make decisions, which makes it vulnerable to injection attacks;
 - if the server misses a notification, it is left with the wrong state.

All this adds up to: we need to implement polling, and treat webhook invocations as a trigger to poll the resource in question immediately. I think this is complicated enough that it's worth writing out a design.

There's previous work in this direction: https://github.com/weaveworks/pipeline-controller/issues/179 and PR https://github.com/weaveworks/pipeline-controller/pull/180. To recap, that implementation:
 - creates a `HelmRelease` watcher for every cluster
 - writes labels to each `HelmRelease` used in a pipeline
 - dispatches `HelmRelease` updates from the watcher by examining the labels

I think this approach is flawed on these counts:
 - it needs to be able to write to every downstream cluster
 - it relies on labels on the downstream objects for dispatch, and these could be changed
 - (it looks to me like) it unconditionally watches every `HelmRelease` in every cluster, which seems like it could be a lot of unnecessary work

Instead, I suggest we start with this pseudo-algorithm:

```
for each Pipeline
  for each Environment
    for each Target
      get a downstream cluster client if necessary
      retrieve the app status
  for each Environment[1:]
    calculate whether a promotion is indicated, and if so, invoke it
```

... then consider optimisations from there. For example, cluster-api has a client cache which could be used to make `HelmRelease` (and other "app" object) lookups less costly.
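
To make the pseudo-algorithm concrete, here is a rough Go sketch of a single polling pass. Everything in it is hypothetical: the `Pipeline`, `Environment`, `Target` and `AppStatus` types are simplified stand-ins rather than the real API types, `StatusReader` is an assumed interface behind which a (possibly cached) downstream cluster client would sit, and the promotion rule (promote when the previous environment is healthy and the next one reports a different version) is only a placeholder for whatever the design settles on.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the pipeline API types.
type Target struct{ Cluster, App string }

type Environment struct {
	Name    string
	Targets []Target
}

type Pipeline struct {
	Name         string
	Environments []Environment
}

// AppStatus is the part of an app object's status the decision looks at.
type AppStatus struct {
	Version string
	Ready   bool
}

// StatusReader hides "get a downstream cluster client if necessary" and
// "retrieve the app status"; a client cache would live behind it.
type StatusReader interface {
	AppStatus(t Target) (AppStatus, error)
}

// Promoter hides how a promotion is actually invoked.
type Promoter interface {
	Promote(pipeline, env, version string) error
}

type envState struct {
	version string // version reported by the first target
	healthy bool   // all targets ready and on the same version
}

func pollEnvironment(r StatusReader, env Environment) (envState, error) {
	state := envState{healthy: len(env.Targets) > 0}
	for i, t := range env.Targets {
		st, err := r.AppStatus(t)
		if err != nil {
			return envState{}, fmt.Errorf("target %s/%s: %w", t.Cluster, t.App, err)
		}
		if i == 0 {
			state.version = st.Version
		}
		if !st.Ready || st.Version != state.version {
			state.healthy = false
		}
	}
	return state, nil
}

// PollPipeline does one pass: read every environment, then walk
// Environment[1:] and promote where the (assumed) rule says so.
func PollPipeline(r StatusReader, pr Promoter, p Pipeline) error {
	states := make([]envState, len(p.Environments))
	for i, env := range p.Environments {
		st, err := pollEnvironment(r, env)
		if err != nil {
			return fmt.Errorf("pipeline %s, environment %s: %w", p.Name, env.Name, err)
		}
		states[i] = st
	}
	for i := 1; i < len(p.Environments); i++ {
		prev, cur := states[i-1], states[i]
		if prev.healthy && cur.version != prev.version {
			if err := pr.Promote(p.Name, p.Environments[i].Name, prev.version); err != nil {
				return err
			}
		}
	}
	return nil
}

// In-memory fakes, for illustration only.
type mapReader map[Target]AppStatus

func (m mapReader) AppStatus(t Target) (AppStatus, error) {
	st, ok := m[t]
	if !ok {
		return AppStatus{}, fmt.Errorf("no status recorded")
	}
	return st, nil
}

type recorder struct{ calls []string }

func (r *recorder) Promote(pipeline, env, version string) error {
	r.calls = append(r.calls, env+"@"+version)
	return nil
}

func main() {
	dev, prod := Target{"dev", "podinfo"}, Target{"prod", "podinfo"}
	r := mapReader{dev: {"1.1.0", true}, prod: {"1.0.0", true}}
	p := Pipeline{"podinfo", []Environment{{"dev", []Target{dev}}, {"prod", []Target{prod}}}}
	rec := &recorder{}
	if err := PollPipeline(r, rec, p); err != nil {
		panic(err)
	}
	fmt.Println(rec.calls)
}
```

Note that this pass reads state only, so it needs no write access to downstream clusters and does not depend on labels. As written it would request the same promotion on every pass until the next environment catches up, so a real controller would probably need to record that a promotion is already in flight.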

Part of the design should be a mechanism for webhooks to trigger polls, so that it's still possible to make the system more responsive with notifications.
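
One possible shape for that mechanism, sketched in Go under stated assumptions: a poller that runs on a fixed interval and also accepts a "poll this pipeline now" request, which is all a webhook handler would send. The `Poller` type, its method names, the queue size and the callback signatures are invented for illustration and are not part of the existing controller.

```go
package main

import (
	"fmt"
	"time"
)

// Poller polls every known pipeline on an interval, and a single pipeline
// immediately when triggered. All names here are hypothetical.
type Poller struct {
	interval time.Duration
	names    func() []string   // lists the pipelines the server knows about
	poll     func(name string) // reads the real state from the clusters
	trigger  chan string
}

func NewPoller(interval time.Duration, names func() []string, poll func(string)) *Poller {
	return &Poller{interval: interval, names: names, poll: poll, trigger: make(chan string, 16)}
}

// Trigger asks for an immediate poll. It carries only a pipeline name; no
// state is taken from the webhook payload. It reports false if the queue is
// full, in which case the next periodic poll still picks the change up.
func (p *Poller) Trigger(name string) bool {
	select {
	case p.trigger <- name:
		return true
	default:
		return false
	}
}

func (p *Poller) known(name string) bool {
	for _, n := range p.names() {
		if n == name {
			return true
		}
	}
	return false
}

// Run blocks until stop is closed.
func (p *Poller) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(p.interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for _, n := range p.names() {
				p.poll(n)
			}
		case name := <-p.trigger:
			// Ignore names the server does not already know about.
			if p.known(name) {
				p.poll(name)
			}
		}
	}
}

func main() {
	polled := make(chan string, 1)
	p := NewPoller(time.Hour,
		func() []string { return []string{"podinfo"} },
		func(name string) { polled <- name })
	stop := make(chan struct{})
	go p.Run(stop)
	p.Trigger("podinfo") // what a webhook handler would call
	fmt.Println("polled:", <-polled)
	close(stop)
}
```

Because the trigger only names a pipeline and the poll reads the actual status, a forged or malformed notification can at worst cause an extra poll, and a missed notification is corrected by the next periodic pass.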

It may also be possible to address problem 7 from the initiative, since polling will have more scope to calculate success from the whole status, and not just what it's told in a notification.

