Design for polling-first architecture #191

Closed
weaveworks/weave-gitops-private
#128
@squaremo

Description

The initiative https://www.notion.so/weaveworks/Simplify-Pipelines-to-improve-the-user-experience-and-enable-scalability-28eef20db2ea4e9bb72a596b8a99a899#c9a35a2c55114d0e802a9fb49e8215cd describes several problems (problems 1 and 5) that relying on webhook notifications gives rise to:

  • you have to create notification resources in every downstream cluster that you gate promotions on, which means larger templates, more room for mistakes, more permissions needed, etc.;
  • the implementation uses information passed in the webhook URL and payload to make decisions, which makes it vulnerable to injection attacks;
  • if the server misses a notification, it is left with the wrong state.

All this adds up to: we need to implement polling, and treat webhook invocations as a trigger to poll the resource in question immediately. I think this is complicated enough that it's worth writing out a design.

There's previous work in this direction: #179 and PR #180. To recap, that implementation

  • creates a HelmRelease watcher for every cluster
  • writes labels to each HelmRelease used in a pipeline
  • dispatches HelmRelease updates from the watcher by examining the labels

I think this approach is flawed on these counts:

  • it needs to be able to write to every downstream cluster
  • it relies on labels on the downstream objects for dispatch, and these could be changed
  • (it looks to me like) it unconditionally watches every HelmRelease in every cluster, which seems like it could be a lot of unnecessary work

Instead, I suggest we should start with the pseudo-algorithm:

for each Pipeline
  for each Environment
    for each Target
      get a downstream cluster client if necessary
      retrieve the app status
  for each Environment[1:]
    calculate whether a promotion is indicated, and if so, invoke it

... then consider optimisations from there. For example, cluster-api has a client cache which could be used to make HelmRelease (and other "app" object) lookups less costly.
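
To make the pseudo-algorithm a bit more concrete, here is a rough Go sketch of one polling pass. Everything named here is hypothetical and only for illustration: the `Pipeline`, `Environment`, `Target` and `AppStatus` types are simplified stand-ins rather than the real API types, `StatusGetter` stands in for "get a downstream cluster client if necessary, retrieve the app status", and `StaticStatuses` is a fake used for trying it out. The promotion rule is also an assumption, since the design has yet to settle it: an environment is promoted when every target in the previous environment is ready at the same version and at least one of its own targets is at a different version. The sketch returns the promotions it would invoke rather than invoking them.

```go
package main

import "fmt"

// AppStatus is a hypothetical summary of an "app" object such as a HelmRelease.
type AppStatus struct {
	Version string
	Ready   bool
}

// Target identifies an app object in some cluster (illustrative fields only).
type Target struct {
	Cluster, Namespace, Name string
}

type Environment struct {
	Name    string
	Targets []Target
}

type Pipeline struct {
	Name         string
	Environments []Environment
}

// StatusGetter hides how a (possibly downstream) cluster is reached.
type StatusGetter interface {
	GetStatus(t Target) (AppStatus, error)
}

// StaticStatuses is a fake StatusGetter backed by a map, for experiments.
type StaticStatuses map[Target]AppStatus

func (s StaticStatuses) GetStatus(t Target) (AppStatus, error) {
	st, ok := s[t]
	if !ok {
		return AppStatus{}, fmt.Errorf("no status for %s/%s in cluster %q", t.Namespace, t.Name, t.Cluster)
	}
	return st, nil
}

// Promotion says that an environment should be moved to Version.
type Promotion struct {
	PipelineName, EnvironmentName, Version string
}

// settledVersion reports the version shared by all statuses, provided every
// target is ready and the version is non-empty (assumed rule, see above).
func settledVersion(statuses []AppStatus) (string, bool) {
	if len(statuses) == 0 {
		return "", false
	}
	version := statuses[0].Version
	for _, st := range statuses {
		if !st.Ready || st.Version != version {
			return "", false
		}
	}
	return version, version != ""
}

// PollOnce makes one pass over all pipelines and returns indicated promotions.
func PollOnce(pipelines []Pipeline, getter StatusGetter) []Promotion {
	var promotions []Promotion
	for _, p := range pipelines {
		// statuses[i] stays empty if any lookup in environment i failed.
		statuses := make([][]AppStatus, len(p.Environments))
		for i, env := range p.Environments {
			for _, t := range env.Targets {
				st, err := getter.GetStatus(t)
				if err != nil {
					statuses[i] = nil
					break
				}
				statuses[i] = append(statuses[i], st)
			}
		}
		// Environment[1:]: compare each environment with the one before it.
		for i := 1; i < len(p.Environments); i++ {
			version, ok := settledVersion(statuses[i-1])
			if !ok || len(statuses[i]) == 0 {
				continue // upstream not settled, or downstream state unknown
			}
			for _, st := range statuses[i] {
				if st.Version != version {
					promotions = append(promotions, Promotion{p.Name, p.Environments[i].Name, version})
					break
				}
			}
		}
	}
	return promotions
}
```

Note that nothing here writes to a downstream cluster or depends on labels; the only per-cluster requirement is read access to the targets a pipeline actually names.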

Part of the design should be a mechanism for webhooks to trigger polls, so that it's still possible to make the system more responsive with notifications.
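
One possible shape for that mechanism, again as a hedged Go sketch rather than a worked-out design: the webhook handler records only *which* pipeline to poll, and the poll loop wakes early to look at the real state. The names `PollRequests`, `Run`, the `pipeline` query parameter, and the `known`, `pollAll` and `pollOne` callbacks are all hypothetical. The point of the sketch is that the webhook payload is never used to make a decision, and a missed notification only delays the next poll instead of leaving the server with the wrong state.

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// PollRequests coalesces requests to poll a pipeline immediately.
type PollRequests struct {
	mu      sync.Mutex
	pending map[string]struct{}
	wake    chan struct{}
}

func NewPollRequests() *PollRequests {
	return &PollRequests{
		pending: map[string]struct{}{},
		wake:    make(chan struct{}, 1),
	}
}

// Request marks a pipeline for an immediate poll; repeats coalesce.
func (r *PollRequests) Request(pipeline string) {
	r.mu.Lock()
	r.pending[pipeline] = struct{}{}
	r.mu.Unlock()
	select {
	case r.wake <- struct{}{}:
	default: // a wake-up is already queued
	}
}

// Drain returns the pending pipeline names (in no particular order) and clears them.
func (r *PollRequests) Drain() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.pending))
	for name := range r.pending {
		names = append(names, name)
	}
	r.pending = map[string]struct{}{}
	return names
}

// Handler is a webhook endpoint that treats the request purely as a hint.
// The request body is ignored; the name is checked against known pipelines.
func (r *PollRequests) Handler(known func(string) bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		name := req.URL.Query().Get("pipeline")
		if !known(name) {
			http.NotFound(w, req)
			return
		}
		r.Request(name)
		w.WriteHeader(http.StatusAccepted)
	})
}

// Run polls everything on each tick and polls requested pipelines as soon as
// a webhook asks for it. The interval must be positive. Run returns when ctx
// is cancelled.
func Run(ctx context.Context, interval time.Duration, r *PollRequests, pollAll func(), pollOne func(string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pollAll()
		case <-r.wake:
			for _, name := range r.Drain() {
				pollOne(name)
			}
		}
	}
}
```

A burst of notifications for the same pipeline collapses into a single poll, so a noisy or malicious sender can cause at most extra reads, not extra promotions.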

It may also be possible to address problem 7 from the initiative, since polling will have more scope to calculate success from the whole status, and not just what it's told in a notification.
