Availability - zero-downtime / zero-impact upgrades #18

Open
@manasdk

(Forking from #75)

Another thing related to availability is zero-downtime / zero-impact (rolling) upgrades.

As far as code changes go, this means we need to add support for "draining" to all the components. Draining means letting a service stop accepting new requests / stop consuming messages off the queue and finish processing the requests / messages which are currently active (with some kind of hard limit / timeout so we don't deadlock). This should be relatively easy.
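To make the draining idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `DrainableWorker` is a hypothetical class, the queue is an in-memory `queue.Queue` rather than a real message bus, and none of it reflects existing StackStorm code.

```python
import queue
import threading
import time


class DrainableWorker:
    """Hypothetical worker that consumes from an in-memory queue and can be drained."""

    def __init__(self, source, handler):
        self._source = source      # queue.Queue standing in for the message bus
        self._handler = handler    # callable invoked once per message
        self._draining = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Poll with a short timeout so the draining flag is re-checked regularly.
        while not self._draining.is_set():
            try:
                message = self._source.get(timeout=0.05)
            except queue.Empty:
                continue
            # A message that has been picked up is always processed to completion.
            self._handler(message)

    def drain(self, timeout):
        """Stop consuming and wait up to `timeout` seconds for in-flight work.

        Returns True if the worker finished in time, False if the hard limit
        was hit and the caller has to decide whether to kill the process.
        """
        self._draining.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

A service would call `drain(timeout=...)` from its shutdown path (for example a signal handler) and only exit once it returns. Note that the worker may still pick up one message that arrives during the poll that is already in progress when `drain` is called.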

Another part is documenting how to deploy those components to achieve zero-downtime / zero-impact upgrades. This kinda ties into redundancy, but in general it means having multiple instances of each service running. During an upgrade, this means starting a new instance of the service being upgraded, running the new code base, so it starts processing new requests, and draining the old instance before it is replaced by the one running the new version. For the API service, it's a bit different: the user needs to set up some kind of load balancer (e.g. haproxy / nginx / some kind of aaS thing) and also perform draining on the load balancer.
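The upgrade sequence described above can be sketched as a small in-memory simulation. This is one possible reading of the procedure, not a deployment tool: `Instance` and `rolling_upgrade` are hypothetical names, and `drain` / `stop` are placeholders for whatever the real environment uses to drain and terminate a process.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one running service process."""
    name: str
    version: str
    accepting: bool = True
    running: bool = True

    def drain(self, timeout):
        # Placeholder: a real instance would stop taking new work here and
        # finish in-flight requests / messages within `timeout` seconds.
        self.accepting = False
        return True

    def stop(self):
        self.running = False


def rolling_upgrade(pool, new_version, drain_timeout=30.0):
    """Replace the instances in `pool` one at a time and return the new pool."""
    upgraded = []
    for old in list(pool):
        # 1. Start the replacement first so capacity never drops below the
        #    original pool size.
        upgraded.append(Instance(name=old.name + "-new", version=new_version))
        # 2. Drain the old instance so its in-flight work can finish.
        old.drain(drain_timeout)
        # 3. Retire the old instance.
        old.stop()
    return upgraded
```

For the API service, step 2 would additionally involve taking the old instance out of rotation on the load balancer before stopping it.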

So in short, in addition to the code changes we will also need some good docs:

  1. Operation guide - monitoring, how to check cluster and services health, dashboard, etc.
  2. Upgrade guide - how to perform zero impact rolling upgrades
  3. HA deployment guide - how to deploy StackStorm in an HA manner (multiple instances of each service, how to configure load-balancing for API, etc.)
  4. Potentially some code to help with the rolling upgrades, but some of that code might be environment specific
