
[2015] Drain StackStorm & Rolling Upgrade #13

Open
@manasdk

Description


Problem formulation

At the highest level StackStorm needs to be able to support the following -

  1. Red-black / blue-green upgrade, basically as described in http://cloudnative.io/blog/2015/02/the-dos-and-donts-of-bluegreen-deployment/
  2. Rolling upgrade
  3. Safe content upgrade on the box.

The actual StackStorm upgrade or content update is not a concern of the StackStorm platform itself. It will be managed by deployment tools that sit outside of StackStorm (whew!). Well then, aren't we done; can we go home and play video games yet? Well, no.

StackStorm keeps performing work; therefore, for any external system to handle (1) or (2) correctly, StackStorm needs to be able to enter a drain state (https://en.wikipedia.org/wiki/drain). Once it has successfully entered this state, a rolling upgrade can proceed or content can be upgraded on the box.

Definition of work in the context of ST2

It is best to agree on the definition of work. The following fall under the broad category of work -

  1. Running any execution (action or workflow)
  2. Triggering timers
  3. Accepting webhooks
  4. Accepting TriggerInstances from any sensor
  5. Internal triggerinstances like actiontrigger, notifytrigger, kvupdate, sensorupdate, sensorexit etc.

Define Drained StackStorm

  1. Accept new work but do not perform new work (this really depends on the case)
  2. Existing work must continue to completion
  3. When/if the system is taken out of the drained state, pick up all the queued-up work (a rough sketch of these semantics follows below).
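
A minimal sketch of what these drain semantics could look like inside a single service. Everything here (`DrainState`, `WorkGate`, `submit`) is hypothetical naming for illustration, not part of st2:

```python
import enum
import queue


class DrainState(enum.Enum):
    RUNNING = 'running'    # accept and perform work
    DRAINING = 'draining'  # finish in-flight work; queue (or reject) new work
    DRAINED = 'drained'    # nothing in flight; safe to stop the process


class WorkGate(object):
    """Gates new work on the current drain state (illustrative only)."""

    def __init__(self):
        self.state = DrainState.RUNNING
        self.pending = queue.Queue()  # work accepted while draining

    def submit(self, work_item):
        if self.state is DrainState.RUNNING:
            self._perform(work_item)
        else:
            # Depending on the use-case this could also reject outright.
            self.pending.put(work_item)

    def undrain(self):
        # Leaving the drained state: pick up everything that was queued.
        self.state = DrainState.RUNNING
        while not self.pending.empty():
            self._perform(self.pending.get())

    def _perform(self, work_item):
        pass  # hand off to the actual runner / dispatcher
```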

Use-case: Red-Black upgrade

This is specifically to support the NFLX use-case.

A typical StackStorm deployment is -

  • 3 identical StackStorm boxes
  • Independent RabbitMQ (no cluster)
  • Shared DB
  • Shared content
  • 1 sensor per node, with all sensors reading off an SQS queue.

Process

  1. Start a new upgraded st2 instance and make it part of the cluster, i.e. connect its sensor to the same queue.
  2. Drain the red node and wait for it to complete its work. In this case it is safe for the sensor to simply quit and stop accepting any work.
  3. Have a way to identify that the red node is no longer performing any work and can be safely shut down (see the drain-and-poll sketch after this list).
  4. Continue updating the other nodes.
  5. As one node drains and stops accepting any more work, the other nodes continue to perform work, so there is no downtime. It is important to preserve this property.
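
As a sketch, the external deployment tooling could drain a red node and detect that it is idle roughly like this, assuming a hypothetical admin endpoint (`/admin/state`, discussed in the design section below) that st2 does not expose today:

```python
import time

import requests

NODE = 'http://st2-red-node-1:9101'  # hypothetical admin endpoint on the red node


def drain_and_wait(node, poll_interval=10):
    # Step 2: ask the node to stop accepting new work.
    requests.post('%s/admin/state' % node, json={'state': 'drain'})

    # Step 3: poll until all in-flight work on that node has completed.
    while True:
        status = requests.get('%s/admin/state' % node).json()
        if status.get('state') == 'drained' and status.get('inflight', 0) == 0:
            break
        time.sleep(poll_interval)

    # The node is idle; it can now be shut down and removed from the cluster
    # while the remaining nodes keep performing work (step 5).


drain_and_wait(NODE)
```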

Use-case: Rolling upgrade

Again a multi-node setup, where the upgrade of bits happens on every node.

A deployment would be as follows -

  • More than 1 StackStorm node. Let's assume there are 3.
  • Each node runs some or all StackStorm services.
  • DB is shared
  • RabbitMQ is shared

Process

  1. Drain node 1 (stop accepting work)
  2. Other nodes continue to perform work.
  3. All ActionRunners can be shut down as soon as their current executions complete.
  4. Notify, or make the drained state queryable.
  5. Services are stopped
  6. Bits are upgraded
  7. Services are restarted and work starts again (a sketch of this per-node loop follows below).
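
A sketch of the per-node driver for that loop. The node list, the admin endpoint, the service names and the ssh/apt-get calls are placeholders for whatever deployment tooling is actually used:

```python
import subprocess
import time

import requests

NODES = ['st2-node-1', 'st2-node-2', 'st2-node-3']
SERVICES = ['st2actionrunner', 'st2rulesengine', 'st2sensorcontainer', 'st2api']


def drain(node):
    # Step 1: hypothetical admin endpoint, as in the red-black sketch above.
    requests.post('http://%s:9101/admin/state' % node, json={'state': 'drain'})


def wait_until_drained(node, poll_interval=10):
    # Steps 2-4: the other nodes keep working; we just poll this node's state.
    while requests.get('http://%s:9101/admin/state' % node).json().get('state') != 'drained':
        time.sleep(poll_interval)


def rolling_upgrade(nodes):
    for node in nodes:
        drain(node)
        wait_until_drained(node)
        for svc in SERVICES:  # step 5: stop services on the drained node
            subprocess.check_call(['ssh', node, 'service', svc, 'stop'])
        # Step 6: upgrade the bits (package manager / config management).
        subprocess.check_call(['ssh', node, 'apt-get', 'install', '-y', 'st2'])
        for svc in SERVICES:  # step 7: restart services; work starts again
            subprocess.check_call(['ssh', node, 'service', svc, 'start'])


rolling_upgrade(NODES)
```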

Use-case: Content upgrade

Process

  1. Drain StackStorm and wait for all work to be drained.
  2. It is OK for sensors to accept trigger-instances, or for new executions to be accepted, so long as they are queued for later execution. (We may want to relax this requirement.)
  3. Have a way to identify that work is entirely drained.
  4. Replace and register content.
  5. Un-drain the system.

Note: It is likely that some old executions or trigger-instances that are queued up are no longer valid. This makes continuing to accept new work somewhat hairy.

Design and Implementation

The actual quiescing has a different meaning depending on context -

  1. red-black upgrade - no new work is accepted
  2. content upgrade - work is accepted and queued for later execution

Enter drain state

Regardless of these nuances, there clearly needs to be a way to signal all processes that they need to -

  1. stop performing new work
  2. potentially queue up new work, or reject it completely

SIGUSR2 could be a good way to notify all processes. However, the downside is that we cannot pass rich information to the processes telling them to stop performing work. Also, the assumption that every StackStorm process is on the same box is an OK assumption for NFLX but not everywhere else - not even in our own deployments.
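
For the signal approach, each process would install a handler roughly like this (stdlib only; st2 processes do not currently handle SIGUSR2 this way, and the flag name is made up):

```python
import signal

DRAINING = False  # checked by the dispatch loop before picking up new work


def _enter_drain(signum, frame):
    """On SIGUSR2, stop accepting new work; in-flight work runs to completion."""
    global DRAINING
    DRAINING = True


# Each StackStorm process would register this at startup.
signal.signal(signal.SIGUSR2, _enter_drain)
```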

A control channel built on RabbitMQ, which can be used to send rich messages to all processes, is an option. At the highest level this would be triggered by an API call, e.g. POST @ /admin/state.
This is somewhat non-standard and might lead to confusion, but since StackStorm is a system of co-operating processes with no machine affinity this might be our only option. A rich control channel also leaves room for many future extensions.
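
A sketch of the control-channel idea, assuming a new fanout exchange (st2.control, which does not exist today) and kombu, which st2 already uses for messaging; the message shape is made up:

```python
from kombu import Connection, Exchange

# Hypothetical fanout exchange that every st2 process would subscribe to.
CONTROL_XCHG = Exchange('st2.control', type='fanout')


def broadcast_drain(amqp_url, mode='reject'):
    """Would be published by st2api when it receives POST @ /admin/state."""
    with Connection(amqp_url) as conn:
        producer = conn.Producer(serializer='json')
        producer.publish({'command': 'drain', 'mode': mode},  # mode: 'reject' or 'queue'
                         exchange=CONTROL_XCHG,
                         declare=[CONTROL_XCHG])


broadcast_drain('amqp://guest:guest@localhost:5672//')
```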

Identify completion

It is important to know when work is complete. Knowing that no new executions will run is the only point at which a user can proceed to the next step in an upgrade or content update.

Executions are easy to track, and it is easy to identify their completion. However, it is also important to know when an entire rule chain (like st2build002) will no longer fire. This way we would not end up with pipelines in a half-baked state. (We will likely make some simplifying assumptions.)
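
One possible completion check, sketched directly against the execution collection in MongoDB; the collection name and status values here are assumptions rather than the exact st2 schema, and rule-chain completion would still need the simplifying assumptions mentioned above:

```python
from pymongo import MongoClient

# Assumed non-terminal statuses; the real st2 status set may differ.
INCOMPLETE_STATUSES = ['requested', 'scheduled', 'running']


def work_is_drained(mongo_url='mongodb://localhost:27017'):
    db = MongoClient(mongo_url)['st2']
    # No executions left in a non-terminal state means work has fully drained.
    remaining = db['action_execution_d_b'].count_documents(
        {'status': {'$in': INCOMPLETE_STATUSES}})
    return remaining == 0
```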

Exit drain

In the case of a content upgrade, another signal is sent to exit the drain state; in the case of a red-black upgrade, new services are started and the old ones removed.
