Proposed design to address Polling Sensor HA #21

Open
@vivekdatar

1 Problem Description

See StackStorm/st2#4301 for problem description. Refer to https://docs.stackstorm.com/reference/ha.html for StackStorm HA design and block diagram.

The summary is as follows.

StackStorm polling sensors are not HA aware. A polling sensor polls without any knowledge of HA, so if another polling sensor is doing the same job in HA mode, both continue to poll independently. This is a problem when running HA in active-active mode. If customers split the polling sensors in HA mode, the duplication is avoided and the system runs fine, at the cost of a single point of failure for each polling sensor.

In active-active HA mode, duplicate polling sensors can create duplicate events, which can cause issues. The goal of this design is to ensure that in active-active HA mode only one polling sensor polls at a given time; another polling sensor starts polling if the first one fails for some reason. Note that there can be multiple polling sensors, of which only one should poll at any given time.

2 Design Considerations/Assumptions

The design assumes that st2sensorcontainer and st2actionrunner are on the same node (i.e. VM) in an HA configuration.

A central broker entity is needed to coordinate multiple instances of polling sensors running on different blueprint boxes. This broker should allow sensors to register themselves with it. In that sense the broker is similar to ZooKeeper.

3 Block Diagram

[Figure: block diagram (image not included)]

3.1 Sensor Arbitrator

Sensor Arbitrator (st2sa) is the “brain” that controls all the HA-enabled polling sensors. It provides the following functionality:

  • Allows HA-enabled polling sensors to register themselves
  • Manages heartbeat messages from all HA-enabled sensors
  • Chooses the “operational” sensor from the list of all registered sensors that perform the same polling function
  • Informs all the sensors of their current status (“operational”, meaning they should poll, or “standby”)
  • Chooses a different operational sensor if the current operational sensor fails. Three consecutive missed heartbeats imply failure.
  • Logs all its operations in enough detail for debugging purposes
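
The responsibilities above can be illustrated with a minimal in-memory sketch. It is not StackStorm code: the class `Arbitrator`, the `group` field (sensors that perform the same polling function), and the `tick()` method are all hypothetical names, and the sketch assumes that registration counts as the heartbeat for the first interval and that the first remaining registered sensor is promoted on failure.

```python
import logging
from dataclasses import dataclass
from typing import Dict

LOG = logging.getLogger("st2sa.sketch")  # hypothetical logger name

MAX_MISSED_HEARTBEATS = 3  # from the design: three consecutive misses imply failure


@dataclass
class SensorRecord:
    sensor_id: str
    group: str                       # sensors in one group perform the same polling function
    missed: int = 0                  # consecutive intervals without a heartbeat
    seen_this_interval: bool = True  # assumption: registration counts as a heartbeat


class Arbitrator:
    """Hypothetical in-memory model of st2sa, for illustration only."""

    def __init__(self) -> None:
        self.sensors: Dict[str, SensorRecord] = {}
        self.operational: Dict[str, str] = {}  # group -> operational sensor_id

    def register(self, sensor_id: str, group: str) -> str:
        self.sensors[sensor_id] = SensorRecord(sensor_id, group)
        # The first sensor registered for a group becomes operational.
        self.operational.setdefault(group, sensor_id)
        LOG.info("registered %s in group %s", sensor_id, group)
        return self.status_of(sensor_id)

    def heartbeat(self, sensor_id: str) -> str:
        # Raises KeyError for an unknown sensor, which must then re-register.
        rec = self.sensors[sensor_id]
        rec.missed = 0
        rec.seen_this_interval = True
        return self.status_of(sensor_id)  # every response carries the status

    def status_of(self, sensor_id: str) -> str:
        rec = self.sensors[sensor_id]
        is_active = self.operational.get(rec.group) == sensor_id
        return "operational" if is_active else "standby"

    def tick(self) -> None:
        """Call once per heartbeat interval to count missed heartbeats."""
        for rec in list(self.sensors.values()):
            if rec.seen_this_interval:
                rec.seen_this_interval = False
                continue
            rec.missed += 1
            if rec.missed >= MAX_MISSED_HEARTBEATS:
                self._fail(rec)

    def _fail(self, rec: SensorRecord) -> None:
        LOG.warning("sensor %s failed after %d misses", rec.sensor_id, rec.missed)
        del self.sensors[rec.sensor_id]
        if self.operational.get(rec.group) != rec.sensor_id:
            return
        candidates = [s.sensor_id for s in self.sensors.values() if s.group == rec.group]
        if candidates:
            self.operational[rec.group] = candidates[0]
            LOG.info("promoted %s in group %s", candidates[0], rec.group)
        else:
            del self.operational[rec.group]
```

In this sketch a standby sensor learns about its promotion from the status returned by its next `heartbeat()` call; a real arbitrator could also push the change asynchronously.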

This is new code development.

3.2 HA Enabled Polling Sensors

The current code for polling sensors will be modified for HA operation as follows. Note that the code will be written such that non-HA mode operation remains exactly the same.

  • The polling sensor will first register with st2sa.
  • Upon successful registration it will start sending periodic heartbeat (keep-alive) messages.
  • The polling sensor will be informed by st2sa of its status. This information can come as a separate message or as part of a keep-alive response. Each keep-alive response will contain the status information.
  • Once the sensor is instructed to poll (status = “operational”) it will start polling. The polling functionality itself does not change.
  • Upon restart the polling sensor will go through the registration process again, meaning it will not start polling until registration is complete and it receives “status=operational” from st2sa.

This is a modification of the existing polling sensor code. The modification should be modular, with minimal to zero impact on existing code. We cannot introduce regressions into existing polling functionality in non-HA mode.

For example, in Python code, HA should be implemented as a separate set of functionality that is invoked only when HA is enabled, and existing code should be changed minimally.
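
One way to keep the HA logic separate is sketched below, under stated assumptions: `HAGate` and `ExamplePollingSensor` are hypothetical names, and `ExamplePollingSensor` is only a stand-in for an existing polling sensor, not the real StackStorm base class. The existing `poll()` body is untouched, and the only new code on the non-HA path is a single `None` check.

```python
from typing import Optional


class HAGate:
    """Hypothetical helper holding the status most recently assigned by st2sa."""

    def __init__(self) -> None:
        self.status = "standby"  # never poll before st2sa says "operational"

    def may_poll(self) -> bool:
        return self.status == "operational"


class ExamplePollingSensor:
    """Stand-in for an existing polling sensor (not the real st2 class)."""

    def __init__(self, ha_gate: Optional[HAGate] = None) -> None:
        self._ha_gate = ha_gate  # None means HA is disabled
        self.polls = 0

    def poll(self) -> None:
        # Existing polling logic stays exactly as it is today.
        self.polls += 1

    def run_once(self) -> None:
        # HA check is skipped entirely when HA is not enabled.
        if self._ha_gate is not None and not self._ha_gate.may_poll():
            return
        self.poll()
```

With no gate the sensor behaves as before; with a gate it polls only while `gate.status` is `"operational"`.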

4 Lock vs. No Lock Tradeoff

The problem can be solved either with or without locking. A locking scheme requires each polling sensor to acquire a lock with some kind of lease timeout. Once the lease times out, the lock has to be reacquired. This scheme is not recommended for the following reasons:

  • Each sensor has to perform locking, which means more logic in the sensors. This “spreads” the logic across hundreds or maybe thousands of sensors. It is easier if this logic is in a central place like st2sa.
  • Locking is inherently difficult to debug in timing situations that happen “at scale”, invariably in larger deployments. I have personally seen several such locking issues, which we could never reproduce in the lab. These issues only happen in the field and take a long time to debug.
  • A locking scheme is difficult to scale.
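
For comparison, the rejected lease-based alternative could look roughly like the sketch below. `LeaseLockStore` is a hypothetical in-memory stand-in for whatever shared store would hold the locks, and the clock is passed in explicitly so the behavior is deterministic. The point is that every sensor would have to call `acquire()` before each poll and renew before the lease expires.

```python
from typing import Dict, Tuple


class LeaseLockStore:
    """Hypothetical shared lock store with lease timeouts (illustration only)."""

    def __init__(self) -> None:
        # lock name -> (holder id, expiry time)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def acquire(self, name: str, holder: str, now: float, ttl: float) -> bool:
        """Acquire or renew the lease; return True if `holder` owns it afterwards."""
        current = self._leases.get(name)
        free = current is None or current[1] <= now  # never taken, or lease expired
        if free or current[0] == holder:
            self._leases[name] = (holder, now + ttl)
            return True
        return False
```

A sensor that stalls past its lease can still be polling when another sensor acquires the lock, which is the kind of timing issue described above.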

Therefore locking is not advisable. When most of the critical logic is in one place (st2sa), it is easier to trace logs and debug issues. In addition, st2sa can later be enhanced with logic that sweeps through all sensors to check their health. Also, if the Controller Box dies and some other box takes over, the heartbeat mechanism allows sensors to register with the new st2sa, and correct polling operation would resume within a short period of time (although the new st2sa may elect a different sensor than the previous one).

5 Timing Diagrams

5.1 Diagram 1

[Figure: SensorHA timing diagram (image not included)]

Explanation of state and status

“state”: maintained internally by the sensor. It cannot be programmed externally, but it can be read by other entities.

“status”: the programmable element of the sensor. It can be programmed by the Arbitrator.

Sensor Registration & Response

  1. As the sensor comes up, it detects HA mode
  2. Sensor sends a Register Sensor message to the Arbitrator
  3. Once it receives Success, it starts sending Keep Alive messages periodically
  4. Each Keep Alive Response contains the Sensor “status” information. Sensor compares the requested status with its current state and acts accordingly.

Asynchronous Status Change by Arbitrator

  5. Arbitrator can decide to send an asynchronous status change message at any time, without waiting for a Keep Alive
  6. This message sends the new status information to the Sensor
  7. Sensor acts on it and changes its state if need be
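
The sensor-side handling of status can be sketched as follows. This is only an illustration: the message layout (a dict with `type` and `status` keys), the message type names, and the state names `"polling"` and `"idle"` are assumptions, since the design does not define them. Both the Keep Alive Response and the asynchronous status change go through the same reconciliation step.

```python
class SensorHAState:
    """Hypothetical sensor-side holder of the internal, read-only "state"."""

    # Assumed mapping from Arbitrator-assigned status to internal state.
    _STATUS_TO_STATE = {"operational": "polling", "standby": "idle"}

    def __init__(self) -> None:
        self._state = "idle"  # do not poll until told to

    @property
    def state(self) -> str:
        # Readable by other entities, but only the sensor itself changes it.
        return self._state

    def handle(self, message: dict) -> bool:
        """Process a message from the Arbitrator; return True if state changed."""
        if message["type"] not in ("keep_alive_response", "status_change"):
            raise ValueError("unexpected message type: %r" % message["type"])
        return self._reconcile(message["status"])

    def _reconcile(self, status: str) -> bool:
        # Compare the requested status with the current state and act on it.
        wanted = self._STATUS_TO_STATE[status]  # KeyError on an unknown status
        if wanted == self._state:
            return False
        self._state = wanted
        return True
```

A repeated Keep Alive Response with an unchanged status is a no-op, so the sensor only starts or stops polling on an actual transition.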

Polling Sensor HA Design Document.docx
