✅ During rolling maintenance and/or planned cluster resizing, the nodes' state and count will be changing.
Operators can mute alerts described below during routine maintenance procedures to avoid unnecessary distractions.
Monitor the cluster health for early signs of instability.
liveness.heartbeatlatency
If this metric exceeds 1 second, it is a sign of instability. The recommended alert rule: warning if it exceeds 0.5 seconds, critical if it exceeds 3 seconds.
Tier | Definition |
---|---|
WARNING | liveness.heartbeatlatency > 0.5 seconds |
CRITICAL | liveness.heartbeatlatency > 3 seconds |
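The recommended rule for liveness.heartbeatlatency can be sketched as a small classifier. This is a minimal illustration, not a definitive implementation: the function name `heartbeat_latency_tier` and the constants are hypothetical, and the sketch assumes the sample has already been converted to seconds (the raw metric may be exported in a different unit).

```python
from typing import Optional

# Thresholds taken from the recommended alert rule in this section.
WARNING_THRESHOLD_SECS = 0.5
CRITICAL_THRESHOLD_SECS = 3.0


def heartbeat_latency_tier(latency_secs: float) -> Optional[str]:
    """Map one heartbeat latency sample (in seconds) to an alert tier.

    Returns "CRITICAL", "WARNING", or None when no alert applies.
    Hypothetical helper; assumes the caller converts the metric to seconds.
    """
    if latency_secs > CRITICAL_THRESHOLD_SECS:
        return "CRITICAL"
    if latency_secs > WARNING_THRESHOLD_SECS:
        return "WARNING"
    return None
```

Checking the critical threshold first keeps the tiers mutually exclusive, so a single sample never raises both alerts.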
< TODO >
Live node count change
The liveness checks reported by a node are inconsistent with the rest of the cluster.
Inconsistent Liveness check
Tier | Definition |
---|---|
WARNING | max(liveness.livenodes) - min(liveness.livenodes) across the cluster > 0 for 2 minutes |
CRITICAL | max(liveness.livenodes) - min(liveness.livenodes) across the cluster > 0 for 5 minutes |
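The inconsistent liveness rule compares the liveness.livenodes value reported by each node and alerts when the spread stays above zero for a sustained period. The sketch below illustrates that logic under stated assumptions: the function names, the sample format (a time-ordered list of `(timestamp_secs, {node_id: livenodes})` pairs), and the choice to reset the timer as soon as the nodes agree again are all hypothetical and not taken from any monitoring tool.

```python
from typing import Dict, List, Optional, Tuple

# Durations taken from the tier table above.
WARNING_AFTER_SECS = 2 * 60
CRITICAL_AFTER_SECS = 5 * 60

# One scrape: (timestamp in seconds, {node id: reported liveness.livenodes}).
Sample = Tuple[float, Dict[str, int]]


def liveness_spread(live_nodes_by_node: Dict[str, int]) -> int:
    """Return max - min of liveness.livenodes across all reporting nodes."""
    if not live_nodes_by_node:
        return 0  # no reports, nothing to compare
    values = live_nodes_by_node.values()
    return max(values) - min(values)


def inconsistent_liveness_tier(samples: List[Sample]) -> Optional[str]:
    """Return the alert tier in effect at the most recent sample.

    Assumes samples are ordered by timestamp. The inconsistency timer
    restarts whenever all nodes report the same live node count.
    """
    inconsistent_since: Optional[float] = None
    tier: Optional[str] = None
    for timestamp, by_node in samples:
        if liveness_spread(by_node) > 0:
            if inconsistent_since is None:
                inconsistent_since = timestamp
            duration = timestamp - inconsistent_since
            if duration >= CRITICAL_AFTER_SECS:
                tier = "CRITICAL"
            elif duration >= WARNING_AFTER_SECS:
                tier = "WARNING"
            else:
                tier = None
        else:
            inconsistent_since = None
            tier = None
    return tier
```

For example, three nodes reporting `{"n1": 3, "n2": 2, "n3": 3}` have a spread of 1; if that persists across scrapes spanning two minutes, the function reports the WARNING tier.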
The actual response varies depending on the alert tier, i.e. the severity of potential consequences.
-
Check ....
Regular maintenance (upgrade, rehydrate, ...) also triggers these alerts, so mute or otherwise control your monitoring alarms during maintenance.
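One way to control alarms during maintenance is to suppress notifications that fire inside a declared maintenance window. The following is a minimal sketch of that idea, assuming the operator records each window's start and end time; `MaintenanceWindow` and `should_notify` are hypothetical names, and most monitoring systems provide their own silencing mechanism that would normally be used instead.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned maintenance period; start is inclusive, end is exclusive."""
    start: datetime
    end: datetime

    def covers(self, when: datetime) -> bool:
        return self.start <= when < self.end


def should_notify(alert_time: datetime,
                  windows: Iterable[MaintenanceWindow]) -> bool:
    """Return False when the alert fired inside any maintenance window."""
    return not any(window.covers(alert_time) for window in windows)


# Usage: a two-hour rolling upgrade window (illustrative times only).
upgrade = MaintenanceWindow(
    start=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
)
during = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)
after = datetime(2024, 1, 10, 4, 30, tzinfo=timezone.utc)
```

With these values, `should_notify(during, [upgrade])` is `False` and `should_notify(after, [upgrade])` is `True`, so alerts resume as soon as the window closes.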