Node Failure Handling, #78

noizu · 2018-05-29T06:59:26Z

Pooler will halt OTP startup if one of a group's members is unavailable while the configuration specifies a non-zero number of init workers. (I am running into this in production, with Riak TS nodes periodically crashing due to GCE NVMe local disk instability.)
Depending on the number of active workers, a node failure can cascade and halt pooler and the OTP tree. I have a cluster doing about a million Riak writes per minute, and saw cascading failures with 2048 connections per node x 6 Riak nodes, duplicated across 5 Elixir servers.
In general, are there any recommended strategies for handling group member failures gracefully? I could, for example, hook up process listeners and automate pool add/remove, but a mechanism that serves fewer connections from a group if it has had a high failure rate recently would be nice, if possible.
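
For illustration, a pooler group configuration of the kind described above might look like the sketch below. The pool names, host names, port, and counts are placeholders and assumptions, not the production values; only the general shape (several pools sharing one group, each with a non-zero init_count) reflects the setup in question.

```erlang
%% sys.config fragment (sketch; all names and numbers are hypothetical)
[
 {pooler, [
   {pools, [
     %% One pool per Riak node; both pools belong to the same group.
     [{name, riak_node_1},
      {group, riak},
      {max_count, 2048},
      %% Non-zero init_count: pooler starts this many members when the
      %% pool is created, so start_mfa must succeed at startup.
      {init_count, 64},
      {start_mfa, {riakc_pb_socket, start_link, ["riak-1.example.internal", 8087]}}],
     [{name, riak_node_2},
      {group, riak},
      {max_count, 2048},
      {init_count, 64},
      {start_mfa, {riakc_pb_socket, start_link, ["riak-2.example.internal", 8087]}}]
   ]}
 ]}
].
```

With a layout like this, every pool in the group tries to open its initial connections at startup, which is where an unreachable Riak node becomes a problem. Setting init_count to 0 would defer connection attempts until members are first requested, which might sidestep the startup dependency, though that is an untested assumption here and does not address failures after startup.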

seriyps · 2023-04-09T00:39:29Z

I'm not sure I fully understand the problem.

> will halt OTP startup if one of a group members is unavailable but configuration specifies non zero init workers

What do you mean by "one of a group members is unavailable"? Is this the case where start_mfa blocks and does not return for a long time?
