Node Failure Handling, #78

Open
noizu opened this issue May 29, 2018 · 1 comment
Comments

noizu commented May 29, 2018

  • Pooler halts OTP startup if one member of a group is unavailable and the configuration specifies a non-zero number of init workers. (I am running into this in production, where Riak TS nodes periodically crash because of GCE NVMe local disk instability.)

  • Depending on the number of active workers, a node failure can cascade and halt pooler and the OTP tree. (I have a cluster doing about a million Riak writes per minute, and saw cascading failures with 2048 connections per node × 6 Riak nodes, duplicated across 5 Elixir servers.)

  • In general, are there any recommended strategies for handling group member failures gracefully? I could hook up process listeners, for example, and automate adding and removing pools, but it would be nice to have some mechanism that serves fewer connections from a group when it has a recent high failure rate.
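
For reference, the first point ties the startup halt to a non-zero init worker count, so one possible stopgap is to set `init_count` to 0 and let members be started on demand. The fragment below is a minimal sketch of such a pool entry and is not a confirmed fix: the pool name `riak_ts_1` and the host `riak-ts-1.example.internal` are hypothetical, the use of `riakc_pb_socket` as the client is an assumption, and `max_count` is taken from the 2048 connections per node mentioned above.

```erlang
%% sys.config fragment (sketch). Pool name and host are hypothetical.
[
 {pooler, [
   {pools, [
     [{name, riak_ts_1},
      {group, riak},
      {max_count, 2048},
      %% Assumption: starting no members at boot avoids blocking OTP
      %% startup when this Riak node is unreachable.
      {init_count, 0},
      {start_mfa,
       {riakc_pb_socket, start_link, ["riak-ts-1.example.internal", 8087]}}]
   ]}
 ]}
].
```

This only addresses boot-time availability. It does not change what happens when a node fails under load, which is the cascading case described in the second point.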

seriyps (Member) commented Apr 9, 2023

Not sure I fully understand the problem.

> will halt OTP startup if one of a group members is unavailable but configuration specifies non zero init workers

What do you mean by "one of a group members is unavailable"? Is this when start_mfa blocks and does not return for a long time?
