Handle server startup time post-crashes with simple retry logic to find connection #32

Open
coltonbh opened this issue Apr 28, 2022 · 0 comments
Labels
enhancement New feature or request

Comments

@coltonbh
Contributor

The TeraChem server is unstable: its master (and only) process frequently crashes, leaving clients without a connection while Docker restarts the server. Occasionally, workers on tcc pick up new tasks and try to submit them to a recently crashed server that hasn't restarted yet, which causes failures for ostensibly good inputs.

I think the cleanest way to solve this is to add some simple retry logic to the `connect` function on the clients, so that it spends maybe 10-30 seconds retrying the initial connection before raising an exception. That way a failed server has a moment to restart, and tasks keep flowing instead of failing in a cascade.
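A minimal sketch of what such retry logic could look like, assuming the client ultimately opens a plain TCP connection. The function name `connect_with_retry` and its parameters are hypothetical and are not the client's actual API; the real `connect` function may be structured differently.

```python
import socket
import time


def connect_with_retry(host, port, timeout=30.0, interval=1.0,
                       connect=socket.create_connection):
    """Retry the initial connection until `timeout` seconds have elapsed.

    `connect` is injectable so the retry loop can be exercised without a
    real server; by default it opens a TCP socket to (host, port).
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return connect((host, port))
        except OSError as exc:  # ConnectionRefusedError is a subclass of OSError
            # Give up if another sleep would run past the deadline.
            if time.monotonic() + interval > deadline:
                raise ConnectionError(
                    f"could not reach {host}:{port} within {timeout}s"
                ) from exc
            time.sleep(interval)
```

With the defaults, a client would keep trying for up to 30 seconds, which matches the upper end of the window proposed above. A fixed interval keeps the sketch simple; exponential backoff would be a reasonable alternative if the restart time varies a lot.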

This will also help avoid the startup race condition in which the worker needs the TeraChem image to start first before it can really accept tasks. (We get this ordering by coincidence right now: the worker image is larger than the TC image, so TC tends to start up first.)

At a higher level, is using the server really worth all the additional overhead of its instabilities?

coltonbh added the enhancement label on Apr 28, 2022