Handle server startup time post-crashes with simple retry logic to find connection #32

coltonbh · 2022-04-28T02:51:41Z

The TeraChem server is unstable: its master (and only) process crashes frequently, leaving clients without a connection while Docker restarts the server. Occasionally, workers on tcc pick up new tasks and try to submit them to a recently crashed server that hasn't restarted yet, resulting in failures for ostensibly good inputs.

I think the cleanest way to solve this is to add some simple retry logic to the clients' `connect` function: spend maybe 10-30 seconds retrying the initial connection before raising an exception. That way failed servers have a moment to restart, and tasks continue to flow without cascading failures.
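A minimal sketch of what that retry loop might look like, not the actual client code. The helper name `connect_with_retry`, its parameters, and the choice to retry on `OSError` are all assumptions made for illustration; the real `connect` function may raise different exceptions and would need its own defaults.

```python
import time


def connect_with_retry(connect, timeout=30.0, interval=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call `connect()` until it succeeds or `timeout` seconds have elapsed.

    Hypothetical helper: `connect` is any zero-argument callable that opens
    the connection and raises one of `retry_on` on failure. OSError covers
    ConnectionRefusedError and socket timeouts in the standard library.
    """
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except retry_on as exc:
            # Give up once the next wait would run past the deadline.
            if time.monotonic() + interval > deadline:
                raise ConnectionError(
                    f"no connection after {attempt} attempt(s) "
                    f"within {timeout} s"
                ) from exc
            # Give the restarting server a moment before trying again.
            sleep(interval)
```

A client would pass its own connection routine as `connect`. A fixed interval keeps the sketch simple; exponential backoff would be a reasonable alternative if restarts turn out to take longer than expected.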

This will also help circumvent the startup race condition in which the worker needs the TeraChem image to start first before it can really accept tasks. (We get this ordering by coincidence right now: the worker image is larger than the TC image, so TC tends to start up first.)

At a higher level, is using the server really worth all the additional overhead of its instabilities?


coltonbh added the enhancement (New feature or request) label on Apr 28, 2022
