The TeraChem server is unreliable: its master (and only) process frequently crashes, leaving clients without a connection while Docker restarts the server. Occasionally, workers on tcc pick up new tasks and try to submit them to a recently crashed server that hasn't restarted yet, resulting in failures for ostensibly good inputs.
I think the cleanest way to solve this is to add some simple retry logic to the `def connect` function on the clients, so that it spends maybe 10-30 seconds retrying the initial connection before raising an exception. That way a failed server has a moment to restart, and tasks continue to flow without cascading failures.
This will also help circumvent a race condition at startup, where the TeraChem container needs to be running before the worker can really accept tasks. We only get the right ordering by coincidence right now: the worker image is larger than the TC image, so TC tends to start up first.
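As a rough illustration of the retry idea, here is a minimal sketch. It is not the client's actual code: `connect_with_retry`, its parameters, and the `connect_fn` callable are hypothetical names chosen for this example, and the defaults only reflect the 10-30 second window suggested above.

```python
import time


def connect_with_retry(connect_fn, timeout=30.0, delay=1.0, retry_on=(OSError,)):
    """Call connect_fn() until it succeeds or roughly `timeout` seconds pass.

    connect_fn is a hypothetical zero-argument callable that opens the
    connection (for example, a wrapper around the client's connect logic).
    ConnectionRefusedError and friends are subclasses of OSError, so the
    default retry_on covers a server that is still restarting.
    """
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect_fn()
        except retry_on as exc:
            # Stop retrying if another sleep would exceed the retry window.
            if time.monotonic() + delay > deadline:
                raise ConnectionError(
                    f"could not connect after {attempt} attempts in {timeout}s"
                ) from exc
            time.sleep(delay)


# Hypothetical usage with a plain TCP socket (host and port are placeholders):
#   import socket
#   sock = connect_with_retry(
#       lambda: socket.create_connection(("localhost", 11111), timeout=5)
#   )
```

A fixed delay keeps the sketch short; exponential backoff would work just as well within the same window. Because the retry only wraps the initial connection, a job that fails after a connection is established would still surface as an error immediately.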
At a higher level, is using the server really worth all the additional overhead of its instabilities?