While getting ready to update the Prefect workers in production with other recent updates, I noticed an issue with ALCF reconstructions where job requests to Globus Compute never returned a response. This prevented new reconstructions from being processed, with a long backlog of unprocessed scans. Fortunately, this did not disrupt other workflows on the server (new_832_file, pruning).
I am still investigating the root cause, but my immediate guess is that the concurrency limit (set to 10) is too high for the number of nodes our globus-compute-endpoint currently requests (2).
Some things I think are worth testing:
- Add timeouts to compute tasks with a known expected duration (e.g., 15 minutes for reconstruction) so that stuck tasks fail quickly.
- Decrease the concurrency limit in the short term to avoid this issue (e.g., to 2, matching the node count).
- Configure globus-compute-endpoint to request more nodes for scaling (maybe 4).
- Add a globus-compute-endpoint preflight status check on the Prefect worker side.
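To make the timeout idea concrete, here is a minimal sketch of waiting on a task future with a hard time budget. It uses only the standard library: `ThreadPoolExecutor` stands in for the Globus Compute executor, on the assumption that the futures returned by that executor behave like `concurrent.futures.Future`. The names `result_or_timeout` and `ReconstructionTimeout` are hypothetical and not taken from alcf.py.

```python
import concurrent.futures as cf
import time


class ReconstructionTimeout(RuntimeError):
    """Raised when a compute task does not finish within its time budget."""


def result_or_timeout(future: cf.Future, timeout_s: float):
    """Return the task result, or fail fast if it takes longer than timeout_s."""
    try:
        return future.result(timeout=timeout_s)
    except cf.TimeoutError as exc:
        # Best effort only: cancel() succeeds just for tasks that have not started.
        future.cancel()
        raise ReconstructionTimeout(
            f"no result after {timeout_s} s; giving up"
        ) from exc


if __name__ == "__main__":
    # ThreadPoolExecutor is a local stand-in for the remote executor (assumption).
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        print(result_or_timeout(pool.submit(lambda: "done"), timeout_s=5))
        slow = pool.submit(time.sleep, 1)
        try:
            result_or_timeout(slow, timeout_s=0.1)
        except ReconstructionTimeout as err:
            print(f"failed fast: {err}")
```

In the real flow the budget would be something like `15 * 60` seconds for reconstruction, and raising inside the Prefect task would mark the run as failed instead of leaving it hanging.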
Looking at the globus-compute-endpoint logs, it looks like there is no compute allocation available to IRIBeta:
stderr:qsub: Request rejected. Reason: No active allocation found for project IRIBeta and resource polaris
I no longer see the allocation on my.alcf.anl.gov, either. Part of my upgrade is to switch to the IRI-ALS-832 allocation I set up on Polaris, so this will be solved soon.
I patched the alcf flow agent container on flow-prd to use the IRI-ALS-832 allocation, and set the concurrency for the agent equal to 1 in Prefect. The backlog of reconstructions finished processing overnight, and it appears new scans are being reconstructed on demand once again.
I think my other points still stand that the flow should die quickly (timeout), and that a preflight check should ensure that the compute-endpoint is available. I can incorporate these in my next update to alcf.py.
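A rough sketch of what the preflight check could look like. The status lookup is passed in as a callable so the example runs without the SDK. I believe `Client.get_endpoint_status` in `globus_compute_sdk` can play that role and returns a dict with a `status` field, but treat the exact response shape as an assumption. `endpoint_is_online` is a hypothetical helper name.

```python
from typing import Callable, Mapping


def endpoint_is_online(
    get_status: Callable[[str], Mapping], endpoint_id: str
) -> bool:
    """Return True only if the lookup succeeds and reports 'online'."""
    try:
        status = get_status(endpoint_id)
    except Exception:
        # Treat any lookup failure (network, auth) as "not available".
        return False
    # Assumed response shape: {"status": "online" | "offline", ...}
    return status.get("status") == "online"


if __name__ == "__main__":
    # Fake lookup for demonstration; a real one would query the service.
    def fake_lookup(endpoint_id: str) -> Mapping:
        return {"status": "offline"}

    if not endpoint_is_online(fake_lookup, "hypothetical-endpoint-id"):
        print("endpoint unavailable; skipping submission")
```

One caveat: a check like this would probably not have caught the failure in this issue by itself, since the endpoint process can be up while qsub rejects jobs for a missing allocation. That is why I would pair it with the timeout rather than rely on it alone.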