Globus Compute at ALCF -- add timeout to kill jobs #45

davramov · 2024-12-10T00:07:01Z

While getting ready to update the Prefect workers in production with other recent updates, I noticed an issue with ALCF reconstructions: job requests to Globus Compute never returned a response. This blocked new reconstructions from being processed and left a long backlog of unprocessed scans. Fortunately, it did not disrupt other workflows on the server (new_832_file, pruning).

I am still investigating the root cause, but my immediate guess is that the concurrency (set to 10) is too high for the number of nodes our globus-compute-endpoint currently requests (2 nodes).

Some things I think are worth testing:

- Add timeouts to compute tasks that we expect to finish within a known time (e.g., 15 minutes for reconstruction) so they fail sooner.
- Decrease concurrency in the short term to avoid this issue (set it to 2).
- Configure globus-compute-endpoint to request more nodes for scaling (maybe 4).
- Add a globus-compute-endpoint preflight status check on the Prefect worker side.
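
A minimal sketch of the timeout idea, not the actual alcf.py code. It relies on the fact that futures following the standard `concurrent.futures.Future` interface accept `result(timeout=...)`; the helper name `result_or_timeout` and the `ReconstructionTimeout` exception are hypothetical, and a thread pool stands in for the Globus Compute executor so the example runs on its own.

```python
import concurrent.futures
import time


class ReconstructionTimeout(RuntimeError):
    """Raised when a submitted task does not finish within its time budget."""


def result_or_timeout(future, timeout_s):
    """Wait up to timeout_s seconds for a future, then give up loudly.

    Works with any concurrent.futures.Future-compatible object.
    (Hypothetical helper; not part of any library.)
    """
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        # cancel() is best-effort: it only succeeds if the task has not
        # started yet. It does not kill a job that is already running.
        future.cancel()
        raise ReconstructionTimeout(
            f"task did not finish within {timeout_s} s"
        ) from exc


if __name__ == "__main__":
    # Stand-in for a compute executor; a real flow would submit the
    # reconstruction function and use a budget such as 15 * 60 seconds.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        slow = pool.submit(time.sleep, 0.5)
        try:
            result_or_timeout(slow, timeout_s=0.05)
        except ReconstructionTimeout as err:
            print(f"failed fast: {err}")
```

Note that this only stops the caller from waiting forever. Whether the remote job itself is terminated depends on the executor and the scheduler, so a walltime limit on the endpoint side may still be needed.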

davramov · 2024-12-10T00:17:08Z

Looking at the globus-compute-endpoint logs, it looks like there is no compute allocation available to IRIBeta:

stderr:qsub: Request rejected. Reason: No active allocation found for project IRIBeta and resource polaris

I no longer see the allocation on my.alcf.anl.gov, either. Part of my upgrade is to switch to the IRI-ALS-832 allocation I set up on Polaris, so this will be solved soon.

davramov · 2024-12-11T18:31:00Z

I patched the alcf flow agent container on flow-prd to use the IRI-ALS-832 allocation, and set the agent's concurrency to 1 in Prefect. The backlog of reconstructions finished processing overnight, and new scans appear to be reconstructed on demand once again.

I think my other points still stand: the flow should fail quickly (via a timeout), and a preflight check should confirm that the compute endpoint is available before work is submitted. I can incorporate these in my next update to alcf.py.
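
A rough sketch of what such a preflight check could look like. The function names are hypothetical, and the status lookup is passed in as a callable so the example stays self-contained. In a real flow it might be something like `globus_compute_sdk.Client().get_endpoint_status`, and the assumed payload shape (a dict with a `"status"` field equal to `"online"`) should be verified against the SDK version in use.

```python
def endpoint_is_online(get_status, endpoint_id):
    """Return True only if the status lookup reports the endpoint online.

    get_status: callable taking an endpoint ID and returning a dict.
    ASSUMPTION: the dict has a "status" key whose value is "online"
    when the endpoint is reachable.
    """
    try:
        status = get_status(endpoint_id)
    except Exception:
        # Treat any lookup failure as "not available" for preflight purposes.
        return False
    return isinstance(status, dict) and status.get("status") == "online"


def require_endpoint(get_status, endpoint_id):
    """Raise before submitting work if the endpoint is not available."""
    if not endpoint_is_online(get_status, endpoint_id):
        raise RuntimeError(f"compute endpoint {endpoint_id} is not online")


if __name__ == "__main__":
    # Fake lookup standing in for the real client call.
    fake_status = {"ep-up": {"status": "online"}, "ep-down": {"status": "offline"}}
    print(endpoint_is_online(fake_status.__getitem__, "ep-up"))
    print(endpoint_is_online(fake_status.__getitem__, "ep-down"))
```

An online endpoint does not guarantee that a job can be scheduled: the failure in this issue was a qsub rejection for a missing allocation, which a status check alone may not catch. That is why the preflight check complements the timeout rather than replacing it.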
