Globus Compute at ALCF -- add timeout to kill jobs #45

Open
davramov opened this issue Dec 10, 2024 · 2 comments

@davramov
Contributor

While preparing to roll out recent updates to the Prefect workers in production, I noticed a problem with ALCF reconstructions: job requests to Globus Compute never returned a response. This blocked new reconstructions from being processed and left a long backlog of unprocessed scans. Fortunately, it did not disrupt other workflows on the server (new_832_file, pruning).

I am still investigating the root cause, but my first guess is that the concurrency (set to 10) is too high for the 2 nodes our globus-compute-endpoint currently requests.

Some things I think are worth testing:

  • Add timeouts to compute tasks whose expected duration we know (e.g., 15 minutes for reconstruction) so that stuck tasks fail quickly.
  • Decrease concurrency in the short term to avoid this issue (set it to 2).
  • Configure globus-compute-endpoint to request more nodes for scaling (maybe 4).
  • Add a globus-compute-endpoint preflight status check on the Prefect worker side.
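
For the timeout idea, here is a minimal sketch of what I have in mind. It is written against the standard-library `concurrent.futures.Future` interface, on the assumption that the futures returned by the Globus Compute executor behave the same way for `result(timeout=...)`; I have not verified that against our flow yet. The names `result_or_timeout`, `TaskTimedOut`, and `RECONSTRUCTION_TIMEOUT_S` are hypothetical and do not exist in alcf.py.

```python
import concurrent.futures
import time

# 15 minutes, the expected upper bound for a reconstruction (from the list above).
RECONSTRUCTION_TIMEOUT_S = 15 * 60


class TaskTimedOut(RuntimeError):
    """Raised when a task does not return within its time budget."""


def result_or_timeout(future, timeout_s):
    """Return future.result(), or raise TaskTimedOut after timeout_s seconds."""
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        # cancel() only succeeds if the task has not started running; a job
        # that is already running remotely would still need separate cleanup.
        future.cancel()
        raise TaskTimedOut(f"no result after {timeout_s} s") from exc


if __name__ == "__main__":
    # Local stand-in for the remote executor, just to show the behavior.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        print(result_or_timeout(pool.submit(lambda: "done"), timeout_s=5))
        try:
            result_or_timeout(pool.submit(time.sleep, 0.5), timeout_s=0.05)
        except TaskTimedOut as err:
            print(f"failed fast: {err}")
```

Raising a dedicated exception lets the Prefect flow mark the run as failed instead of waiting indefinitely on a response that never arrives.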
@davramov
Contributor Author

The globus-compute-endpoint logs show that the IRIBeta project no longer has an active compute allocation:

stderr:qsub: Request rejected. Reason: No active allocation found for project IRIBeta and resource polaris

I no longer see the allocation on my.alcf.anl.gov, either. Part of my upgrade is to switch to the IRI-ALS-832 allocation I set up on Polaris, so this will be resolved soon.
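
Since this failure only showed up in the endpoint logs, a small scan for the rejection message could surface it earlier. The sketch below matches the exact wording of the line quoted above; the function name `find_allocation_rejection` is hypothetical, and I am assuming the scheduler keeps this message format.

```python
import re

# Pattern based on the rejection line seen in the endpoint logs above.
QSUB_REJECTION = re.compile(
    r"qsub: Request rejected\.\s+Reason: No active allocation found "
    r"for project (?P<project>\S+) and resource (?P<resource>\S+)"
)


def find_allocation_rejection(log_text):
    """Return (project, resource) for the first rejection found, else None."""
    match = QSUB_REJECTION.search(log_text)
    if match is None:
        return None
    return match.group("project"), match.group("resource")


if __name__ == "__main__":
    line = (
        "stderr:qsub: Request rejected. Reason: No active allocation "
        "found for project IRIBeta and resource polaris"
    )
    print(find_allocation_rejection(line))
```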

@davramov
Contributor Author

I patched the alcf flow agent container on flow-prd to use the IRI-ALS-832 allocation and set the agent's concurrency to 1 in Prefect. The backlog of reconstructions finished processing overnight, and new scans appear to be reconstructed on demand once again.

My other points still stand: the flow should fail quickly (via a timeout), and a preflight check should confirm that the compute endpoint is available before work is submitted. I can incorporate both in my next update to alcf.py.
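
A rough sketch of the preflight check, under some assumptions: the status lookup is passed in as a callable so it can be tested without network access, and I am assuming it would wrap something like `Client().get_endpoint_status(endpoint_id)` from `globus_compute_sdk`, returning a mapping with a `"status"` key whose healthy value is `"online"`. That return shape needs to be confirmed against the SDK docs. `preflight_check` and `EndpointUnavailable` are hypothetical names.

```python
from typing import Any, Callable, Mapping


class EndpointUnavailable(RuntimeError):
    """Raised when the compute endpoint does not report itself as online."""


def preflight_check(
    endpoint_id: str,
    get_status: Callable[[str], Mapping[str, Any]],
) -> None:
    """Raise EndpointUnavailable unless the endpoint reports status 'online'."""
    # Assumption: the status mapping carries a "status" key ("online" when healthy).
    status = get_status(endpoint_id).get("status")
    if status != "online":
        raise EndpointUnavailable(
            f"endpoint {endpoint_id} reported status {status!r}"
        )


if __name__ == "__main__":
    # Stand-in for the real status lookup, for illustration only.
    def fake_status(_endpoint_id: str) -> Mapping[str, Any]:
        return {"status": "offline"}

    try:
        preflight_check("example-endpoint-id", fake_status)
    except EndpointUnavailable as err:
        print(f"skipping submission: {err}")
```

An endpoint may report itself as online and still fail to obtain nodes, as may have happened with the expired allocation, so this check would complement the timeout rather than replace it.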
