Enabling variables to control job batch limits #103

Open
wants to merge 1 commit into
base: master

mramireztgtg

We've been encountering recurring issues with jobs in some AWS Batch queues getting stuck in the RUNNABLE state. This seems to be caused by resource availability constraints or misconfigurations in our workflows. To address this, I'm implementing a job state time limit for these jobs in the metaflow-computation module. According to the AWS documentation, the job queue resource can be extended to support this.
Here's the approach:

  • Introduce an optional variable that configures the timeout and defines the action to take when a job exceeds the allowed state duration.

The benefits of this change are clear:

  • It prevents jobs from running excessively long (in our case, we had a run last 40+ hours over a weekend), which has been blocking new executions and impacting production workloads.
  • It addresses an issue several teams have mentioned in Slack threads, making it a valuable improvement for our broader community.
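
For illustration, the optional variable described above could be wired into the job queue roughly as follows. This is only a sketch, not the actual diff in this PR: the variable names `batch_job_state_time_limit` and `compute_environment_arn`, the resource label, and the queue name are hypothetical, and it assumes an AWS provider version that supports the `job_state_time_limit_action` and `compute_environment_order` blocks on `aws_batch_job_queue`.

```hcl
# Hypothetical variable name. The default of null leaves the queue unchanged,
# so existing users of the module are not affected.
variable "batch_job_state_time_limit" {
  description = "Optional limit on how long a job may stay in RUNNABLE before the action is taken."
  type = object({
    action           = string # AWS Batch currently documents "CANCEL"
    max_time_seconds = number # how long a job may sit in RUNNABLE
    reason           = string # recorded on the job when the action fires
  })
  default = null
}

# Hypothetical input, used only to keep this sketch self-contained.
variable "compute_environment_arn" {
  type = string
}

resource "aws_batch_job_queue" "example" {
  name     = "example-queue" # placeholder name
  state    = "ENABLED"
  priority = 1

  compute_environment_order {
    order               = 1
    compute_environment = var.compute_environment_arn
  }

  # Emit the block only when the variable is set.
  dynamic "job_state_time_limit_action" {
    for_each = var.batch_job_state_time_limit == null ? [] : [var.batch_job_state_time_limit]
    content {
      state            = "RUNNABLE"
      action           = job_state_time_limit_action.value.action
      max_time_seconds = job_state_time_limit_action.value.max_time_seconds
      reason           = job_state_time_limit_action.value.reason
    }
  }
}
```

With a value such as `{ action = "CANCEL", max_time_seconds = 3600, reason = "Stuck in RUNNABLE for over an hour" }`, a job waiting longer than an hour would be cancelled instead of blocking the queue. The one-hour figure is an example only; the right limit depends on how long capacity normally takes to become available.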
