Skip to content

fix: Increase GHA workflow timeout to 35 days from 72h #1067

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 7, 2022
Merged

fix: Increase GHA workflow timeout to 35 days from 72h #1067

merged 5 commits into from
Jul 7, 2022

Conversation

danieljimeneznz
Copy link
Contributor

@danieljimeneznz danieljimeneznz commented Jun 18, 2022

Noticed that there aren't really any tests for this in the runner - can potentially write one, or run cml-runner off this branch if you're happy for the PR to be open 35 days. 😂

Originally mentioned in #1054 (comment) - fixes #1064

@casperdcl casperdcl temporarily deployed to external June 20, 2022 09:08 Inactive
Copy link
Contributor

@casperdcl casperdcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; just pushed a minor change.

wdyt @iterative/cml? Technically this PR is a breaking change for people with hardcoded timeout-minutes: 4320 :/

@danieljimeneznz
Copy link
Contributor Author

lgtm; just pushed a minor change.

wdyt @iterative/cml? Technically this PR is a breaking change for people with hardcoded timeout-minutes: 4320 :/

btw @casperdcl updated the variable name based on a brief discussion in #1054 (comment) as it's probably a little bit of a misnomer - if you think it's clearer, happy for it to be the name in 7ba6e59

@dacbd
Copy link
Contributor

dacbd commented Jun 20, 2022

wdyt @iterative/cml? Technically this PR is a breaking change for people with hardcoded timeout-minutes: 4320 :/

perhaps an update/note in the readme as well as a docs ones to go with this.

@dacbd
Copy link
Contributor

dacbd commented Jun 23, 2022

@danieljimeneznz for transparency we have been discussing this internally as it has the potential to break several user workflows.

We are still considering the best approach but I believe that we are currently thinking of having this as a configurable option or parsing the user's workflow to check for the time value when the cml runner command is invoked.

@danieljimeneznz
Copy link
Contributor Author

@danieljimeneznz for transparency we have been discussing this internally as it has the potential to break several user workflows.

We are still considering the best approach but I believe that we are currently thinking of having this as a configurable option or parsing the user's workflow to check for the time value when the cml runner command is invoked.

Yeah I think that's probably best, when I was looking through the Github docs and saw the timeout value it makes this logic a little redundant. I reckon parsing their timeout at workflow runtime is probably the way to go.

Not sure I agree that it would break people's workflows though - was going to test that theory by setting a workflow to timeout after 5 minutes via the workflow file to see whether it would shutdown after being caught idle for 5 mins.

@dacbd
Copy link
Contributor

dacbd commented Jun 24, 2022

This timeout has nothing to do with being idle. But long runnering workflows took longer than GitHub Actions originally allowed. Some long-running jobs have been built to save intermediate data so they can be resumed by a restart. If we change the restart mechanism to 35-days but they still have the 3-day value hard coded, actions will stop the job, and the restart mechanism won't go off.

@dacbd
Copy link
Contributor

dacbd commented Jun 29, 2022

@danieljimeneznz we want to action this so that your contribution isn't hanging in limbo for too long, do you have any thoughts on my previous comment?

@danieljimeneznz
Copy link
Contributor Author

danieljimeneznz commented Jun 30, 2022

@danieljimeneznz we want to action this so that your contribution isn't hanging in limbo for too long, do you have any thoughts on my previous comment?

Apologies I was away on holiday and have only just been catching up on things.

This timeout has nothing to do with being idle. But long runnering workflows took longer than GitHub Actions originally allowed. Some long-running jobs have been built to save intermediate data so they can be resumed by a restart. If we change the restart mechanism to 35-days but they still have the 3-day value hard coded, actions will stop the job, and the restart mechanism won't go off.

I might have a bit of a simplistic view on this so wanted to run a test to confirm my theory but when GitHub reaches the timeout specified in the workflow file does it not send the runner a command to stop the currently running job? At which point the idle timer would kick in and the runner would decommission within 5 minutes.

I'm not following along with the logic of the 3-day hard coded intermediate value for the restart? Unless you're meaning that the current runner ignores Github action's request to stop the job (as a workaround to support jobs running longer than Github allows)? (which seems a little dangerous).

Feel free to close the PR if it's left dangling for too long - happy to help with the implementation of parsing the workflow timeout value.

@DavidGOrtega
Copy link
Contributor

DavidGOrtega commented Jul 4, 2022

@danieljimeneznz we are going to merge but please can you warn the users that the time limit has been updated?
Within the run function

 winston.warn( 'Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically.');

@danieljimeneznz danieljimeneznz temporarily deployed to external July 5, 2022 09:09 Inactive
@danieljimeneznz
Copy link
Contributor Author

@danieljimeneznz we are going to merge but please can you warn the users that the time limit has been updated? Within the run function

@DavidGOrtega have added the log line.

@danieljimeneznz danieljimeneznz temporarily deployed to external July 5, 2022 09:28 Inactive
@dacbd dacbd requested a review from casperdcl July 5, 2022 15:59
@dacbd dacbd self-requested a review July 5, 2022 15:59
Copy link
Contributor

@DavidGOrtega DavidGOrtega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 ❤️

@DavidGOrtega DavidGOrtega merged commit a2be2b5 into iterative:master Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

runner/github: increase self-hosted timeout to 35days
5 participants