cml runner aggressively shutting down instance with active job running #1054
Comments
I'm pretty sure this is fixed by #1030.
Tried this branch with the following workflow and package.json:
Workflow:
Package.json:
@danieljimeneznz If you can, open the GCP web console, edit the cml instance, and check this box to prevent the instance from being destroyed:
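For reference, the same setting can also be toggled from the command line instead of the web console. A minimal sketch with gcloud, where the instance name and zone are placeholders:

```bash
# Enable deletion protection on the runner instance so it survives whatever is
# tearing it down, leaving it available for SSH and log inspection.
gcloud compute instances update cml-runner-instance \
  --zone=us-central1-a \
  --deletion-protection

# Turn protection back off once debugging is finished.
gcloud compute instances update cml-runner-instance \
  --zone=us-central1-a \
  --no-deletion-protection
```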
@danieljimeneznz I see. I was thinking that the runner was not honouring the incoming job, but what is probably happening is that your code is hitting OOM. Is the train job not displaying any logs? Remember that the logs in the train job should be accessible. If not, as @dacbd says, keeping the machine alive should be enough to verify what's going on. The runner's logs reside in the installation folder; in your example that was at
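If OOM is the suspicion, one quick check once the machine is kept alive is to look for kernel OOM-killer messages. A minimal sketch using generic Linux commands, not anything CML-specific:

```bash
# Search the kernel log for out-of-memory kills; an OOM-killed training process
# is a common reason a job dies while the runner itself still looks healthy.
sudo dmesg -T | grep -iE "out of memory|oom"

# The same information via the systemd journal, restricted to kernel messages.
sudo journalctl -k | grep -iE "out of memory|oom"
```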
Just to confirm some things,
some helpful commands you can use:
shouldn't be needed but:
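The commands originally posted in this comment are not preserved above; as a rough sketch, inspection along these lines is generally useful on a CML-provisioned GCP instance (the systemd unit names here are assumptions and may differ on your image):

```bash
# See whether the runner is registered as a systemd unit and what state it is in.
systemctl list-units --all | grep -i cml

# Logs from the startup script that bootstraps the runner; the
# google-startup-scripts.service unit ships with standard Google guest images.
sudo journalctl -u google-startup-scripts.service --no-pager

# Everything the instance logged around the time of the shutdown.
sudo journalctl --since "30 minutes ago" --no-pager
```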
Instance shutdown was usually occurring at around 5 minutes, and then the GitHub action/workflow would hang with an infinitely spinning yellow circle. That being said, the strangest thing: I went to do some investigation today to try and determine the cause of the issue, but everything seems to be working fine this morning and, no matter what I do, I can't seem to break it again... 🤷 I tried the following things:
My best guess at the issue is that the runner was hitting 403 authentication errors against the GitHub API, which could mean that the termination logic of the Iterative Terraform provider gets triggered on the GitHub Actions side, causing the runner to shut down. But based on all the changes I tried above, I'm not so sure this was the case. The final workflow we landed on that works quite well is:
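Their workflow file itself is not reproduced here; purely as an illustration, the provisioning step at the heart of such a workflow is a `cml runner` launch along these lines, where every value except the n2-standard-4 machine type mentioned in the issue is a placeholder:

```bash
# Hypothetical, illustrative launch: provisions a GCP instance, registers it as a
# self-hosted runner, and tears it down after the idle timeout expires.
# REPO_TOKEN is expected to hold a personal access token with repo scope.
export REPO_TOKEN="<personal-access-token>"

cml runner \
  --cloud=gcp \
  --cloud-region=us-west1-a \
  --cloud-type=n2-standard-4 \
  --cloud-hdd-size=50 \
  --labels=cml-runner \
  --idle-timeout=300
```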
For completeness' sake, @dacbd, here is the output that my colleague @TessaPhillips was able to capture with the suggestions you made (enabling delete protection + running the job without a container and with
@danieljimeneznz Thanks for the detailed response. The unreliableness and the ~5 minute nature of it make me think this is the latest incarnation of the cursed ghost of #808. This should not happen with the refactored timeout logic in #1030 (note that your first attempt to use the mentioned branch only applied to the invoking cml runner and not the instance's version of cml runner). If it occurs again and you are up to it, you can fork cml and litter the code with logging everywhere; to use a custom branch/repo on the instance use:
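The exact snippet suggested here is not preserved above; as one generic possibility (an assumption, not necessarily the mechanism dacbd had in mind), npm can install cml globally straight from a fork and branch:

```bash
# Hypothetical example: install cml from a fork/branch so instrumented code with
# extra logging runs instead of the published npm release.
# <your-user> and <your-branch> are placeholders.
sudo npm install --global "github:<your-user>/cml#<your-branch>"
```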
with your use of
The
If I see it happen again I'll do a deep dive and try to find the underlying cause! Thanks for the info about the branch/API calls. I had a quick read through of the current code in that PR (#1030), and the logic seems sound enough to me. I wonder if using
Lines 286 to 291 in 9b0a217
^With the logic above, combined with the logs I posted, the
One other thing that I'm curious about though (which isn't necessarily related to this issue): for the following lines, won't a job on GitHub be unable to run for longer than 1 hour? (if you don't provide the
Lines 297 to 322 in 9b0a217
If initialization takes too long then the
All the operations in the
I don't follow; both sets of logs show the first job being picked up within 10 seconds of the GitHub Actions client starting?
If I understand your question correctly, no, that is not the case, but perhaps
Line 17 in 9b0a217
Ah didn't see that the
Ah yep - realized that in #1030 the
Yeah, probably a misnomer - I'm curious where the 72hr timeout came from? The usage limits show 35 days for maximum workflow runtime - did they maybe change this timeout, or does it come from somewhere else? Also, feel free to close this issue; I can open another one and reference this (and the ghost of #808) if it happens again - hopefully with a better investigation/potential solution! 😄
It does appear that they have made changes for self-hosted as well as their provided runners 🙃
@danieljimeneznz feel free to join our Discord if you have more questions not directly related to an issue.
Similar to #808, we have been seeing our GCP VM instance shut down randomly in the first few minutes even though a job is still running (we noticed that the GitHub pull, etc. starts on the runner, so I don't think it's an authentication-based problem from the runner); log below from the VM. We have been trying to get CML working on an n2-standard-4 (slowly beefing up the server), which has 4 vCPUs, 16 GB RAM, a 50 GB HDD, and a 10 Gbps network. Is there any way to debug what might be causing the issues on the VM (i.e. a flag that will stop the aggressive instance destruction so that we can debug this)?
Workflow File:
Output Logs:
GitHub Actions Hanging:
My colleague @TessaPhillips and I have been trying to figure this out to no avail - more than happy to contribute if you can point us in the right direction!