A/C
We are able to find out how often a given action failure occurs via something automated.
If the "lost communication" error happens regularly, we have a new issue created to address it.
Implementation notes:
The "something automated" might be a script someone can run, a dashboard, or something else entirely; the choice is left to whoever picks this up. A rough sketch of one possible script follows these notes.
If this turns out to be really hard, timebox it to a day or two.
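As a starting point, here is a minimal sketch of what the "something automated" could look like: a Python script that uses the GitHub REST API to count failed workflow runs over a date window, grouped by workflow. The repository name, the date range, the GITHUB_TOKEN environment variable, and the idea of using failed-run counts as a first proxy for this failure are all assumptions, not decisions made in this ticket; classifying which failures are actually the lost-communication error is sketched further below.

```python
"""Rough sketch: count failed GitHub Actions runs for a repo over a date range.

Assumptions (not confirmed by this ticket): the repo is openedx/edx-platform,
a token is available in GITHUB_TOKEN, and counting failed runs is an
acceptable first proxy for "how often a given action failure occurs".
"""
import os
from collections import Counter

import requests

API = "https://api.github.com"
REPO = "openedx/edx-platform"  # assumption: the repo this ticket is about
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}


def failed_runs(created="2023-01-01..2023-03-31"):
    """Yield failed workflow runs created in the given date range."""
    page = 1
    while True:
        resp = requests.get(
            f"{API}/repos/{REPO}/actions/runs",
            headers=HEADERS,
            params={"status": "failure", "created": created,
                    "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        runs = resp.json()["workflow_runs"]
        if not runs:
            return
        yield from runs
        page += 1


if __name__ == "__main__":
    # Tally failures per workflow name so a spike in any one workflow stands out.
    counts = Counter(run["name"] for run in failed_runs())
    for workflow, count in counts.most_common():
        print(f"{count:4d}  {workflow}")
```

Run regularly (or wired into a dashboard), this would at least show whether failures are trending up, which is the "happening more often than it used to" question below.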
Notes from the original creation of this ticket:
GitHub Actions jobs are occasionally failing with:
The self-hosted runner: edx-platform-openedx-ci-runner-deployment-7xdl7-zf8dj lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
This will cause final_checks_before_prod to fail.
Occurrences:
One possible use of this ticket would simply be to determine whether this is indeed happening more often than it used to.
Possible avenues of exploration:
Are the GitHub runners resource-starved?
We may be able to use the GitHub API to get the status of jobs, then the command line to pull the logs (if this error shows up in the logs); a sketch of that approach follows this list.
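Building on the run-counting sketch above, here is a hedged example of that second avenue: list the jobs for a failed run via the API, then download each failed job's log and search it for the "lost communication" text. Whether the message actually appears in the downloadable logs, rather than only as a job annotation, is exactly the open question noted above, so this only works under that assumption; `classify_run`, the repo name, and the example run id are illustrative, not an agreed-upon design.

```python
"""Rough sketch: check whether a failed run's job logs contain the
"lost communication with the server" error. Assumes the message shows up in
the downloadable job logs, which this ticket has not yet confirmed.
"""
import os

import requests

API = "https://api.github.com"
REPO = "openedx/edx-platform"  # assumption, as in the sketch above
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}
ERROR_TEXT = "lost communication with the server"


def classify_run(run_id):
    """Return the names of failed jobs in a run whose logs mention the error."""
    jobs = requests.get(
        f"{API}/repos/{REPO}/actions/runs/{run_id}/jobs",
        headers=HEADERS,
        params={"per_page": 100},
    )
    jobs.raise_for_status()

    hits = []
    for job in jobs.json()["jobs"]:
        if job["conclusion"] != "failure":
            continue
        # The logs endpoint redirects to a plain-text log file for the job.
        log = requests.get(
            f"{API}/repos/{REPO}/actions/jobs/{job['id']}/logs",
            headers=HEADERS,
        )
        if log.ok and ERROR_TEXT in log.text:
            hits.append(job["name"])
    return hits


if __name__ == "__main__":
    # Hypothetical run id; in practice, feed in ids from the failed-run listing above.
    print(classify_run(1234567890))
```

If the text turns out not to be in the logs, the same loop could be pointed at job annotations instead; either way, combining this with the run counter above would give the per-error frequency the A/C asks for.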
I started discovery on some dashboard tools for GitHub Actions (and other CI services) that might have some out-of-the-box functionality for classifying errors that could help here: openedx/public-engineering#168.