edx-platform unit tests error: Self-hosted runner loses communication with the server #32671
Comments
Sharing related issue: actions/actions-runner-controller#466
@rgraber: Is this something you were able to see using your script that gathers errors from GitHub? If so, maybe we could link to some of that information?
There may be good info available via the above issue. This issue may have details on how to query for this: int128/datadog-actions-metrics#445. It would be interesting to compare it to Becca's script.
Notes:
Info for the failure screenshotted in the ticket:
Searching that pod's logs for lines containing "job" shows that the runner was actually active at about that time, but exited and restarted shortly after, so it probably was assigned the job but never picked it up or something. (I'm hazy on how work assignment is done with these.) There could also be some time skew. But the runner did claim to exit normally, which is surprising. Searching for references to the pod around that time doesn't turn up much: the autoscaler mentions it a few minutes earlier as a pod that isn't a candidate for scale-down, and the controller doesn't have any nearby messages about errors or unreachability. And that's about it.
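As an aside (not from the original thread), this is roughly the kind of pod-log search that can be scripted with the Kubernetes Python client. It is only a sketch: the pod name, namespace, and time window are placeholders.

```python
# Hypothetical sketch: pull a runner pod's logs and keep only the lines that
# mention "job". Pod name, namespace, and time window are placeholders.
from kubernetes import client, config

config.load_kube_config()                      # or config.load_incluster_config()
v1 = client.CoreV1Api()

log_text = v1.read_namespaced_pod_log(
    name="runner-pod-abc123",                  # hypothetical pod name
    namespace="github-actions-runners",        # hypothetical namespace
    timestamps=True,
    since_seconds=3600,                        # roughly the failure window
    # previous=True would read logs from before a container restart
)

for line in log_text.splitlines():
    if "job" in line.lower():
        print(line)
```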
I wanted to know how often we've had CD failures due to this, so I wrote some shell scripts to call the GoCD API and query the job logs. From that I was able to build a histogram of cancelled checks by month (a rough sketch of the general approach is included below).
Update 2023-08-28: See later comments on problems with this method.
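The thread doesn't include the actual scripts, but the snippet below is a rough sketch of this kind of query: walk the GoCD pipeline history API and count cancelled jobs per month. The server URL, credentials, pipeline name, and response field names are all assumptions; check your GoCD server's API documentation for the exact endpoint version and payload shape.

```python
# Rough sketch (not the original script): count cancelled GoCD jobs per month.
# Server URL, credentials, pipeline name, and response fields are assumptions.
import collections
from datetime import datetime, timezone

import requests

GOCD_SERVER = "https://gocd.example.com"      # assumed server URL
PIPELINE = "edx-platform-deploy"              # hypothetical pipeline name
AUTH = ("username", "access-token")           # assumed credentials

resp = requests.get(
    f"{GOCD_SERVER}/go/api/pipelines/{PIPELINE}/history",
    headers={"Accept": "application/vnd.go.cd.v1+json"},
    auth=AUTH,
)
resp.raise_for_status()

histogram = collections.Counter()
for run in resp.json().get("pipelines", []):
    for stage in run.get("stages", []):
        for job in stage.get("jobs", []):
            if job.get("result") == "Cancelled":
                # scheduled_date is assumed to be an epoch-millisecond timestamp
                month = datetime.fromtimestamp(
                    job["scheduled_date"] / 1000, tz=timezone.utc
                ).strftime("%Y-%m")
                histogram[month] += 1

for month, count in sorted(histogram.items()):
    print(f"{month}  {'#' * count}")
```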
Given that this seems to have died down somewhat, we may want to just close this for now.
Possible further actions:
It turns out that looking at the GoCD logs alone isn't diagnostic, since a job that failed due to lost communication with the server is simply marked as cancelled, which doesn't distinguish it from other cancellations.
I also tried looking at the GitHub Actions errors using our script in edx-arch-experiments, but it doesn't help: that script currently only looks at the most recent attempt of any Action, and we almost always end up re-running CI checks when they fail on master. So that script would need to be modified if we wanted to use it to detect this (a sketch of one way to pull jobs from earlier attempts is below).
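For reference (not part of the original script), here is a minimal sketch of how such a modification could work, using the "List jobs for a workflow run attempt" REST endpoint to examine every attempt of a run instead of only the latest one. The repository, run ID, and token below are placeholders.

```python
# Hypothetical sketch: fetch jobs from *every* attempt of a workflow run, not
# just the latest, so that re-run checks don't hide the original failure.
import requests

GITHUB_API = "https://api.github.com"
REPO = "openedx/edx-platform"
TOKEN = "ghp_..."                               # placeholder access token
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
}

def jobs_for_all_attempts(run_id):
    """Yield (attempt_number, job) for every attempt of a workflow run."""
    run = requests.get(
        f"{GITHUB_API}/repos/{REPO}/actions/runs/{run_id}", headers=HEADERS
    ).json()
    for attempt in range(1, run.get("run_attempt", 1) + 1):
        resp = requests.get(
            f"{GITHUB_API}/repos/{REPO}/actions/runs/{run_id}/attempts/{attempt}/jobs",
            headers=HEADERS,
        ).json()
        for job in resp.get("jobs", []):
            yield attempt, job

# Example: show failed jobs from earlier attempts that a later re-run would hide.
for attempt, job in jobs_for_all_attempts(1234567890):   # hypothetical run id
    if job.get("conclusion") == "failure":
        print(attempt, job["name"], job["conclusion"])
```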
I agree that it would be really useful to be able to get the error messages from previous attempts, just in general for our ability to look at error frequency. But I think the GoCD errors are enough to show that this error isn't occurring as much as it was, so I'm not sure it's worth it for this ticket specifically.
Maybe we should file a ticket for the script fix (which will probably go on the backlog), link it to this ticket, and icebox this one?
Created edx/edx-arch-experiments#437 for grooming; will icebox this one.
A/C:
Since roughly June 2023 we've been seeing a lot of CI failures on GitHub for edx-platform where the job is just shown as "Job failed" with no logs. But if you check the Summary view (link in the upper left) and look at the errors at the bottom of the page, an error shows up saying that the self-hosted runner lost communication with the server.
These have been causing development and deployment delays. We'll need to check if there's something wrong with the runners.