edx-platform unit tests error: Self-hosted runner loses communication with the server #32671
Comments
Sharing related issue: actions/actions-runner-controller#466
@rgraber: Is this something you were able to see using your script that gathers errors from GitHub? If so, maybe we could link to some of that information?
There may be good info available via the above issue. This issue may have details on how to query for this: int128/datadog-actions-metrics#445. It would be interesting to compare it to Becca's script.
Notes:
Info for the failure screenshotted in the ticket:
Searching that pod's logs for lines containing "job" shows that the runner was actually active at about that time, but exited and restarted shortly after, so it probably was assigned the job but never picked it up or something. (I'm hazy on how work assignment is done with these.) There could also be some time skew. But the runner did claim to exit normally, which is surprising. Searching for references to the pod around that time doesn't turn up much: the autoscaler mentions it a few minutes earlier as a pod that isn't a candidate for scale-down, and the controller doesn't have any nearby messages about errors or unreachability. And that's about it.
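As an aside (not from the original thread), this is roughly the kind of pod-log search that can be scripted with the Kubernetes Python client. It is only a sketch: the pod name, namespace, and time window are placeholders.

```python
# Hypothetical sketch: pull a runner pod's logs and keep only the lines that
# mention "job". Pod name, namespace, and time window are placeholders.
from kubernetes import client, config

config.load_kube_config()                      # or config.load_incluster_config()
v1 = client.CoreV1Api()

log_text = v1.read_namespaced_pod_log(
    name="runner-pod-abc123",                  # hypothetical pod name
    namespace="github-actions-runners",        # hypothetical namespace
    timestamps=True,
    since_seconds=3600,                        # roughly the failure window
    # previous=True would read logs from before a container restart
)

for line in log_text.splitlines():
    if "job" in line.lower():
        print(line)
```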
I wanted to know how often we've had CD failures due to this, so I wrote some shell scripts to call the GoCD API and query the job logs. From that I was able to build a histogram of cancelled checks by month (a rough sketch of the general approach is included below).
Update 2023-08-28: See later comments on problems with this method.
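The thread doesn't include the actual scripts, but the snippet below is a rough sketch of this kind of query: walk the GoCD pipeline history API and count cancelled jobs per month. The server URL, credentials, pipeline name, and response field names are all assumptions; check your GoCD server's API documentation for the exact endpoint version and payload shape.

```python
# Rough sketch (not the original script): count cancelled GoCD jobs per month.
# Server URL, credentials, pipeline name, and response fields are assumptions.
import collections
from datetime import datetime, timezone

import requests

GOCD_SERVER = "https://gocd.example.com"      # assumed server URL
PIPELINE = "edx-platform-deploy"              # hypothetical pipeline name
AUTH = ("username", "access-token")           # assumed credentials

resp = requests.get(
    f"{GOCD_SERVER}/go/api/pipelines/{PIPELINE}/history",
    headers={"Accept": "application/vnd.go.cd.v1+json"},
    auth=AUTH,
)
resp.raise_for_status()

histogram = collections.Counter()
for run in resp.json().get("pipelines", []):
    for stage in run.get("stages", []):
        for job in stage.get("jobs", []):
            if job.get("result") == "Cancelled":
                # scheduled_date is assumed to be an epoch-millisecond timestamp
                month = datetime.fromtimestamp(
                    job["scheduled_date"] / 1000, tz=timezone.utc
                ).strftime("%Y-%m")
                histogram[month] += 1

for month, count in sorted(histogram.items()):
    print(f"{month}  {'#' * count}")
```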
Given that this seems to have died down somewhat, we may want to just close this for now.
Possible further actions:
It turns out that looking at the GoCD logs alone isn't diagnostic, since a job that failed due to lost communication with the server is simply marked as cancelled, which doesn't distinguish it from other cancellations.
I also tried looking at the GitHub Actions errors using our script in edx-arch-experiments, but it doesn't help: that script currently only looks at the most recent attempt of any Action, and we almost always end up re-running CI checks when they fail on master. So that script would need to be modified if we wanted to use it to detect this (a sketch of one way to pull jobs from earlier attempts is below).
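For reference (not part of the original script), here is a minimal sketch of how such a modification could work, using the "List jobs for a workflow run attempt" REST endpoint to examine every attempt of a run instead of only the latest one. The repository, run ID, and token below are placeholders.

```python
# Hypothetical sketch: fetch jobs from *every* attempt of a workflow run, not
# just the latest, so that re-run checks don't hide the original failure.
import requests

GITHUB_API = "https://api.github.com"
REPO = "openedx/edx-platform"
TOKEN = "ghp_..."                               # placeholder access token
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
}

def jobs_for_all_attempts(run_id):
    """Yield (attempt_number, job) for every attempt of a workflow run."""
    run = requests.get(
        f"{GITHUB_API}/repos/{REPO}/actions/runs/{run_id}", headers=HEADERS
    ).json()
    for attempt in range(1, run.get("run_attempt", 1) + 1):
        resp = requests.get(
            f"{GITHUB_API}/repos/{REPO}/actions/runs/{run_id}/attempts/{attempt}/jobs",
            headers=HEADERS,
        ).json()
        for job in resp.get("jobs", []):
            yield attempt, job

# Example: show failed jobs from earlier attempts that a later re-run would hide.
for attempt, job in jobs_for_all_attempts(1234567890):   # hypothetical run id
    if job.get("conclusion") == "failure":
        print(attempt, job["name"], job["conclusion"])
```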
I agree that it would be really useful to be able to get the error messages from previous attempts, just in general for our ability to look at error frequency. But I think the GoCD errors are enough to show that this error isn't occurring as much as it was, so I'm not sure it's worth it for this ticket specifically.
Maybe we should file a ticket for the script fix (which will probably go on the backlog), link it to this ticket, and icebox this one?
Created edx/edx-arch-experiments#437 for grooming; will icebox this one.
A/C:
Since roughly June 2023 we've been seeing a lot of CI failures on GitHub for edx-platform where the job is just shown as "Job failed" with no logs. But if you check the Summary view (link in the upper left) and look at the errors at the bottom of the page, an error shows up saying that the self-hosted runner lost communication with the server.
These have been causing development and deployment delays. We'll need to check if there's something wrong with the runners.