
edx-platform unit tests error: Self-hosted runner loses communication with the server #32671

Open
timmc-edx opened this issue Jul 6, 2023 · 13 comments

timmc-edx commented Jul 6, 2023

A/C:

  • We no longer get this error
  • If fixing the error is prohibitively difficult, raise again for discussion of alternative approaches

Since roughly June 2023 we've been seeing a lot of CI failures on GitHub for edx-platform where the job is just shown as "Job failed" with no logs. But if you check the Summary view (link in the upper left) and look at the errors at the bottom of the page, the following error shows up:

The self-hosted runner: edx-platform-openedx-ci-runner-deployment-7xdl7-twcfc lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

These have been causing development and deployment delays. We'll need to check if there's something wrong with the runners.

[Two screenshots: the failed CI job and the lost-communication error shown in the workflow Summary view]

@timmc-edx timmc-edx converted this from a draft issue Jul 6, 2023

robrap commented Jul 6, 2023

Sharing related issue: actions/actions-runner-controller#466

@davidjoy davidjoy moved this to Prioritized in Arch-BOM Jul 10, 2023
@jmbowman jmbowman moved this from Prioritized to Groomed in Arch-BOM Jul 12, 2023
@jmbowman jmbowman moved this from Groomed to On-Call in Arch-BOM Jul 20, 2023

robrap commented Aug 15, 2023

@rgraber: Is this something you were able to see using your script that gathers errors from GitHub? If so, maybe we could link to some of that information?


robrap commented Aug 15, 2023

There may be good info available via the above issue. This issue may have details on how to query for this: int128/datadog-actions-metrics#445. It would be interesting to compare it to Becca's script.

@timmc-edx timmc-edx moved this from On-Call to In Progress in Arch-BOM Aug 16, 2023
@timmc-edx timmc-edx self-assigned this Aug 16, 2023

timmc-edx commented Aug 17, 2023

Notes:

  • These runners are not the github-actions-runner pods in our regular tools EKS cluster, but rather the edx-platform-openedx-ci-runner-deployment pods in the -automations cluster.
  • I don't think the logs in that cluster are being ingested into Splunk. Filed Ingest tools-automations EKS logs into Splunk edx/edx-arch-experiments#407 for that.
  • We do appear to have the logs in New Relic, under "cluster_name":"tools-edx-eks-automations" "namespace_name":"actions-runner-system-openedx". Not sure yet if they're useful; a sketch of how they might be queried programmatically follows this list.
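
If those logs do turn out to be useful, they should also be retrievable outside the UI. This is only a rough sketch of a NerdGraph NRQL query: the Log attribute names are taken from the search above, while the account id is a placeholder and the exact NRQL would need adjusting.

#!/usr/bin/env bash
# Sketch only: pull the runner logs back out of New Relic via the NerdGraph API.
# Requires a user API key in NEW_RELIC_API_KEY; 1234567 is a placeholder account id.
set -euo pipefail

NRQL="SELECT timestamp, message FROM Log WHERE cluster_name = 'tools-edx-eks-automations' AND namespace_name = 'actions-runner-system-openedx' SINCE 1 day ago LIMIT MAX"

# Build the GraphQL request body with jq so the NRQL string is escaped correctly,
# then POST it to the NerdGraph endpoint.
jq -n --arg nrql "$NRQL" \
  '{query: ("{ actor { account(id: 1234567) { nrql(query: \"" + $nrql + "\") { results } } } }")}' \
  | curl --silent \
      --header "API-Key: ${NEW_RELIC_API_KEY}" \
      --header "Content-Type: application/json" \
      --data @- \
      https://api.newrelic.com/graphql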


timmc-edx commented Aug 22, 2023

Info for the failure screenshotted in the ticket:

Searching that pod's logs for lines containing "job" turns up 2023-07-06T14:43:44.510470685Z stdout F 2023-07-06 14:43:44Z: Running job: python-3.8,django-pinned,lms-5 which matches the job name, but not the time; the jobs should all have started at about 14:32:31. In fact, a query for that workflow job name across all runners ("namespace_name":"actions-runner-system-openedx" "python-3.8,django-pinned,lms-5") doesn't show any pod receiving that job at the expected time.

The runner was actually active at about that time, but exited and restarted shortly after:

2023-07-06T14:24:59.016438764Z stdout F 2023-07-06 14:24:59Z: Running job: python-3.8,django-pinned,openedx-3
2023-07-06T14:32:41.821472919Z stdout F 2023-07-06 14:32:41Z: Job python-3.8,django-pinned,openedx-3 completed with result: Succeeded
2023-07-06T14:32:41.910505073Z stdout F Runner listener exited with error code 0
2023-07-06T14:32:41.910546161Z stdout F Runner listener exit with 0 return code, stop the service, no retry needed.
2023-07-06T14:43:38.524140086Z stderr F Docker wait check skipped. Either Docker is disabled or the wait is disabled, continuing with entrypoint
2023-07-06T14:43:38.524202006Z stderr F Github endpoint URL https://github.com/
2023-07-06T14:43:39.000094928Z stderr F Configuring the runner.

...so it probably was assigned the job, but never picked it up or something. (I'm hazy on how work assignment is done with these.) There could also be some time skew. But it did claim to exit normally, which is surprising.

Searching for references to the pod around that time ("cluster_name":"tools-edx-eks-automations" "message":"*7xdl7-twcfc*" in 14:25 to 14:45), I find that tools-auto-actions-runner-controller-openedx-58fb45666d-w7lxx deleted the runner pod:

2023-07-06T14:32:42.533768539Z stderr F 2023-07-06T14:32:42.533Z INFO actions-runner-controller.runner Deleted runner pod {"runner": "actions-runner-system-openedx/edx-platform-openedx-ci-runner-deployment-7xdl7-twcfc", "repository": "openedx/edx-platform"}

...and that's about it. The autoscaler mentions it a few minutes earlier as a pod that isn't a candidate for scale-down, and the controller doesn't have any nearby messages about errors or unreachability or anything.


timmc-edx commented Aug 23, 2023

I wanted to know how often we've had CD failures due to this, so I wrote some shell scripts to call the GoCD API and queried the job logs:

  1. Exported variables GOCD_TOKEN (a personal token generated in the GoCD UI) and GOCD_SERVER (https://domain)
  2. Retrieved the run history for the check_ci job (a rough sketch of such a script appears after this list): ./job-history.sh final_checks_before_prod/check_ci/check_ci_job_edx_platform > ~/cached/gocd/check_ci-history-20230823.jsonl
  3. Fetched the job log contents and identified the ones that saw a cancelled check: ./job-identify-matching-failures.sh ~/cached/gocd/check_ci-history-20230823.jsonl ~/cached/gocd/artifacts '/check-runs/[0-9]+ cancelled$' > ~/cached/gocd/check_ci-cancelled-20230823.jsonl
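
The scripts themselves aren't included in this ticket; for context, here is a minimal sketch of what the job-history step might look like against the GoCD API. The Accept header version, the "jobs" response field name, and the lack of pagination handling are all assumptions.

#!/usr/bin/env bash
# job-history.sh (sketch): dump the run history for a GoCD job as JSONL.
# Usage: ./job-history.sh pipeline/stage/job > history.jsonl
# Assumes GOCD_SERVER and GOCD_TOKEN are exported as described above;
# paging back through older runs is left out.
set -euo pipefail

job_path="$1"   # e.g. final_checks_before_prod/check_ci/check_ci_job_edx_platform

curl --silent --fail \
  --header "Authorization: Bearer ${GOCD_TOKEN}" \
  --header "Accept: application/vnd.go.cd.v1+json" \
  "${GOCD_SERVER}/go/api/jobs/${job_path}/history" \
  | jq -c '.jobs[]'   # one job run per line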

Then I was able to build a histogram of cancelled checks by month:

$ cat ~/cached/gocd/check_ci-cancelled-20230823.jsonl | while IFS= read -r json; do date -d @$(echo "$json" | jq .job_run.scheduled_date | head -c-4) --utc +"%Y-%m"; done | sort | uniq -c
      1 2023-01
      6 2023-02
      2 2023-03
      2 2023-04
      7 2023-05
     14 2023-06
     12 2023-07
      3 2023-08

Update 2023-08-28: See later comments on problems with this method.

timmc-edx commented:

Given that this seems to have died down somewhat, we may want to just close this for now.

timmc-edx commented:

Possible further actions:

  • Check if we're using the current version of the runner
  • Run a script to check for lost-connection on the GitHub actions log side
  • Check if this is happening on PRs (not just on master)

timmc-edx commented:

It turns out that looking at the GoCD logs alone isn't diagnostic, since the job that failed due to lost communication with the server is marked as a failure, while the other jobs are marked cancelled. We mostly shouldn't see any "real" failures on the master branch, even due to flaky tests, so these are probably mostly lost-communication errors... but it does mean that the histogram above isn't necessarily correct.

I also tried looking at the GitHub action errors using our script in edx-arch-experiments, but it doesn't help -- that script currently only looks at the most recent attempt of any Action, and we almost always end up re-running CI checks when they fail on master. So that script would need to be modified if we wanted to use it to detect this.
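
For reference, here is roughly what walking the earlier attempts could look like with the gh CLI. This is only a sketch, not the existing edx-arch-experiments script; it assumes the run id is already known, and finding the lost-communication message within each attempt's job or annotation data would still be a further step.

#!/usr/bin/env bash
# Sketch: list job conclusions for every attempt of a workflow run, so a
# failed first attempt isn't hidden by a successful re-run.
# Usage: ./list-attempts.sh <run_id>   (assumes an authenticated gh CLI)
set -euo pipefail

repo="openedx/edx-platform"
run_id="$1"

# run_attempt on the latest run object is the number of attempts so far.
attempts=$(gh api "repos/${repo}/actions/runs/${run_id}" --jq '.run_attempt')

for attempt in $(seq 1 "${attempts}"); do
  gh api "repos/${repo}/actions/runs/${run_id}/attempts/${attempt}/jobs" \
    --paginate --jq ".jobs[] | [\"attempt ${attempt}\", .name, .conclusion] | @tsv"
done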


robrap commented Aug 29, 2023

@timmc-edx

  1. The first step is to understand the size of this problem. It feels like it is worth at least a time-boxed effort to fix our script in edx-arch-experiments to look at past attempts.
  2. Once we can estimate the potential cost of this problem, we could see if there is any way to reduce the number of times it is happening. I've seen notes in the other tickets about everything potentially getting killed when scaling runners, about runners that may be too small, and about other configuration issues around "ephemeral" runners. I know nothing about our runners, so I can't speak to which, if any, of these might help us. The underlying question is simply whether there are any changes we can make that keep this from happening, or at least minimize its frequency.

timmc-edx commented:

I agree that it would be really useful to be able to get the error messages from previous attempts, just for our general ability to look at error frequency.

But I think the GoCD errors are enough to show that this error isn't occurring as much as it was, so I'm not sure it's worth it for this ticket specifically.


robrap commented Aug 31, 2023

Maybe we should create a ticket for fixing the script (which will probably go on the backlog), link it to this ticket, and icebox this one?

timmc-edx commented:

Created edx/edx-arch-experiments#437 for grooming; will icebox this one.
