Monitor frequency of GitHub action errors #32550

Closed
2 tasks
rgraber opened this issue Jun 22, 2023 · 1 comment

rgraber commented Jun 22, 2023

Acceptance criteria (A/C)

  • We are able to find out how often a given action failure occurs via something automated
  • If the lost communication error happens regularly, a new issue is created to address it

Implementation notes:
The "something automated" might be a script someone can run, a dashboard, or something else entirely; the choice is left to whoever picks this up.
If this turns out to be really hard, timebox it to a day or two.

Notes from the original creation of this ticket:

GitHub Actions jobs are occasionally failing with:

The self-hosted runner: edx-platform-openedx-ci-runner-deployment-7xdl7-zf8dj lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

This will cause final_checks_before_prod to fail.

Occurrences:

One possible use of this ticket would just be to determine if this is indeed happening more often than it used to.
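
As a rough sketch of how "more often than it used to" could be checked, the snippet below buckets failure timestamps by ISO week so that recent weeks can be compared with earlier ones. The function name `failures_per_week` and the idea of feeding it job timestamps are assumptions for illustration, not something the ticket prescribes.

```python
from collections import Counter
from datetime import datetime


def failures_per_week(timestamps):
    """Map (ISO year, ISO week) -> number of failures.

    `timestamps` is an iterable of ISO 8601 strings such as the
    `created_at` values returned by the GitHub API (hypothetical input).
    """
    counts = Counter()
    for stamp in timestamps:
        # GitHub timestamps end in "Z"; older Pythons' fromisoformat() rejects it.
        moment = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        year, week, _ = moment.isocalendar()
        counts[(year, week)] += 1
    # Sort by (year, week) so the trend reads chronologically.
    return dict(sorted(counts.items()))
```

Comparing the counts for the most recent weeks against earlier weeks gives a first, coarse answer without any dashboard.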

Possible avenues of exploration:

  • Are the GitHub workers resource-starved?
  • We may be able to use the GitHub API to get the status of jobs, then the command line to get the logs (if this error shows up in the logs)
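
The API route could look roughly like the sketch below. It lists failed workflow runs through the GitHub REST API, walks their jobs, and counts the jobs whose check-run annotations contain the error text quoted above. Several things here are assumptions rather than confirmed behaviour: that the message surfaces as a check-run annotation (it may only appear in the web UI or in logs), that a job's `id` can be used as its check-run id, and that reading only the first page of results is enough. The helper names (`api_get`, `count_lost_communication`, `is_lost_communication`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com"
# Substring taken from the error message quoted in this issue.
LOST_COMMUNICATION = "lost communication with the server"


def is_lost_communication(message):
    """Return True if an annotation message carries the runner error."""
    return LOST_COMMUNICATION in (message or "")


def api_get(path, token, params=None):
    """GET one page from the GitHub REST API and decode the JSON body."""
    url = f"{API}{path}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_lost_communication(repo, token, since, fetch=api_get):
    """Count failed jobs in runs created on or after `since` (YYYY-MM-DD).

    Only the first page of runs and jobs is read; a real script would
    need to follow pagination and respect rate limits.
    """
    runs = fetch(f"/repos/{repo}/actions/runs", token,
                 {"status": "failure", "created": f">={since}", "per_page": 100})
    hits = 0
    for run in runs["workflow_runs"]:
        jobs = fetch(f"/repos/{repo}/actions/runs/{run['id']}/jobs", token,
                     {"per_page": 100})
        for job in jobs["jobs"]:
            if job["conclusion"] != "failure":
                continue
            # Assumption: a job's id doubles as its check-run id.
            annotations = fetch(
                f"/repos/{repo}/check-runs/{job['id']}/annotations", token)
            if any(is_lost_communication(a.get("message")) for a in annotations):
                hits += 1
    return hits
```

A call such as `count_lost_communication("openedx/edx-platform", token, "2023-06-01")` would then return the number of matching jobs; the repository name is inferred from the runner name in the error and may need adjusting. The `fetch` parameter exists so the counting logic can be exercised with canned responses instead of live API calls.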

@rgraber rgraber added this to Arch-BOM Jun 22, 2023
@rgraber rgraber moved this to Prioritized in Arch-BOM Jun 22, 2023
@rgraber rgraber removed the status in Arch-BOM Jun 22, 2023
@rgraber rgraber moved this to Prioritized in Arch-BOM Jun 23, 2023
@rgraber rgraber changed the title from "Action runners failing with lost communication errors" to "Monitor frequency of GitHub action errors" Jun 26, 2023
@rgraber rgraber moved this from Prioritized to Groomed in Arch-BOM Jun 26, 2023
@jmbowman

I started discovery on dashboard tools for GitHub Actions (and other CI services) that might have out-of-the-box functionality for classifying errors, which could help here: openedx/public-engineering#168.

@jmbowman jmbowman moved this from Groomed to On-Call in Arch-BOM Jul 20, 2023
@rgraber rgraber moved this from On-Call to In Progress in Arch-BOM Jul 26, 2023
@rgraber rgraber self-assigned this Jul 26, 2023
@rgraber rgraber closed this as completed Aug 3, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Aug 3, 2023