GitHub Actions error collection script only reads latest attempt #437

timmc-edx · 2023-09-05T15:49:27Z

A/C:

The script iterates over all attempts rather than all jobs (we'll have to figure out how to do this)
Run the script and verify that it reveals more connection lost errors than it did previously
Write up brief findings on how often lost connection errors have been happening, as a comment on openedx/edx-platform#32671 ("edx-platform unit tests error: Self-hosted runner loses communication with the server")

The Actions error collection script only collects the status of the most recent attempt on each job. Since we re-run most of our failed jobs, this script can't see most of the error information we're interested in.
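As a rough sketch of what iterating over attempts could look like: the GitHub REST API has a per-attempt endpoint for listing the jobs of a workflow run, and the workflow run object carries a `run_attempt` number for its latest attempt. The helper name `attempt_jobs_urls` and the run ID in the usage note are hypothetical and are not taken from the existing script. Whether the data behind each attempt is actually distinct is a separate question that this sketch does not answer.

```python
API_ROOT = "https://api.github.com"


def attempt_jobs_urls(owner, repo, run_id, latest_attempt):
    """Return one jobs-listing URL per attempt of a workflow run, oldest first.

    `latest_attempt` is assumed to come from the `run_attempt` field of the
    workflow run object; attempts are numbered starting at 1.
    """
    return [
        f"{API_ROOT}/repos/{owner}/{repo}/actions/runs/{run_id}/attempts/{n}/jobs"
        for n in range(1, latest_attempt + 1)
    ]
```

For example, `attempt_jobs_urls("openedx", "edx-platform", 123, 3)` yields three URLs, one for each of attempts 1 through 3 of the (made-up) run 123.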

See openedx/edx-platform#32671 for an example of where this information would have been useful.

While we're in there, it might also be useful to turn this into a multi-stage script with caching. Currently there's a risk of getting rate-limited partway through a run, at which point all of the in-memory collected information is lost. It might be better to split the script so that it first gets all of the commits in the desired time range, writes that to file, and then gets job and attempt information -- but only for jobs that it hasn't already cached on disk. This would speed up future runs.
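A minimal sketch of the on-disk caching idea, assuming each piece of job or attempt information can be serialized as JSON and identified by a filename-safe string key. The name `cached_fetch`, the one-file-per-key layout, and the `fetch` callback are illustrative assumptions, not part of the current script.

```python
import json
from pathlib import Path


def cached_fetch(cache_dir, key, fetch):
    """Return the cached JSON for `key`, calling `fetch()` only on a cache miss."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        # Already collected on an earlier run: no API call needed.
        return json.loads(path.read_text(encoding="utf-8"))

    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it, so that an interrupted run
    # does not leave a truncated cache entry behind.
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(data), encoding="utf-8")
    tmp.replace(path)
    return data
```

With this shape, a run that gets rate-limited partway through keeps everything it has already written, and the next run only calls `fetch` for the keys that are still missing.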

RafayGhafoor · 2023-10-16T16:40:48Z

@robrap @timmc-edx, I would like to work on this task, and I am thinking of using the PyGithub library for the integration. Please let me know if I can take it.

robrap · 2023-10-16T19:35:44Z

@RafayGhafoor: That sounds good and we're here to answer questions. Good luck.

rgraber · 2023-11-16T14:29:40Z

@RafayGhafoor are you still working on this?

RafayGhafoor · 2023-11-16T16:20:42Z

@rgraber, I had been working on this task and was going to send a PR to enable the GitHub CLI to rerun failed jobs based on annotation messages, but the related issue created for the PR didn't get any traction.

What I had in mind was to integrate the GitHub CLI (gh) with the workflow so that it automatically reruns a job if the failed job has an annotation message of "Lost connection....".

Since the issue didn't get any follow-up, I have lost motivation to work on it. Still, I think a custom script that has the rights to rerun failed jobs could be a possible solution, operating only on jobs that failed due to losing communication with the server. These are the steps I had thought of adding as a last step of the CI:

Get the currently running event and supply it to the custom script.
The custom script checks whether any failed job has an annotation message of "Lost communication..." and triggers a rerun.
Wrap this whole logic in a retry so the action is retried at least x times with a delay of y, to ensure a successful run.
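The selection and retry parts of these steps could be sketched roughly as follows. Everything here is an assumption for illustration: the job dictionaries (with `conclusion` and `annotations` keys) are a simplified stand-in for whatever the script would actually fetch, the default `needle` simply reuses the truncated message quoted above, and `retry` reads "x times with y delay" as an upper bound on the number of tries. The rerun call itself is left out.

```python
import time


def jobs_needing_rerun(jobs, needle="Lost communication"):
    """Return failed jobs that have an annotation message containing `needle`."""
    return [
        job
        for job in jobs
        if job.get("conclusion") == "failure"
        and any(needle in message for message in job.get("annotations", []))
    ]


def retry(action, attempts=3, delay=30.0, sleep=time.sleep):
    """Call `action()` until it succeeds, trying at most `attempts` times.

    Waits `delay` seconds between tries and re-raises the last error if
    every try fails. `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```

Keeping the filter separate from the retry wrapper means the matching rule can be checked against recorded job data without triggering any reruns.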

robrap · 2024-01-19T18:07:35Z

@feanil: This might be a useful ticket for the Maintenance WG as well, because it would give you a view of issues across PRs to help with prioritization of tickets like #528.

timmc-edx · 2024-02-01T18:03:05Z

I made an attempt at fixing this in #544, which also includes some other improvements. But... it turns out all of the attempt objects for a workflow run are pointing to the same check suite! (The most recent one, naturally -- which means we lose any errors that provoke someone to re-run their tests.) This is blocked unless we can find a solution. I've posted about the issue at https://github.com/orgs/community/discussions/103026.

jristau1984 · 2024-08-05T13:28:26Z

This is now a Product Feedback submission, since it appears that this is just a bug in the API: https://github.com/orgs/community/discussions/124000

timmc-edx mentioned this issue Sep 5, 2023

edx-platform unit tests error: Self-hosted runner loses communication with the server openedx/edx-platform#32671

Open

timmc-edx added this to Arch-BOM Sep 5, 2023

timmc-edx moved this to Prioritized in Arch-BOM Sep 5, 2023

timmc-edx removed the status in Arch-BOM Sep 5, 2023

robrap moved this to Prioritized in Arch-BOM Sep 6, 2023

jmbowman moved this from Prioritized to On-Call in Arch-BOM Sep 7, 2023

rgraber moved this from On-Call to In Progress in Arch-BOM Jan 9, 2024

rgraber moved this from In Progress to On-Call in Arch-BOM Jan 12, 2024

robrap assigned timmc-edx and openback and unassigned openback Jan 26, 2024

robrap mentioned this issue Jan 26, 2024

Replace codecov in openedx-events #528

Closed

3 tasks

timmc-edx moved this from On-Call to In Progress in Arch-BOM Jan 29, 2024

timmc-edx moved this from In Progress to Blocked in Arch-BOM Feb 1, 2024

jristau1984 added the on-call label Jul 1, 2024

jristau1984 closed this as not planned Aug 5, 2024

github-project-automation bot moved this from Blocked to Done in Arch-BOM Aug 5, 2024
