TASK: Work with QA on update grouping, handling mass rebuild-type events for openQA testing #200

AdamWill · 2024-09-10T17:55:44Z

What does the ELN SIG need to do?

We have recently enabled testing of ELN updates in openQA - ELN updates are the ones with only four tests run on them - but some issues have emerged which @sgallagher and @yselkowitz and I discussed on a call:

Builds that are grouped into a single update for Rawhide - e.g. the four builds in https://bodhi.fedoraproject.org/updates/FEDORA-2024-f2990d44be - are split into separate updates for ELN - e.g. https://bodhi.fedoraproject.org/updates/FEDORA-2024-d4e7710ae3 , https://bodhi.fedoraproject.org/updates/FEDORA-2024-6402c35a7d and two others contain the equivalent builds from that single Rawhide update. This can cause openQA tests to fail because interdependent packages are not tested against each other.
Building on this, we foresee potential issues with mass build events, like the Rawhide mass rebuild, KDE and GNOME megaupdates, or e.g. a Python or Perl version bump which requires all Python or Perl packages to be rebuilt. As things stand, this will likely create hundreds or thousands of separate Bodhi updates for ELN. This already happened once with the F41 Rawhide mass rebuild and caused various issues; now that we have openQA update testing enabled, it would also flood openQA with hundreds of updates to test separately (openQA does not really have the capacity for this).

@sgallagher had a couple of ideas to address this on the ELN side:

When EBS is creating a batch of ELN updates, if there are fewer than X builds (X to be decided, but probably somewhere around 500-1000), create a single combined update instead of separate updates. This means openQA only has to run one test run (not several hundred), and it's more likely to pass (although if it fails it's slightly harder to pinpoint the cause).
If there are more than X builds, try and bypass Bodhi entirely in order to avoid the problems we had with the F41 mass rebuild, and also avoid flooding openQA. This might involve tagging the builds to a different tag so they should get signed without going through the automatic update creation workflow.
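To make the two branches concrete, here is a minimal sketch of the size check, assuming EBS hands us the batch as a list of NVR strings. The names `Route`, `route_batch` and `MAX_BUILDS_PER_UPDATE` are hypothetical, the threshold value is only a placeholder for the undecided X, and treating a batch of exactly X builds as "too big" is my assumption since the proposal doesn't cover that case.

```python
from enum import Enum


class Route(Enum):
    COMBINED_UPDATE = "combined_update"  # one Bodhi update for the whole batch
    BYPASS_BODHI = "bypass_bodhi"        # tag elsewhere, skip update creation


# Placeholder for X; the thread only says "probably somewhere around 500-1000".
MAX_BUILDS_PER_UPDATE = 500


def route_batch(builds: list[str], limit: int = MAX_BUILDS_PER_UPDATE) -> Route:
    """Pick a route for one finished batch based purely on its size."""
    if not builds:
        raise ValueError("empty batch")
    if len(builds) < limit:
        return Route.COMBINED_UPDATE
    # Assumption: a batch of exactly `limit` builds also bypasses Bodhi.
    return Route.BYPASS_BODHI
```

What each route actually does (creating the combined update, or tagging to a different tag so the builds get signed) is left out here, since that depends on EBS internals.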

@sgallagher also said that, because of how EBS batching and buildroot handling work, it's probably not feasible to gate ELN updates. He wants the tests to run, but he doesn't want updates to wait for the tests to complete or be blocked if they fail; they should always go through. This is already how things work, but it makes the testing less 'effective' and means we need more manual review and oversight: we need to notice when there's a failure, and any failure will 'cascade' to subsequent updates until it's resolved or the offending build is manually untagged. On the openQA side, I had a couple of ideas to mitigate this:

Move ELN tests to a separate 'group' in openQA. This is purely to make it easier to manually review the results. Instead of them being mixed in with tests of updates for all other Fedora releases, they'd be separate on the front page and there would be a URL where you could see only ELN update test results.
Gate composes instead of updates. For mainline Fedora we intended to do compose gating for years but ultimately never did, because it turned out that implementing enough update tests and gating updates obviated the need for it (we almost never, these days, have a Fedora compose so bad that we "should have" gated it, because of the update testing). But for ELN it sounds like we might have to actually do it. This would involve more or less the following:

Write a greenwave policy (trivial, I can do this in five minutes)
Have the compose script, or something else, wait for tests to complete and then query greenwave before 'blessing' the compose (I guess, in practice, this means syncing it to https://dl.fedoraproject.org/pub/eln/ , once we've fully switched off ODCS and onto the nightly script). This is much less trivial. The best way to know when the tests are done is to listen for fedora-messaging openqa.job.done messages, filter to ones for the compose you care about, and wait for one with a remaining value of 0 - this means that, at the time it finished, no other jobs were scheduled for the same compose. Then you can hit greenwave's API and request a decision (which comes back as JSON). This isn't really that difficult, but probably pretty messy to do in-line in a shell script. It might require converting the shell script into something more sophisticated, or taking the sync step out of the script and doing it in an infra "toddler" or a standalone message consumer. Doing it that way would also handle the case that a test fails and it's a bad needle or just a blip or something, then we re-run it and it passes; an always-listening consumer would trigger again in that case and sync the compose, a one-time script which only checks the first time it sees a message with a remaining count of 0 would not.
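A rough sketch of what the consumer's logic could look like, to show it really isn't much code. Treat the details as assumptions: the message body keys (`BUILD`, `remaining`), the decision context name, the product version and the `compose` subject type would all need checking against the real messages and the policy we end up writing, and `on_job_done` is a hypothetical callback body rather than an existing toddler.

```python
import json
import urllib.request

# Assumptions: endpoint, context and product version are illustrative.
GREENWAVE_URL = "https://greenwave.fedoraproject.org/api/v1.0/decision"
DECISION_CONTEXT = "eln_compose_sync"  # hypothetical policy name
PRODUCT_VERSION = "fedora-eln"         # hypothetical


def is_last_job_for(body: dict, compose_id: str) -> bool:
    """True if this job.done message is for our compose and nothing remains."""
    # str() so that both 0 and "0" count as "no jobs remaining".
    return body.get("BUILD") == compose_id and str(body.get("remaining")) == "0"


def decision_request(compose_id: str) -> dict:
    """Build the JSON payload for a Greenwave decision request."""
    return {
        "product_version": PRODUCT_VERSION,
        "decision_context": DECISION_CONTEXT,
        "subject": [{"type": "compose", "item": compose_id}],
    }


def compose_passes(compose_id: str, url: str = GREENWAVE_URL) -> bool:
    """POST the decision request and read `policies_satisfied` from the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(decision_request(compose_id)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return bool(json.load(resp).get("policies_satisfied"))


def on_job_done(body, compose_id, sync, check=compose_passes) -> bool:
    """Run for every openqa.job.done message; returns True if it synced."""
    if is_last_job_for(body, compose_id) and check(compose_id):
        sync(compose_id)
        return True
    return False
```

Because `on_job_done` runs for every message rather than once, a re-run that passes after an initial failure produces another `remaining` 0 message and triggers the sync then, which is the behaviour a one-shot check in the shell script would miss.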


AdamWill · 2024-09-10T18:01:32Z

I think https://bodhi.fedoraproject.org/updates/FEDORA-2024-c9a2438d21 is the biggest update we have successfully handled through Bodhi + openQA so far; it has 445 packages. I did various bits of work on the openQA tests to make them handle an update that size, I hope they would handle something even larger but it's impossible to know without trying. It does cause some issues for Bodhi itself too - creating or editing an update of that size takes a very long time, and loading the test results in the webUI takes a long time too.

AdamWill · 2024-09-10T18:05:20Z

I guess we could also just try unconditionally creating a batched update, no matter how large the batch, and see how Bodhi and openQA manage. I think the biggest we'd realistically wind up with is 2-3k packages? Maybe they could deal with that. I kinda suspect Bodhi might be stretched just a bit too far and start hitting timeouts on update creation, but maybe it'd be OK.

sgallagher · 2024-09-18T15:22:29Z

Sorry for the delay; I've been busy getting rid of ODCS.

First off, thanks for the extremely detailed summary of the conversation, @AdamWill

OK, on to the meat of the discussion:

Bodhi Updates

I'm worried about the failure case if we can't create a Bodhi update. What exactly is our fallback path if Bodhi 1) crashes and doesn't create an update at all or 2) times out on creation but DOES actually (eventually) create the update?

The core issue is that we need to ensure that the packages we just finished building get into the buildroot for the next batch as soon as possible, since the next batch may be relying on those that just finished. Any additional delay added to that (such as gating Bodhi updates) increases the risk that we won't have an appropriate buildroot for the next batch. Yes, we could delay the start of the next batch until the Bodhi update is pushed to stable, but if gating tests interfere, that could lead to blocking further builds entirely.

One probably-crazy idea I just had is that at the conclusion of a batch, we could immediately tag its results directly to the eln-build tag, rather than eln, essentially making them a buildroot override. Then we could potentially process the Bodhi megaupdate in parallel, untagging things from eln-build when they are tagged into eln. This would be a non-trivial amount of work added to ELNBuildSync, certainly. It potentially adds the ability for us to gate the Bodhi update as well, though in practice I'm not sure that we'd ever want to really do that. We won't really have a mechanism available to fix up individual failing packages in the Bodhi erratum, so we'd either have to push all of them stable or drop the whole update and submit a new one manually... neither option sounds great.
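To illustrate the tag juggling, here's a minimal sketch assuming a Koji client session object is passed in. `tagBuild`, `untagBuild` and `listTagged` are regular Koji hub calls, but the two helper functions are hypothetical and this ignores that tagging is asynchronous (the hub returns a task that has to finish), as well as any error handling ELNBuildSync would need.

```python
def tag_batch_as_override(session, nvrs, override_tag="eln-build"):
    """Tag a finished batch straight into the buildroot tag."""
    for nvr in nvrs:
        session.tagBuild(override_tag, nvr)


def drop_landed_overrides(session, override_tag="eln-build", stable_tag="eln"):
    """Untag overrides whose builds have since been tagged into `stable_tag`.

    Returns the sorted list of NVRs that were untagged.
    """
    # listTagged does not follow inheritance by default, so `direct` only
    # contains builds tagged directly into the override tag.
    direct = {b["nvr"] for b in session.listTagged(override_tag)}
    landed = {b["nvr"] for b in session.listTagged(stable_tag)}
    done = sorted(direct & landed)
    for nvr in done:
        session.untagBuild(override_tag, nvr)
    return done
```

The cleanup could run each time the megaupdate (or part of it) is pushed, so the override only lives as long as the Bodhi update is in flight.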

Compose testing

I like the idea of moving the sync to /pub/eln out of the eln-nightly.sh script and under the responsibility of another service. I think we'd probably want to run that service somewhere other than the compose host itself, which would mean we'd need to give that service access to execute a sync script if-and-when the compose promotion was approved. I assume we could set up a sudo rule allowing a bot to run a /usr/local/bin/sync_compose <src> /pub/eln/1 script on the compose host. Is that a thing we have a precedent for, @nirik ?

Obviously, if we implement this for ELN, we should do it in a way that's reusable for other Fedora streams like Rawhide.

AdamWill assigned tdawson and yselkowitz Sep 10, 2024
