
Update containers: el9, fedora40, noble, add arm64 for fedora40, el9, noble #6128

Merged
merged 6 commits into flux-framework:master from update-containers
Jul 28, 2024

Conversation

trws
Member

@trws trws commented Jul 20, 2024

This ended up being more churn than I intended, but once I got into it, here's where I ended up.

Effects:

  • add containers for: ubuntu/noble, fedora/40, el9
  • factor the caliper build out into a script; fix the build on fedora and noble with a 1-line patch
  • factor the python deps into a requirements file so we don't have to keep copy/pasting them
  • change the ubuntu and debian dockerfiles to use a single apt update rather than many
  • add catch2 all the way at the bottom; this is mainly for sched, but if we can count on it, it could also be used for core testing, since it helps so much with things like actually printing the values in failing unit tests and makes fixtures a whole lot easier to manage. It's also not large
  • add an env var to fix the stupid HWLOC problem where it tries to look for opengl devices by blindly talking to a port that is sometimes an x11 server (rough sketch of these changes below)

All of these are built and pushed for all appropriate architectures. The bookworm, jammy, and alpine tweaks shouldn't have any impact on existing tests, since they're additions or moving things; the rest are new containers at new names.
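
As a rough illustration only (package names, paths, and the exact variable are placeholders, not the actual contents of these images), the Dockerfile-side changes above boil down to shell along these lines:

    # one apt-get update feeding a single install, instead of repeated update/install pairs
    apt-get update \
     && apt-get install -y --no-install-recommends build-essential git python3-pip \
     && rm -rf /var/lib/apt/lists/*

    # python deps pulled from a shared requirements file rather than copy/pasted lists
    python3 -m pip install -r requirements.txt

    # keep hwloc from blindly probing the X11 port for OpenGL devices; this assumes
    # excluding the "gl" component is the mechanism, and the variable actually set
    # in the images may differ
    export HWLOC_COMPONENTS=-gl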

TODO:

  • update github actions to create manifests for the new arm64 containers (rough sketch after this list)
  • update check requirements
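
For the first item, the manifest step might look roughly like this in a workflow step; image names and tags here are illustrative only, not the project's actual registry or naming scheme:

    # combine the per-arch images into a single multi-arch manifest and push it
    docker manifest create ghcr.io/example/flux-env:noble \
        ghcr.io/example/flux-env:noble-amd64 \
        ghcr.io/example/flux-env:noble-arm64
    docker manifest push ghcr.io/example/flux-env:noble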

@trws trws requested a review from grondo July 20, 2024 22:32
@trws trws force-pushed the update-containers branch 6 times, most recently from 7995d61 to c0965c3 Compare July 20, 2024 23:39
@trws
Member Author

trws commented Jul 20, 2024

Almost everything works, but there's an issue with mpi on the two most recent distros, fedora40 and noble. It makes me wonder if there's another issue with the PMI in the newer mpich maybe? The problem goes away in the test that sets PSM3_HAL (not sure what that does, or why, but it fixes it 🤷). @garlick is there a chance this is related to the PMI issue you mentioned with hydra/mpich? I remember that was going the other way, but the cause of the bootstrap issue isn't jumping out at me.

@garlick
Member

garlick commented Jul 21, 2024

That was #6072 which was worked around by #6081 (merged).

Edit: meant to add that (as you alluded) this was a problem with hydra launching flux.
I don't think I tested flux launching that version of mpich.

@trws
Member Author

trws commented Jul 21, 2024

Right, I should have mentioned I saw that was fixed, sorry. The current behavior seems to be that, say, "flux run -n 8 abort" runs eight rank-0 processes. Everything works in the one test that sets PSM3_HAL, though, and I can't offhand remember what that does. Clearly something is going funny with PMI, so I was vaguely hoping it was related, maybe needing a fix on the mpich side or a different workaround from us.

@garlick
Member

garlick commented Jul 22, 2024

Looks like the PSM3_HAL variable was set to address mpich problems in fedora39. Presumably we need it for fedora40 as well.

See also

(Not sure I understand yet why that is needed)
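
For reference, the workaround under discussion amounts to an environment variable in the container/test environment. This is only a sketch: "loopback" is one plausible value for single-node CI runs, the value actually used in CI may differ, and ./mpi-hello is a placeholder test binary.

    # steer PSM3 away from probing real fabric hardware during single-node tests
    export PSM3_HAL=loopback
    flux start -s 4 flux run -n 4 ./mpi-hello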

@trws
Member Author

trws commented Jul 22, 2024

Thanks for that @garlick. I'm guessing it's related to the MPI version, but it looks like what that does is turn off all multi-node features... that's potentially bad. Will do a bit more digging, but this seems like an issue we may legitimately need a fix for; maybe we're picking up an MPI that needs PMIx?

@garlick
Member

garlick commented Jul 22, 2024

The MPI tests do check for the "N singletons" type of error because that is a common one. For example

test_expect_success 'mpi hello various sizes' '
        run_timeout 30 flux submit --cc=1-$MAX_MPI_SIZE $OPTS \
                --watch -n{cc} ${HELLO} >hello.out &&
        for i in $(seq 1 $MAX_MPI_SIZE); do \
                grep "There are $i tasks" hello.out; \
        done
'

I think the PSM3 environment variable just needs to be set for fedora because fedora now ships an environment module with mpich (based on my comment in #5694), and presumably they included PathScale support that has to be suppressed.

I didn't look into why it's set for one fedora builder but not the other.

@trws
Member Author

trws commented Jul 22, 2024

Have I mentioned I'm frustrated with Ubuntu lately? This is just absolutely bonkers: https://bugs.launchpad.net/ubuntu/+source/mpich/+bug/2072338

They linked their mpich with libpmix in noble, so not only do we not work with mpich, neither does hydra. 🤯

@garlick
Member

garlick commented Jul 22, 2024

They linked their mpich with libpmix in noble, so not only do we not work with mpich, neither does hydra.

🤦🤦🤦🤦🤦

@trws trws force-pushed the update-containers branch 6 times, most recently from 45dc5d0 to d76843a Compare July 22, 2024 20:53
@grondo
Contributor

grondo commented Jul 22, 2024

It is a bit disturbing that this change drops code coverage by >5%

codecov/project — 78.18% (-5.18%) compared to f5c5079 

@trws
Member Author

trws commented Jul 22, 2024

I think that might be because the el9 test just wouldn't finish. If it's still like that once everything is finishing, I'm planning to dig into it more. Maybe we're only doing coverage on an RPM distro now rather than both deb and rpm? Not sure; either way, I definitely want to fix that before changing check requirements and merging this.

@grondo
Contributor

grondo commented Jul 22, 2024

Ah, we only do two coverage runs because they take so long. Currently I think there's a coverage build that uses the default image, then one that just runs the "system" tests.

@trws trws force-pushed the update-containers branch 3 times, most recently from d31c3cb to 240de5c Compare July 22, 2024 23:33
@trws
Member Author

trws commented Jul 23, 2024

Ok, I don't have the new codecov comment yet, but all the coverage posted, and the python coverage is reporting the same numbers as on master. This should be ready for a review.

@trws trws requested a review from garlick July 23, 2024 20:03
@trws
Member Author

trws commented Jul 26, 2024

@grondo, @garlick any further changes desired here? It would be nice to drop focal off the end before the interner PR lands in sched, if that doesn't cause issues for anyone; of the images we're currently supporting, it's the only one that really doesn't support C++20.

@grondo
Contributor

grondo commented Jul 26, 2024

Let me take another look, but if all the tests are passing then it is probably all good!

It would be nice to drop focal off the end before the interner PR lands in sched, if that doesn't cause issues for anyone; of the images we're currently supporting, it's the only one that really doesn't support C++20.

It is probably fine to drop focal support in flux-sched whenever; nothing says that other projects have to support the same distros as flux-core. Looks like focal doesn't reach EOL until Apr 2025, though I doubt we have many users on it (I still have a couple of VMs at that version).

@grondo
Contributor

grondo commented Jul 26, 2024

Oh, one reminder: we should update branch protections just before merging this one (we may have to, since some check names no longer apply).

@grondo
Contributor

grondo commented Jul 26, 2024

Also, it appears some subprojects still use focal; it might be best to just keep that image for now until they update, or we can transition them to newer images before landing this PR (flux-coral2, flux-pmix, and flux-pam are all affected).

@trws
Member Author

trws commented Jul 26, 2024

Yup, various subprojects still use the older images, since I started from the bottom. The theory on it being about time for focal is that it's two LTS releases ago, and the only distro we support that can't provide a reasonably recent compiler (the newest in focal repos seems to be gcc-10). The RHEL images have gone from some of the most difficult to deal with to some of the easiest, because they have been doing such a good job supporting new compilers; heck, even lassen has gcc-12 on it from upstream RHEL repos. That's why the older RHEL isn't causing similar issues.

Anyway, I can certainly add the docker build/tag back in here while we address the downstream projects; I'll make that change.

@grondo
Contributor

grondo commented Jul 26, 2024

The theory on it being about time for focal is that it's two LTS releases ago, and the only distro we support that can't provide a reasonably recent compiler (the newest in focal repos seems to be gcc-10).

Yeah, this makes sense to me!

trws added 2 commits July 26, 2024 21:01
problem: we don't have recent distros, and some of them take a _long_
time to build because of repeated fetches from slow ports repos (aarch64
repos are really slow on ubuntu for some reason, so repeated apt update
== bad)

solution: add el9, fedora40, and ubuntu noble while refactoring noble to
use a single apt update and apt install pair as well as switching el9 to
use dnf (much faster than yum, and doesn't require separate update)
problem: We need to test the newer containers, and have only the one
image built for arm64

solution: update several of our tests to new container versions; build
noble, fedora40, el8 for arm. focal remains but should be removed in the
near future
@trws trws force-pushed the update-containers branch from 713ca07 to c6cc230 Compare July 27, 2024 04:01
@trws
Member Author

trws commented Jul 27, 2024

The focal docker build has been restored.

Contributor

@grondo grondo left a comment


This LGTM! Just spotted one possible leftover debug line in one of the commits.

Also, had to restart the coverage build because we hit an instance of #6078

Comment on lines 40 to 42

set -x
Contributor


Leftover debug?

Member Author


Yup, thanks for that.

@grondo
Contributor

grondo commented Jul 27, 2024

Not sure if we just want to throw on a commit that sets GCOV_ERROR_FILE=/dev/null here to avoid the occasional coverage failure (#6078), if you agree that's the right fix.
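
That workaround would amount to something like the following in the coverage environment; a sketch only, and whether it's the right fix is exactly the open question above:

    # discard libgcov error chatter (e.g. profile-merge complaints) instead of
    # letting it pollute test output and trip the coverage run
    export GCOV_ERROR_FILE=/dev/null
    make check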

trws added 4 commits July 27, 2024 10:31
problem: some hwloc versions try to connect to the x11 port if it exists
causing all kinds of problems

solution: set the env var to tell hwloc to not do that
problem: new docker images are not pushed with combined manifests
solution: push manifests in github actions
problem: checks_run only reports coverage if the initial check succeeds,
but reports success if recheck succeeds. That hides errors in the
coverage reporting and means we don't get some of our test coverage
whenever a spurious test failure happens during coverage

solution: move coverage reporting into a new POSTCHECKCMDS variable that
we evaluate if check succeeds or if recheck succeeds
problem: we keep getting merge issues with gcov files.  It seems the
default for updating these is "single", meaning single-threaded, unless
the compilation is done with `-pthread`. I'm guessing we see the errors
most on libraries we use in multiple places that are not themselves
compiled that way, even though they're eventually linked into binaries
that are. If this doesn't cover it, we should also do what's listed in

solution: add `-fprofile-update=atomic` to our coverage flags
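
A rough shell sketch of the checks_run change described in the POSTCHECKCMDS commit message above; the target and variable handling are simplified and not the script's actual logic:

    # report coverage whenever either the initial check or the recheck
    # succeeds, instead of only when the first attempt passes
    if make check || make recheck; then
        eval "$POSTCHECKCMDS"   # e.g. coverage collection/upload commands
        exit 0
    fi
    exit 1
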
@trws trws force-pushed the update-containers branch from c6cc230 to 81e233e Compare July 27, 2024 17:31
@trws
Member Author

trws commented Jul 27, 2024

Thanks for the pointer, I had missed that issue. Since we run tests in parallel, and some of these libraries get used in threads, I'm trying an explicit -fprofile-update=atomic in our coverage flags in hopes that's what's corrupting the files rather than something more arcane. At worst it shouldn't hurt; if it doesn't take care of it, we should probably do the env var as well.
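
A minimal sketch of the kind of flag change being described, assuming an autotools-style coverage build (the project's actual configure machinery may differ):

    # add atomic profile updates alongside the usual coverage instrumentation
    ./configure CFLAGS="-O0 -g --coverage -fprofile-update=atomic" LDFLAGS="--coverage"
    make -j "$(nproc)" && make check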

@trws
Member Author

trws commented Jul 27, 2024

Looks like we're green here pending branch check updates.

@grondo
Contributor

grondo commented Jul 27, 2024

I'm trying an explicit -fprofile-update=atomic in our coverage flags in hopes that's what's corrupting the files rather than something more arcane

Oh cool, I searched all around and didn't find that. My only worry is that the coverage check, which already takes the longest, will be slower with atomic updates.

@grondo
Contributor

grondo commented Jul 27, 2024

Anyway, this is already approved, so if you want, make the protected-branch updates and then set MWP. Thanks for the work here!

@trws
Member Author

trws commented Jul 28, 2024

I'm trying an explicit -fprofile-update=atomic in our coverage flags in hopes that's what's corrupting the files rather than something more arcane

Oh cool, I searched all around and didn't find that. My only worry is that the coverage check, which already takes the longest, will be slower with atomic updates.

Much as I agree, I'm in favor of having the results be reliable, and would probably propose we split the coverage testing into two runners if we end up needing to. That said, the runtime after the change is actually slightly lower than in another PR currently in flight. I assume that's just noise in the runs, but it doesn't seem to have had a meaningful performance impact, possibly because much of our code is already using atomic updates since it's visibly threaded at link time? Not sure; either way, it's definitely something to keep an eye on, but probably not an immediate issue.

@trws
Member Author

trws commented Jul 28, 2024

Ok, check requirements on master all updated, MWP set.

@mergify mergify bot merged commit 3dc3207 into flux-framework:master Jul 28, 2024
33 checks passed

codecov bot commented Oct 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.37%. Comparing base (8a2d204) to head (81e233e).
Report is 536 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6128   +/-   ##
=======================================
  Coverage   83.37%   83.37%           
=======================================
  Files         521      521           
  Lines       84653    84653           
=======================================
+ Hits        70577    70582    +5     
+ Misses      14076    14071    -5     

see 10 files with indirect coverage changes
