
Update containers: el9, fedora40, noble, add arm64 for fedora40, el9, noble #6128

Merged
merged 6 commits into flux-framework:master from update-containers
Jul 28, 2024

Conversation

trws
Member

@trws trws commented Jul 20, 2024

This ended up being more churn than I intended, but once I got into it, here's where I ended up.

Effects:

  • add containers for: ubuntu/noble, fedora/40, el9
  • factor the caliper build out into a script; fix the build on fedora and noble with a 1-line patch
  • factor the python deps into a requirements file so we don't have to keep copy/pasting them
  • change the ubuntu and debian dockerfiles to use a single apt update rather than many
  • add catch2 all the way at the bottom; this is mainly for sched, but if we can count on it, it could also be used for core testing, since it helps so much with things like actually printing the values in failing unit tests and makes fixtures a whole lot easier to manage. It's also not large
  • add an env var to fix the stupid HWLOC problem where it tries to look for opengl devices by blindly talking to a port that is sometimes an x11 server (rough sketch of these changes below)

All of these are built and pushed for all appropriate architectures. The bookworm, jammy, and alpine tweaks shouldn't have any impact on existing tests, since they're additions or moving things; the rest are new containers at new names.
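
As a rough illustration only (package names, paths, and the exact variable are placeholders, not the actual contents of these images), the Dockerfile-side changes above boil down to shell along these lines:

    # one apt-get update feeding a single install, instead of repeated update/install pairs
    apt-get update \
     && apt-get install -y --no-install-recommends build-essential git python3-pip \
     && rm -rf /var/lib/apt/lists/*

    # python deps pulled from a shared requirements file rather than copy/pasted lists
    python3 -m pip install -r requirements.txt

    # keep hwloc from blindly probing the X11 port for OpenGL devices; this assumes
    # excluding the "gl" component is the mechanism, and the variable actually set
    # in the images may differ
    export HWLOC_COMPONENTS=-gl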

TODO:

  • update github actions to create manifests for the new arm64 containers (rough sketch after this list)
  • update check requirements
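
For the first item, the manifest step might look roughly like this in a workflow step; image names and tags here are illustrative only, not the project's actual registry or naming scheme:

    # combine the per-arch images into a single multi-arch manifest and push it
    docker manifest create ghcr.io/example/flux-env:noble \
        ghcr.io/example/flux-env:noble-amd64 \
        ghcr.io/example/flux-env:noble-arm64
    docker manifest push ghcr.io/example/flux-env:noble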

@trws trws requested a review from grondo July 20, 2024 22:32
@trws trws force-pushed the update-containers branch 6 times, most recently from 7995d61 to c0965c3 Compare July 20, 2024 23:39
@trws
Member Author

trws commented Jul 20, 2024

Almost everything works, but there's an issue with mpi on the two most recent distros, fedora40 and noble. It makes me wonder if there's another issue with the PMI in the newer mpich maybe? The problem goes away in the test that sets PSM3_HAL (not sure what that does, or why, but it fixes it 🤷). @garlick is there a chance this is related to the PMI issue you mentioned with hydra/mpich? I remember that was going the other way, but the cause of the bootstrap issue isn't jumping out at me.

@garlick
Member

garlick commented Jul 21, 2024

That was #6072 which was worked around by #6081 (merged).

Edit: meant to add that (as you alluded) this was a problem with hydra launching flux.
I don't think I tested flux launching that version of mpich.

@trws
Member Author

trws commented Jul 21, 2024

Right, I should have mentioned I saw that was fixed, sorry. The current behavior seems to be that, say, "flux run -n 8 abort" runs eight rank-0 processes. Everything works in the one test that sets PSM3_HAL, though, and I can't offhand remember what that does. Clearly something is going funny with PMI, so I was vaguely hoping it was related, maybe needing a fix on the mpich side or a different workaround from us.

@garlick
Member

garlick commented Jul 22, 2024

Looks like the PSM3_HAL variable was set to address mpich problems in fedora39. Presumably we need it for fedora40 as well.

See also

(Not sure I understand yet why that is needed)
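
For reference, the workaround under discussion amounts to an environment variable in the container/test environment. This is only a sketch: "loopback" is one plausible value for single-node CI runs, the value actually used in CI may differ, and ./mpi-hello is a placeholder test binary.

    # steer PSM3 away from probing real fabric hardware during single-node tests
    export PSM3_HAL=loopback
    flux start -s 4 flux run -n 4 ./mpi-hello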

@trws
Member Author

trws commented Jul 22, 2024

Thanks for that @garlick. I'm guessing it's related to the MPI version, but it looks like what that does is turn off all multi-node features... that's potentially bad. Will do a bit more digging, but this seems like an issue we may legitimately need a fix for; maybe we're picking up an MPI that needs PMIx?

@garlick
Member

garlick commented Jul 22, 2024

The MPI tests do check for the "N singletons" type of error because that is a common one. For example

test_expect_success 'mpi hello various sizes' '
        run_timeout 30 flux submit --cc=1-$MAX_MPI_SIZE $OPTS \
                --watch -n{cc} ${HELLO} >hello.out &&
        for i in $(seq 1 $MAX_MPI_SIZE); do \
                grep "There are $i tasks" hello.out; \
        done
'

I think the PSM3 environment variable just needs to be set for fedora because fedora now ships an environment module with mpich (based on my comment in #5694), and presumably they included PathScale support that has to be suppressed.

I didn't look into why it's set for one fedora builder but not the other.

@trws
Member Author

trws commented Jul 22, 2024

Have I mentioned I'm frustrated with Ubuntu lately? This is just absolutely bonkers: https://bugs.launchpad.net/ubuntu/+source/mpich/+bug/2072338

They linked their mpich with libpmix in noble, so not only do we not work with mpich, neither does hydra. 🤯

@garlick
Member

garlick commented Jul 22, 2024

They linked their mpich with libpmix in noble, so not only do we not work with mpich, neither does hydra.

🤦🤦🤦🤦🤦

@trws trws force-pushed the update-containers branch 6 times, most recently from 45dc5d0 to d76843a Compare July 22, 2024 20:53
@grondo
Contributor

grondo commented Jul 22, 2024

It is a bit disturbing that this change drops code coverage by >5%

codecov/project — 78.18% (-5.18%) compared to f5c5079 

@trws
Member Author

trws commented Jul 22, 2024

I think that might be because the el9 test just wouldn't finish. If it's still like that once everything is finishing, I'm planning to dig into it more. Maybe we're only doing coverage on an RPM distro now rather than both deb and rpm? Not sure; either way, I definitely want to fix that before changing check requirements and merging this.

@grondo
Contributor

grondo commented Jul 22, 2024

Ah, we only do two coverage runs because they take so long. Currently I think there's a coverage build that uses the default image, then one that just runs the "system" tests.

@trws trws force-pushed the update-containers branch 3 times, most recently from d31c3cb to 240de5c Compare July 22, 2024 23:33
@trws
Member Author

trws commented Jul 23, 2024

Ok, I don't have the new codecov comment yet, but all the coverage posted, and the python coverage is reporting the same numbers as on master. This should be ready for a review.

@trws trws requested a review from garlick July 23, 2024 20:03
@trws
Member Author

trws commented Jul 26, 2024

@grondo, @garlick any further changes desired here? It would be nice to drop focal off the end before the interner PR lands in sched, if that doesn't cause issues for anyone; of the images we're currently supporting, it's the only one that really doesn't support C++20.

@grondo
Contributor

grondo commented Jul 26, 2024

Let me take another look, but if all the tests are passing then it is probably all good!

It would be nice to drop focal off the end before the interner PR lands in sched, if that doesn't cause issues for anyone; of the images we're currently supporting, it's the only one that really doesn't support C++20.

It is probably fine to drop focal support in flux-sched whenever; nothing says that other projects have to support the same distros as flux-core. Looks like focal doesn't reach EOL until Apr 2025, though I doubt we have many users on it (I still have a couple of VMs at that version).

@grondo
Contributor

grondo commented Jul 26, 2024

Oh, one reminder: we should update branch protections just before merging this one (we may have to, since some check names no longer apply).

@grondo
Contributor

grondo commented Jul 26, 2024

Also, it appears some subprojects still use focal; it might be best to just keep that image for now until they update, or we can transition them to newer images before landing this PR (flux-coral2, flux-pmix, and flux-pam are all affected).

@trws
Member Author

trws commented Jul 26, 2024

Yup, various subprojects still use the older images, since I started from the bottom. The theory on it being about time for focal is that it's two LTS releases ago, and the only distro we support that can't provide a reasonably recent compiler (the newest in focal repos seems to be gcc-10). The RHEL images have gone from some of the most difficult to deal with to some of the easiest, because they have been doing such a good job supporting new compilers; heck, even lassen has gcc-12 on it from upstream RHEL repos. That's why the older RHEL isn't causing similar issues.

Anyway, I can certainly add the docker build/tag back in here while we address the downstream projects; I'll make that change.

@grondo
Contributor

grondo commented Jul 26, 2024

The theory on it being about time for focal is that it's two LTS releases ago, and the only distro we support that can't provide a reasonably recent compiler (the newest in focal repos seems to be gcc-10).

Yeah, this makes sense to me!

trws added 2 commits July 26, 2024 21:01
problem: we don't have recent distros, and some of them take a _long_
time to build because of repeated fetches from slow ports repos (aarch64
repos are really slow on ubuntu for some reason, so repeated apt update
== bad)

solution: add el9, fedora40, and ubuntu noble while refactoring noble to
use a single apt update and apt install pair as well as switching el9 to
use dnf (much faster than yum, and doesn't require separate update)
problem: We need to test the newer containers, and have only the one
image built for arm64

solution: update several of our tests to new container versions; build
noble, fedora40, el8 for arm. focal remains but should be removed in the
near future
@trws trws force-pushed the update-containers branch from 713ca07 to c6cc230 Compare July 27, 2024 04:01
@trws
Member Author

trws commented Jul 27, 2024

The focal docker build has been restored.

Contributor

@grondo grondo left a comment


This LGTM! Just spotted one possible leftover debug line in one of the commits.

Also, had to restart the coverage build because we hit an instance of #6078

Comment on lines 40 to 42

set -x
Contributor


Leftover debug?

Member Author


Yup, thanks for that.

@grondo
Contributor

grondo commented Jul 27, 2024

Not sure if we just want to throw on a commit that sets GCOV_ERROR_FILE=/dev/null here to avoid the occasional coverage failure (#6078), if you agree that's the right fix.
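
That workaround would amount to something like the following in the coverage environment; a sketch only, and whether it's the right fix is exactly the open question above:

    # discard libgcov error chatter (e.g. profile-merge complaints) instead of
    # letting it pollute test output and trip the coverage run
    export GCOV_ERROR_FILE=/dev/null
    make check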

trws added 4 commits July 27, 2024 10:31
problem: some hwloc versions try to connect to the x11 port if it exists
causing all kinds of problems

solution: set the env var to tell hwloc to not do that
problem: new docker images are not pushed with combined manifests
solution: push manifests in github actions
problem: checks_run only reports coverage if the initial check succeeds,
but reports success if recheck succeeds. That hides errors in the
coverage reporting and means we don't get some of our test coverage
whenever a spurious test failure happens during coverage

solution: move coverage reporting into a new POSTCHECKCMDS variable that
we evaluate if check succeeds or if recheck succeeds
problem: we keep getting merge issues with gcov files.  It seems the
default for updating these is "single", meaning single-threaded, unless
the compilation is done with `-pthread`. I'm guessing we see the errors
most on libraries we use in multiple places that are not themselves
compiled that way, even though they're eventually linked into binaries
that are. If this doesn't cover it, we should also do what's listed in

solution: add `-fprofile-update=atomic` to our coverage flags
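
A rough shell sketch of the checks_run change described in the POSTCHECKCMDS commit message above; the target and variable handling are simplified and not the script's actual logic:

    # report coverage whenever either the initial check or the recheck
    # succeeds, instead of only when the first attempt passes
    if make check || make recheck; then
        eval "$POSTCHECKCMDS"   # e.g. coverage collection/upload commands
        exit 0
    fi
    exit 1
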
@trws trws force-pushed the update-containers branch from c6cc230 to 81e233e Compare July 27, 2024 17:31
@trws
Member Author

trws commented Jul 27, 2024

Thanks for the pointer, I had missed that issue. Since we run tests in parallel, and some of these libraries get used in threads, I'm trying an explicit -fprofile-update=atomic in our coverage flags in hopes that's what's corrupting the files rather than something more arcane. At worst it shouldn't hurt; if it doesn't take care of it, we should probably do the env var as well.
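
A minimal sketch of the kind of flag change being described, assuming an autotools-style coverage build (the project's actual configure machinery may differ):

    # add atomic profile updates alongside the usual coverage instrumentation
    ./configure CFLAGS="-O0 -g --coverage -fprofile-update=atomic" LDFLAGS="--coverage"
    make -j "$(nproc)" && make check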

@trws
Member Author

trws commented Jul 27, 2024

Looks like we're green here pending branch check updates.

@grondo
Contributor

grondo commented Jul 27, 2024

I'm trying an explicit -fprofile-update=atomic in our coverage flags in hopes that's what's corrupting the files rather than something more arcane

Oh cool, I searched all around and didn't find that. My only worry is that the coverage check, which already takes the longest, will be slower with atomic updates.

@grondo
Contributor

grondo commented Jul 27, 2024

Anyway, this is already approved, so if you want, make the protected-branch updates and then set MWP. Thanks for the work here!

@trws
Member Author

trws commented Jul 28, 2024

I'm trying an explicit -fprofile-update=atomic in our coverage flags in hopes that's what's corrupting the files rather than something more arcane

Oh cool, I searched all around and didn't find that. My only worry is that the coverage check, which already takes the longest, will be slower with atomic updates.

Much as I agree, I'm in favor of having the results be reliable, and would probably propose we split the coverage testing into two runners if we end up needing to. That said, the runtime after the change is actually slightly lower than in another PR currently in flight. I assume that's just noise in the runs, but it doesn't seem to have had a meaningful performance impact, possibly because much of our code is already using atomic updates since it's visibly threaded at link time? Not sure; either way, it's definitely something to keep an eye on, but probably not an immediate issue.

@trws
Member Author

trws commented Jul 28, 2024

Ok, check requirements on master all updated, MWP set.

@mergify mergify bot merged commit 3dc3207 into flux-framework:master Jul 28, 2024
33 checks passed

codecov bot commented Oct 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.37%. Comparing base (8a2d204) to head (81e233e).
Report is 536 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6128   +/-   ##
=======================================
  Coverage   83.37%   83.37%           
=======================================
  Files         521      521           
  Lines       84653    84653           
=======================================
+ Hits        70577    70582    +5     
+ Misses      14076    14071    -5     

see 10 files with indirect coverage changes
