Update containers: el9, fedora40, noble, add arm64 for fedora40, el9, noble #6128
Conversation
Force-pushed from 7995d61 to c0965c3.
Almost everything works, but there's an issue with MPI on the two most recent distros, fedora40 and noble. It makes me wonder whether there's another issue with PMI in the newer mpich. The problem goes away in the test that sets PSM3_HAL (not sure what that does or why, but it fixes it 🤷). @garlick, is there a chance this is related to the PMI issue you mentioned with hydra/mpich? I remember that one was going the other way, but the cause of the bootstrap issue isn't jumping out at me.
Right, I should have mentioned I saw that was fixed, sorry. The current behavior seems to be that "flux run -n 8 abort", say, runs eight rank-0 processes. Everything works in the one test that sets PSM3_HAL, though, and I can't offhand remember what that does. Clearly something is going funny with PMI, so I was vaguely hoping it was related; maybe it needs a fix on the mpich side or a different workaround from us.
Looks like the PSM3_HAL variable was set to address mpich problems in fedora39. Presumably we need it for fedora40 as well. See also (not sure I understand yet why that is needed).
Thanks for that @garlick. I'm guessing it's related to the MPI version, but it looks like what that variable does is turn off all multi-node features... that's potentially bad. I'll do a bit more digging, but this seems like an issue we may legitimately need a fix for; maybe we're picking up an MPI that needs PMIx?
The MPI tests do check for the "N singletons" type of error, because that is a common one. For example:

    test_expect_success 'mpi hello various sizes' '
        run_timeout 30 flux submit --cc=1-$MAX_MPI_SIZE $OPTS \
            --watch -n{cc} ${HELLO} >hello.out &&
        for i in $(seq 1 $MAX_MPI_SIZE); do \
            grep "There are $i tasks" hello.out; \
        done
    '

I think the PSM3 environment variable just needs to be set for fedora, because fedora now ships an environment module with mpich (based on my comment in #5694) and presumably they included PathScale support that has to be suppressed. I didn't look into why it's set for one fedora builder but not the other.
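For reference, a minimal sketch of how that workaround is typically applied in a test or CI environment; the variable value and the hello-world invocation here are assumptions for illustration, not necessarily what this PR does:

    # Hypothetical workaround: pin the PSM3 provider to its loopback HAL so
    # fedora's mpich module doesn't probe fabric hardware inside the container.
    # The value "loopback" is an assumption; check the actual CI/test config.
    export PSM3_HAL=loopback
    flux run -n 8 ./mpi_hello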
Have I mentioned I'm frustrated with Ubuntu lately? This is just absolutely bonkers: https://bugs.launchpad.net/ubuntu/+source/mpich/+bug/2072338. They linked their mpich against libpmix in noble, so not only does mpich not work with us, it doesn't even work with hydra. 🤯
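A quick way to confirm that kind of linkage on a noble image (a generic diagnostic sketch; which launcher binary is actually on PATH may differ):

    # Show which PMI/PMIx libraries Ubuntu's mpich launcher is linked against.
    ldd "$(command -v mpiexec.hydra || command -v mpiexec)" | grep -i pmi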
🤦🤦🤦🤦🤦
Force-pushed from 45dc5d0 to d76843a.
It is a bit disturbing that this change drops code coverage by >5%.
I think that might be because the el9 test just wouldn't finish. If it's still like that when everything is completing, I'm planning to dig into it more. Maybe we're only doing coverage on an RPM distro now rather than both deb and rpm? Not sure; either way, I definitely want to fix that before changing check requirements and merging this.
Ah, we only do two coverage runs because they take too long. Currently I think there's a coverage build that uses the default image, and then one that just runs the "system" tests.
Force-pushed from d31c3cb to 240de5c.
Ok, I don't have the new codecov comment yet, but all the coverage has posted and the Python coverage is reporting the same numbers as on master. This should be ready for review.
Let me take another look, but if all the tests are passing then it is probably all good!
It is probably fine to drop focal support in flux-sched whenever; nothing says other projects have to support the same distros as flux-core. Looks like focal doesn't reach EOL until Apr 2025, though I doubt we have many users on it (I still have a couple of VMs at that version).
Oh, one reminder: we should update branch protections just before merging this one (we may have to, since there are check names that no longer apply).
Also, it appears some subprojects still use focal; it might be best to keep that image for now until they update, or we can transition them to newer images before landing this PR (flux-coral2, flux-pmix, and flux-pam are all affected).
Yup, various subprojects still use the older images, since I started from the bottom. The theory on it being about time to drop focal is that it's two LTS releases old and the only distro we support that can't provide a reasonably recent compiler (the newest in the focal repos seems to be gcc-10). The RHEL images have gone from some of the most difficult to deal with to some of the easiest, because they have been doing such a good job supporting new compilers; heck, even lassen has gcc-12 on it from upstream RHEL repos. That's why the older RHEL isn't causing similar issues. Anyway, I can certainly add the focal docker build/tag back in here while we address the downstream projects; I'll make that change.
Yeah, this makes sense to me! |
problem: we don't have recent distros, and some of them take a _long_ time to build because of repeated fetches from slow ports repos (aarch64 repos are really slow on ubuntu for some reason, so repeated apt update == bad)

solution: add el9, fedora40, and ubuntu noble, refactoring noble to use a single apt update / apt install pair and switching el9 to use dnf (much faster than yum, and doesn't require a separate update)
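For illustration, a minimal sketch of the single-update pattern described above; the package names are placeholders, not the actual contents of the images:

    # Hypothetical noble layer: one apt-get update feeding one apt-get install,
    # so the slow aarch64 ports mirrors are only hit once per image build.
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential libhwloc-dev && rm -rf /var/lib/apt/lists/*

    # Hypothetical el9 layer: dnf refreshes metadata as part of install,
    # so no separate update step is needed.
    dnf install -y gcc make hwloc-devel && dnf clean all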
problem: we need to test the newer containers, and have only the one image built for arm64

solution: update several of our tests to the new container versions; build noble, fedora40, and el8 for arm. focal remains but should be removed in the near future
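As a rough idea of what producing one of the new multi-arch images involves (a generic buildx sketch; the tag and directory names are placeholders, not the project's actual build scripts):

    # Hypothetical multi-arch build and push of the noble image.
    docker buildx build --platform linux/amd64,linux/arm64 \
        -t example/flux-testenv:noble --push src/test/docker/noble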
The focal docker build has been restored.
This LGTM! Just spotted one possible leftover debug line in one of the commits.
Also, had to restart the coverage build because we hit an instance of #6078
t/t3003-mpi-abort.t (outdated):

    set -x
Leftover debug?
Yup, thanks for that.
Not sure if we just want to throw that on a commit that sets …
problem: some hwloc versions try to connect to the X11 port if it exists, causing all kinds of problems

solution: set the env var to tell hwloc not to do that
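If it helps, the knob involved is probably along these lines (a sketch only; which hwloc component gets blacklisted, and whether it's set in the image or the test environment, are assumptions on my part):

    # Hypothetical: blacklist the hwloc plugin that opens X11 displays during
    # topology discovery. The component name is an assumption; verify against
    # the env var the images actually set.
    export HWLOC_COMPONENTS=-gl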
problem: new docker images are not pushed with combined manifests

solution: push the combined manifests in GitHub Actions
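Roughly, combining per-arch tags into one multi-arch tag looks like this (generic docker commands with placeholder image names, not the actual workflow steps):

    # Hypothetical manifest push tying the amd64 and arm64 images together
    # under a single tag.
    docker manifest create example/flux-testenv:fedora40 \
        example/flux-testenv:fedora40-amd64 \
        example/flux-testenv:fedora40-arm64
    docker manifest push example/flux-testenv:fedora40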
problem: checks_run only reports coverage if the initial check succeeds, but reports success if the recheck succeeds. That hides errors in the coverage reporting and means we don't get some of our test coverage whenever a spurious test failure happens during a coverage run.

solution: move coverage reporting into a new POSTCHECKCMDS variable that we evaluate if the check succeeds or if the recheck succeeds
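A minimal sketch of that flow, with placeholder variable names standing in for whatever the real checks_run script uses:

    # Run the check; retry once on failure; run post-check commands (coverage
    # reporting) whenever either attempt succeeded, so coverage upload errors
    # are no longer hidden behind the retry.
    if eval "$CHECKCMDS" || eval "$RECHECKCMDS"; then
        eval "$POSTCHECKCMDS"
    else
        exit 1
    fi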
problem: we keep getting merge issues with gcov files. It seems the default for updating these is "single", meaning single-threaded, unless the compilation is done with `-pthread`. I'm guessing we see the errors most on libraries we use in multiple places that are not themselves compiled that way, even though they're eventually linked into binaries that are. If this doesn't cover it, we should also do what's listed in

solution: add `-fprofile-update=atomic` to our coverage flags
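For context, roughly how that flag slots into a coverage build (an illustrative configure invocation, not the project's actual coverage setup):

    # Atomic gcov counter updates avoid corrupt or conflicting .gcda files when
    # instrumented code runs from multiple threads or parallel test jobs.
    ./configure CFLAGS="-O0 -g --coverage -fprofile-update=atomic" \
                LDFLAGS="--coverage"
    make -j"$(nproc)" && make check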
Thanks for the pointer; I had missed that issue. Since we run tests in parallel, and some of these libraries get used in threads, I'm trying to add an explicit …
Looks like we're green here pending branch check updates.
Oh cool, I searched all around and didn't find that. My only worry is that the coverage check, which already takes the longest, will be slower with atomic updates.
Anyway, this is already approved if you want to make the protected-branch updates and then set MWP. Thanks for the work here!
Much as I agree, I'm in favor of having the results be reliable, and would probably propose we split the coverage testing across two runners if we end up needing to. That said, the runtime after the change is actually slightly lower than in another PR currently in flight. I assume that's just noise in the runs, but it doesn't seem to have had a meaningful performance impact. Possibly because much of our code already uses atomic updates, since it's visibly threaded at link time? Not sure; either way, it's definitely something to keep an eye on, but probably not an immediate issue.
Ok, check requirements on master all updated, MWP set. |
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@            Coverage Diff            @@
    ##           master    #6128    +/-   ##
    =========================================
      Coverage   83.37%   83.37%
    =========================================
      Files         521      521
      Lines       84653    84653
    =========================================
    + Hits        70577    70582        +5
    + Misses      14076    14071        -5
This ended up being more churn than I intended, but once I got into it, here's where I ended up.
Effects:
All of these are built and pushed for all appropriate architectures. The bookworm, jammy, and alpine tweaks shouldn't have any impact on existing tests, since they're additions or moves; the rest are new containers under new names.
TODO: