Segfault in GEOS-GCM with Intel on AWS ParallelCluster #988
Comments
Well I have never seen that error. From my looking online, I thought perhaps it was due to What does Also, just as a trial, can you try running
do:
Not sure it will help, but it's another clue/test. |
Hmm. I don't see the relink error in that last one. Only the fargparse one that I still think might be related to the GCC version used. |
I am sorry, I got the two issues messed up. No, the relink error is still there for Intel. I will delete the GNU segfault comments above and repost them in that issue. |
I need to recompile again on AWS, will take a bit. Will give you the |
@mathomp4 Finally got to this. Interestingly, same for
|
@mathomp4 I got the same problem on your very own Discover (Milan):
The experiment is in
the build directory is
and the install directory is
Finally, the modules that I load are in
I wonder if the Intel version is a problem? On AWS ParallelCluster, it's |
Well, we can use that version of Intel Fortran and Intel MPI. I think the only issue we've had with it is with MAPL3. But let me be doubly sure. I'm building some Baselibs today anyway, I'll do a "latest with everything" one. Also, what's the backing GCC for this? In my tests with latest ifort I've been using:
|
This is the compiler config:
This is the impi config:
Finally, Python is built by Spack (3.10.13). To "drive" spack, we use the default/OS Python3. |
Welp, my Intel 2021.11 + Intel MPI 2021.11 worked:
Mine is C24 and yours is C12, but that shouldn't matter. It also looks like you are using a few older subrepos than I am, but all those updates were pretty minor. Hmm. |
Doing an
and I don't. Why is that showing up there?? |
Indeed it's only in your executable:
So our Baselibs build doesn't add it, but something in the spack-stack seems to trigger its addition and at the end! |
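For anyone who wants to repeat this kind of comparison, here is a minimal sketch (not what was actually run in this thread) that collects the library names reported by `ldd` for two executables and lists the ones linked into only one of them. The function names are made up for illustration, and it assumes `ldd` is on the `PATH` and that both builds are reachable from the same machine.

```python
import subprocess


def parse_ldd(output):
    """Return the set of library names listed in `ldd` output text."""
    names = set()
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Lines look like "libfoo.so.1 => /path/libfoo.so.1 (0x...)"
        # or "linux-vdso.so.1 (0x...)"; the name is the first field.
        names.add(line.split()[0])
    return names


def ldd_names(executable):
    """Run `ldd` on an executable and return its library names."""
    result = subprocess.run(
        ["ldd", executable], capture_output=True, text=True, check=True
    )
    return parse_ldd(result.stdout)


def extra_libraries(ours, theirs):
    """Libraries that appear in `theirs` but not in `ours`, sorted."""
    return sorted(theirs - ours)
```

Calling `extra_libraries(ldd_names(path_a), ldd_names(path_b))` with the paths of the two builds of the same executable then prints only the libraries that the second build pulls in and the first does not. Comparing names rather than full lines avoids false differences from load addresses and install prefixes.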
This is so puzzling. There's nothing special in the spack-stack build that I can think of. |
Well, I am seeing something interesting which points to...something. Namely, Baselibs (for historical reasons) builds pretty much everything as a static library. ESMF and zlib seem to be built as both, but for things like HDF5, netcdf-c and netcdf-fortran, it's static-only all the way. I've never changed that in Baselibs because, well, what we have works and I don't want to rock the boat. So, if you do an |
Also, when I built Baselibs with 2021.11 I was using But as you said above, you got this error on AWS as well with Something in the link chain must be bringing in |
@mathomp4 I did identical builds on SCU16 and SCU17 with Intel. SCU16 runs fine, SCU17 has the above relink issue. I then compared the log files one by one, and the only difference that is persistent across the build process is that |
@mathomp4 @srherbener @AlexanderRichert-NOAA @RatkoVasic-NOAA - this is something for the spack-stack known issues. The error described above, which occurs on AWS ParallelCluster and Discover SCU17 with [email protected], is resolved by this suggestion from Intel:
Now, the problem is that on most systems the compilers are provided by the sysadmins, which means we'd have to ask them to fix that in the authoritative installation, or install the compilers ourselves, provided that there's enough space. It's easy after all. Note that I did not really check later versions of oneAPI, but this is a strong indicator that the problem still exists in oneAPI 2024.1:
|
Describe the bug
I am getting this error when I try to run GEOS-GCM (or other trivial executables that are part of GEOS-GCM) compiled with the Intel oneAPI compilers:
I am getting this on two AWS ParallelCluster systems:
I searched for similar errors on the web and found only a few pages, but none of them had any information that helped with my problem.
I am only getting this problem with GEOS-GCM. All other applications (JEDI-UFS, JEDI-MPAS, ...) are working just fine on the same system with the same spack-stack.
To Reproduce
Build spack-stack develop (or, until merged, PR #977) on AWS ParallelCluster (tried versions 3.2.0 and 3.8.0) using Ubuntu (tried 20.04 and 22.04) with Intel oneAPI compilers (see above), then build and set up GEOS using @mathomp4's TinyBCs tarball/scripts/input data.
Expected behavior
It works (like it does on Discover)
System:
See above
Additional context
The same spack-stack, GEOS code and TinyBCs tarball work fine with
[email protected]
on AWS ParallelCluster. I created a ticket for Intel to look at this, too: https://community.intel.com/t5/Intel-C-Compiler/Relink-path-to-libirc-so-with-lib-x86-64-linux-gnu-libc-so-6-for/m-p/1568066/emcs_t/S2h8ZW1haWx8bWVudGlvbl9zdWJzY3JpcHRpb258TFMyQzhDWEMzMk5VRUF8MTU2ODA2NnxBVF9NRU5USU9OU3xoSw#M41713