Orion, Intel environment for spack-stack-1.8.0 breaks the system tar command
#1355
Comments
This is a known bug in the Intel oneAPI distribution itself. I sent the Intel developers a bug fix for it at the beginning of the calendar year, and I also sent the bug fix to the Orion/Hercules sysadmins. According to @RatkoVasic-NOAA, this problem was fixed for some of the libraries in the oneAPI distribution, but maybe not all? |
Thanks for the response @climbfuji! Very helpful information. @RatkoVasic-NOAA do you think there might be some libraries in the oneAPI installation that have not been repaired yet? And repairing those might address this issue? Thanks! |
@srherbener I avoided that error by purging all loaded modules from my environment. For the spack-stack installation (both on Orion and Hercules) I started with 'module purge', and then all errors associated with "Failed to untar the file" disappeared. |
Isn't Also, I would be surprised if that really solved the problem - but I'd be happy to be surprised, for sure :-) |
@RatkoVasic-NOAA in the example environment setting (in the description above) I have a call to |
@srherbener here is what happened to me: while installing spack-stack, I was getting the same error message as you (I wasn't aware that I hadn't purged modules before installation).
Then I purged modules and the error messages disappeared. |
So that means the sysadmins didn't fix anything yet, you just unloaded the modules when you had a problem. I've seen in the past that some applications don't show this problem, while others do. On discover, for example, fv3-jedi would run fine, but geos-jedi failed with the above error. Someone other than a weird dude from a different agency with no purpose on orion/hercules should be making a lot of noise all the way up the hierarchy until the sysadmins fix this. |
Right after logging into orion, I see this:
Then I source our JCSDA, JEDI orion, intel environment script, which does a
Then I try tar:
which breaks. If I wipe out LD_LIBRARY_PATH, then the tar command works:
After some debugging, I discovered that the issue appears to be that the LD_LIBRARY_PATH we set according to the module loads (see the initial description above) places the spack-stack libxcrypt path in front of the system libcrypt path. So when tar executes, the wrong libcrypt library (i.e. the spack-stack one) gets loaded instead of the correct libcrypt library, which is the system one. Unfortunately, we need LD_LIBRARY_PATH to be set in the order we are getting so that the jedi-bundle build and tests all work correctly. |
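One way to act on this finding without changing the environment that jedi-bundle needs is to drop LD_LIBRARY_PATH for the single tar invocation only. The following is a minimal sketch under that assumption, not something verified on Orion; the wrapper name `systar` and the demo file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run tar with LD_LIBRARY_PATH removed from its
# environment, leaving the calling shell's setting untouched.
systar() {
    env -u LD_LIBRARY_PATH tar "$@"
}

# Example: create and list a small archive through the wrapper.
workdir=$(mktemp -d)
echo "hello" > "$workdir/hello.txt"
systar -czf "$workdir/demo.tar.gz" -C "$workdir" hello.txt
systar -tzf "$workdir/demo.tar.gz"
```

A shell function like this only helps for tar calls typed in that shell. It would not by itself change how tar is launched from inside the CMake configuration run by ecbuild.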
I see. How about prepending the system path to libxcrypt to LD_LIBRARY_PATH in the modulefile? Then exec will find that one first and use it instead of spack-stack's. |
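The prepend idea can be sketched as a small shell helper. This is only an illustration: the function name `prepend_ld_path` and the example directories are hypothetical, and whether putting a system directory first is safe for the jedi-bundle build is exactly the open question in this thread.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the front of LD_LIBRARY_PATH,
# removing any earlier occurrence so it is not listed twice.
prepend_ld_path() {
    dir=$1
    # Split the current value on ':', drop exact matches of $dir, rejoin.
    rest=$(printf '%s' "${LD_LIBRARY_PATH:-}" | tr ':' '\n' \
        | grep -vxF -- "$dir" | paste -s -d ':' -)
    LD_LIBRARY_PATH="$dir${rest:+:$rest}"
    export LD_LIBRARY_PATH
}

# Example in a subshell, using placeholder paths (assumptions, not Orion paths).
(
    LD_LIBRARY_PATH="/opt/example/spack/lib:/usr/lib64:/opt/example/intel/lib"
    prepend_ld_path /usr/lib64
    printf '%s\n' "$LD_LIBRARY_PATH"
)
```

With the placeholder value above, the example prints `/usr/lib64:/opt/example/spack/lib:/opt/example/intel/lib`. Note that a system directory placed first takes precedence for every library it contains, not only libcrypt.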
The underlying problem however is this:
It only shows up in libcrypto because the spack-stack libcrypto ldd-s to libimf.so, which has the bug I described above. |
I looked into this further and discovered that libimf.so does not appear to have the fault (a shared library that is wrongly marked as a static library) that @climbfuji reported. I tried running
It might be the case that the libimf.so library has not had libm.so properly linked in, but the libimf.so library appears to be correctly marked as a shared library and running
which appears that the However, I did find that the libirc.so file does indeed have the problem @climbfuji originally reported (libirc.so is involved in the error @climbfuji reported long ago to Intel).
It appears that loading the spack-stack built libcrypt.so.2 library instead of the system libcrypt.so.2 library introduced the Intel oneAPI libraries into the mix and somehow confused the dynamic loader. At this point, I think the pragmatic path forward is to stop investing time in a fix now and defer this issue to spack-stack-1.9.0, which gives us more time to resolve it. In this spirit, I have added post-release notes for spack-stack-1.8.0 on the spack-stack wiki describing a manual workaround for this issue (https://github.com/JCSDA/spack-stack/wiki/Post%E2%80%90release-updates-for-spack%E2%80%90stack%E2%80%901.8.0), which can tide us over until 1.9.0. |
I think it is still worthwhile to submit a Priority support ticket to Intel about the libirc.so issue. Someone with a NOAA email can submit a Priority support ticket here: https://supporttickets.intel.com/. Unfortunately, my NOAA email has been deactivated and my UCAR email does not grant me access to a Priority support request. Could someone at NOAA please submit the Priority support ticket on my behalf? Thanks!

Here is text that we could place in the ticket:

We have Intel oneAPI version 2023.1.0 installed on one of our HPC platforms and we are running into trouble using the oneAPI-provided libirc.so library. This library appears to be intended to be loaded as a shared library, but the dynamic loader thinks it is a static library and fails to load the dependencies of libirc.so, ultimately causing an undefined reference error.

Running ldd on the installed libirc.so reveals that this file is understood to be a static library:

herbener$ ldd /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so

However, running the file command on the installed libirc.so file indicates the file is a shared library:

herbener$ file /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so

Running the following patchelf command is known to fix this issue,

patchelf --add-needed libc.so.6 path-to-libirc.so-file

but we would like to not have to negotiate with our HPC provider IT group to implement this workaround. Can we please get this addressed in the oneAPI installation so that the libirc.so file is properly understood by the dynamic loader to be a shared library? |
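To check whether other libraries in the oneAPI installation show the same symptom, the ldd check from the ticket text could be run over a whole directory. This is a rough sketch, assuming that ldd reports such files as "statically linked" (which glibc's ldd does for objects with no NEEDED entries); the function name `find_static_marked_libs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list .so files under a directory that ldd reports
# as "statically linked", the symptom described above for libirc.so.
find_static_marked_libs() {
    find "$1" -name '*.so*' -type f | while IFS= read -r lib; do
        if ldd "$lib" 2>/dev/null | grep -q 'statically linked'; then
            printf '%s\n' "$lib"
        fi
    done
}
```

It would be called with the directory that holds the Intel runtime libraries, for example the `lib/intel64` directory from the paths quoted in the ticket text. A file showing up in the output is only a candidate for closer inspection, since some libraries have no dependencies by design.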
The Intel Priority Support ticket number is:
|
This ticket has been closed with the explanation that libirc.so is marked static intentionally, and we need to figure out how to use libintlc.so in its place. Not very helpful, but at least we know a little more about this issue. |
Describe the bug
The Orion, Intel spack-stack-1.8.0 environment, specifically the LD_LIBRARY_PATH setting, interferes with the execution of the system tar command. See the next section on reproducing the error.

Simply running tar outside of the ecbuild command gets the same failure. After some tracing it appears that things go awry when loading the gzip functionality, where the spack-stack /apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/intel/2021.9.0/libxcrypt-4.4.35-ebrdc3w/lib/libcrypt.so.2 shared library gets loaded instead of the system libcrypt library (/usr/lib64/lib/libcrypt.so.2).

I found that if LD_LIBRARY_PATH is unset (or /usr/lib64/lib is prepended to the front), the tar command then works properly.
I need help with coming up with a workable fix for this. I've tried
Does anyone have any ideas about how to address this?
Also, could someone check my environment setup to make sure I'm not missing something.
Thanks!
To Reproduce
Steps to reproduce the behavior:
load the intel environment by sourcing a script file that contains the following sequence:
The export and module use commands near the end are the workaround to get the proper udunits package to load.
Run ecbuild:
This results in the following error:
Expected behavior
The tar command run from the CRTM CMake configuration should complete successfully.
System:
Orion, Intel
Additional context
Add any other context about the problem here.