
Add GTL #716

Merged · 27 commits merged into JuliaParallel:master on Jul 25, 2023

Conversation

JBlaschke
Contributor

Yay!!! GPU-aware MPI breaks more ABIs. Here's an example of what happens without loading GTL before a libmpi that needs it:

MPICH ERROR [Rank 0] [job id 5680267.11] [Wed Feb 22 23:01:58 2023] [nid002845] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
 (Other MPI error)

aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked

(sorta makes sense I guess 🤨 ... vendors don't want to have to compile two different libmpis ... just insert a libgtl whenever GPUs are around ... yes, that's wayyyy better 😛 )

In the case of some system MPI libraries, GPU-aware MPI is implemented as a separate library -- bearing the fancy name of GPU Transport Layer (GTL). For example, on Perlmutter it's called libmpi_gtl_cuda.so. Often it's important that this library is loaded before libmpi. These changes:

  1. Leave the default behavior unchanged.
  2. Add an MPIPreferences option, gtl_names, which -- if not nothing -- is a list of possible names for the GTL library.
  3. Make MPI dlopen the GTL library before libmpi (when gtl_names is not nothing); see the sketch after this list.
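A rough sketch of the load order described in point 3 (illustrative only, not the PR's actual code; the candidate names come from later in this thread):

```julia
using Libdl

# Try each candidate GTL name and load the first one that resolves, with
# RTLD_GLOBAL so its symbols are visible to libmpi -- and do this *before*
# libmpi itself is opened.
function load_gtl_then_mpi(gtl_names, libmpi_name)
    if gtl_names !== nothing
        for name in gtl_names
            handle = Libdl.dlopen_e(name, Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
            handle != C_NULL && break   # stop at the first GTL that loads
        end
    end
    return Libdl.dlopen(libmpi_name, Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
end

# e.g. load_gtl_then_mpi(["libmpi_gtl_cuda", "libmpi_gtl_hsa"], "libmpi_cray")
```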

I have tested this on Perlmutter and will test on Crusher next. I also don't know whether I accidentally broke MPITrampoline; I will check that ASAP.

This PR represents a tradeoff. Clearly there is no standard way that GTL is defined. So I avoided creating a default search strategy. One could be tempted to look for Cray systems and then "just load GTL". This would cause problems on our CPU nodes, which have the GTL libraries installed (we want to have a single SW image for all nodes), but don't support it (what GPUs? this is a CPU node!).

This PR allows us (the helpful sysadmins) to provide two different Preferences.toml files for each type of node. It does come at the cost of potentially forcing users to manage different LocalPreferences.toml files if they have MPI in their LocalPreferences.

@sloede
Member

sloede commented Feb 26, 2023

It does come at the cost of potentially forcing users to manage different LocalPreferences.toml files if they have MPI in their LocalPreferences.

What would they have to do right now to make it work? Or asked differently: Is this changing things from "inconvenient" to "even more inconvenient" for users on systems without sysadmin support, or from "impossible" to "doable but inconvenient"? If it is the latter, I think it will be an improvement nonetheless, wouldn't it?

@simonbyrne
Member

This is quite a cumbersome patch just to deal with Cray MPI, I wish there was a better way. What does mpi4py do?

@simonbyrne
Member

simonbyrne commented Feb 26, 2023

Since it's only required at runtime, we could just do it based on environment variables?

This would cause problems on our CPU nodes, which have the GTL libraries installed (we want to have a single SW image for all nodes), but don't support it (what GPUs? this is a CPU node!).

How is this logic handled for C programs?

@JBlaschke
Contributor Author

JBlaschke commented Feb 26, 2023

@sloede

What would they have to do right now to make it work? Or asked differently: Is this changing things from "inconvenient" to "even more inconvenient" for users on systems without sysadmin support, or from "impossible" to "doable but inconvenient"? If it is the latter, I think it will be an improvement nonetheless, wouldn't it?

Right now, they would either have to use:

LD_PRELOAD=${CRAY_MPICH_ROOTDIR}/gtl/lib/libmpi_gtl_cuda.so

or add

Libc.Libdl.dlopen("libmpi_gtl_cuda.so", Libc.Libdl.RTLD_GLOBAL)

before the first using MPI line.

Re advice for users who can't ask the sysadmin, I would document an example use_system_binary call -- they would have to do that anyway. So something like:

MPIPreferences.use_system_binary(; library_names=["libmpi_cray"], gtl_names=["libmpi_gtl_cuda", "libmpi_gtl_hsa"], mpiexec="srun")

would cover both Perlmutter and Frontier.

@JBlaschke
Contributor Author

This is quite a cumbersome patch just to deal with Cray MPI, I wish there was a better way. What does mpi4py do?

They do the same thing that Cray always tells us to do: "use the compiler wrappers to build mpi4py".

@JBlaschke
Contributor Author

JBlaschke commented Feb 26, 2023

@simonbyrne

Since it's only required at runtime, we could just do it based on environment variables?

Urgh ... If it were up to me alone, then sure! Let's put in an env variable. But I kinda like the idea of having preferences managed by ... well ... Preferences (with a capital "P"). Anyway, GTL is part of using the system binary, so I think keeping this alongside libmpi makes sense.

How is this logic handled for C programs?

They are compiled using the Cray compiler wrappers -- I don't know how the compiler wrappers work in detail (no / very limited documentation). I suspect they futz around with the linker to make sure that GTL is linked before MPI. Note: when you build a program with GTL enabled, it can't run on CPU nodes. So the compiler wrappers do insert something....

@simonbyrne
Member

Note: when you build a program with GTL enabled, it can't run on CPU nodes.

Ah, that's disappointing. I was hoping there was some magic environment variable set on your GPU nodes that we could rely upon. Ah well.

This PR allows us (the helpful sysadmins) to provide two different Preferences.toml files for each type of node.

Why do you need two Preferences.toml files for each type of node?

@simonbyrne
Member

I need to go to bed, but I can get on board with this if we make it a little less Cray-specific: what if we just called it preload, and it could be a list of libraries that get dlopen-ed before libmpi?

@JBlaschke
Contributor Author

Ah, that's disappointing. I was hoping there was some magic environment variable set on your GPU nodes that we could rely upon. Ah well.

That's what I was hoping for also...

Why do you need two Preferences.toml files for each type of node?

One with the preloads and one without.

@JBlaschke
Contributor Author

JBlaschke commented Feb 26, 2023

@simonbyrne

I need to go to bed, but I can get on board with this if we make it a little less Cray-specific: what if we just called it preload, and it could be a list of libraries that get dlopen-ed before libmpi?

I like this! It would be a bit more effort, but would cover a broader set of use cases. If we can define preloads that depend on an env var (MPICH_GPU_SUPPORT_ENABLED, https://docs.nersc.gov/development/compilers/wrappers/#set-the-accelerator-target-to-gpus-for-cuda-aware-mpi-on-perlmutter ) then we might be able to get away with just one set of preferences. (That env var is turned off on the CPU nodes)
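A minimal illustration of that idea (hypothetical, not part of this PR as written): gate the preload on the env var, so a single set of preferences could serve both node types.

```julia
using Libdl

# Hypothetical sketch: only preload the GTL when Cray's switch says GPU-aware
# MPI is wanted; on CPU nodes the variable is unset or 0, so nothing is loaded.
if get(ENV, "MPICH_GPU_SUPPORT_ENABLED", "0") == "1"
    Libdl.dlopen("libmpi_gtl_cuda", Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
end
```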

@vchuravy
Member

we can define preloads that depend on an env var MPICH_GPU_SUPPORT_ENABLED

So I am wondering if we should do something like:

  1. Add a vendor flag to MPIPreferences and try to autodetect cray
  2. When vendor is cray, check the environment variable MPICH_GPU_SUPPORT_ENABLED
    and attempt a Libc.Libdl.dlopen("libmpi_gtl_cuda.so", Libc.Libdl.RTLD_GLOBAL)?

Now the big question for me is how to deal with rocm vs cuda... and do we need something like a LD_LIBRARY_PATH (I would expect the module file to set that correctly)?
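For the autodetection part, a hypothetical sketch (treating the presence of CRAY_MPICH_ROOTDIR, used earlier in this thread, as the "cray" marker -- whether that is a reliable marker on every Cray system is an assumption):

```julia
# Hypothetical vendor detection; not necessarily what MPIPreferences ends up doing.
detect_vendor() = haskey(ENV, "CRAY_MPICH_ROOTDIR") ? "cray" : "unknown"
```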

@JBlaschke
Contributor Author

Vendor flags might make a lot of sense for a different reason, @vchuravy: as it stands right now, if a user loads, say, PrgEnv-nvidia, then they have to load libmpi_nvidia, not libmpi_cray, and so on. Sometimes these are compatible (e.g. libmpi_cray seems to work with PrgEnv-gnu), but not always.

I build different Julia modules for each PE, but if a user rolls their own Julia environment then it might rely on a specific PE. Having smarter logic in MPI.jl would fix that.

Now the big question for me is how to deal with rocm vs cuda

The rocm version of gtl is called libmpi_gtl_hsa. My first attempt would be to try to detect either using find_library. I noticed that Perlmutter has both libmpi_gtl_cuda.so and libmpi_gtl_hsa.so, but find_library only detects the cuda version. Maybe on an AMD machine it's the other way around.
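For reference, a minimal illustration of that first-match behaviour (library names as on Perlmutter; availability elsewhere is an assumption):

```julia
using Libdl

# find_library returns the first name in the list it can resolve ("" if none),
# so when both GTL flavours are installed, the order of this list decides
# which one you get back.
gtl = Libdl.find_library(["libmpi_gtl_cuda", "libmpi_gtl_hsa"])
```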

@giordano
Member

How is libgtl related to libmpi? The latter requires something (symbols?) from the former? If so, does libmpi dynamically link to libgtl (i.e. what's the output of readelf -d /path/to/libmpi)? If the answer to this last question is no, then something looks broken to me in how this is packaged up, but perhaps I'm missing something.

@vchuravy
Member

@giordano my assumption is that they use dlsym to see if the library is preloaded/linked into the binary.

Right now I just want to shout at cray and have everyone use OpenMPI.

@JBlaschke
Contributor Author

JBlaschke commented Feb 26, 2023

Yea, there are no symbols that libmpi needs from libmpi_gtl_* that I can see -- otherwise we'd be getting linker errors, not Cray's own error.

What @vchuravy says makes sense. I can't find any documentation on this (other than: if you see this error, recompile with the Cray compiler wrappers).

I'm following this up at NERSC to see if Cray would be willing to change the behavior of Cray MPICH. We should still work on vendor flags in the meantime.

@simonbyrne
Member

Why do you need two Preferences.toml files for each type of node?

One with the preloads and one without.

I don't understand: wouldn't you always want the preloads for the GPU nodes, and no preloads on non-GPU?

@JBlaschke
Contributor Author

JBlaschke commented Feb 26, 2023

I don't understand: wouldn't you always want the preloads for the GPU nodes, and no preloads on non-GPU?

You mean CPU? Mainly for sanity: I don't know what GTL will do on a system without GPUs ...

This is also a more general issue: NERSC has a history of systems with different kinds of nodes, and different kinds of nodes might keep libraries in different places, etc. @simonbyrne, I like your approach of keeping it general. NERSC has been using Slurm and the module system to give users a way to deploy their codes on different hardware (e.g. Cori GPU), and this isn't unique to NERSC.

@simonbyrne
Member

Are you able to join the JuliaHPC meeting on Tuesday? It might be easier to discuss there.

@JBlaschke
Contributor Author

Right now I just want to shout at cray and have everyone use OpenMPI.

@vchuravy in 20 years we'll have compatible ABIs https://www.mpich.org/abi/

@JBlaschke
Contributor Author

Are you able to join the JuliaHPC meeting on Tuesday? It might be easier to discuss there.

Sadly no. I can do an impromptu call at 9am PT tomorrow (Monday)

@JBlaschke
Contributor Author

Quick update: I just confirmed that no preloads are necessary when setting MPICH_GPU_SUPPORT_ENABLED=0 -- so users who don't want GPU-aware MPI don't need to preload GTL, as long as they also have something like export MPICH_GPU_SUPPORT_ENABLED=0 in their environment.

This doesn't get us off the hook completely, though, as we still need to preload GTL for GPU-aware MPI. The nice thing is that we don't strictly need to avoid preloading GTL either, as MPICH_GPU_SUPPORT_ENABLED=0 does seem to turn it off even when the library is preloaded.

I am going to work on vendor flags regardless, as they might still be useful for automatically adding vendor preloads (e.g. picking the "right" GTL for AMD vs Nvidia).

@simonbyrne
Member

What if we were to load the GTL if MPICH_GPU_SUPPORT_ENABLED=1 is set? Can we simply assume that it is in the same directory as the libmpi?

simonbyrne added a commit that referenced this pull request Feb 27, 2023
Possible alternative to #716
@vchuravy
Member

What is nm -D on these libraries?

@JBlaschke
Contributor Author

What if we were to load the GTL if MPICH_GPU_SUPPORT_ENABLED=1 is set? Can we simply assume that it is in the same directory as the libmpi?

Right, so that's the spirit behind #717 -- that would solve part of the problem, but makes us vulnerable to env var names changing. Also it doesn't help decide between libmpi_gtl_hsa and libmpi_gtl_cuda. Parsing the output from `CC --cray-print-opts=libs` would solve that particular problem. GTL libraries are not guaranteed to be in the same directory as MPI. However, they are in the LD_LIBRARY_PATH, so I think we can safely rely on them being loaded by name.
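A hypothetical sketch of that wrapper-parsing idea (it assumes `CC` is on the PATH and prints GNU-style linker flags such as -lmpi_gtl_cuda; both assumptions would need checking on a real system):

```julia
# Ask the Cray compiler wrapper which libraries it would link, and pick out
# any GTL library from that list.
function detect_gtl_from_wrapper()
    opts = try
        read(`CC --cray-print-opts=libs`, String)
    catch
        return nothing   # wrapper not available (e.g. not inside a Cray PE session)
    end
    for m in eachmatch(r"-l(mpi_gtl_\w+)", opts)
        return "lib" * m.captures[1]   # e.g. "libmpi_gtl_cuda"
    end
    return nothing
end
```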

@JBlaschke
Contributor Author

What is nm -D on these libraries?

@vchuravy Here you go:

@vchuravy
Member

So now the question is what is exported on Frontier/Crusher. My worry is that the symbols overlap, and that it would only be legal to preload one of them.

@JBlaschke
Contributor Author

Ah! Cleaning up the formatting seems to have solved the docs-build problem.

@JBlaschke
Contributor Author

Can someone familiar with CI comment on what I should do about the failing tests? Right now I don't understand how (or whether) my changes triggered these regressions.

@JBlaschke
Contributor Author

@simonbyrne any chance you can merge this?

@simonbyrne
Member


To try to keep backward compatibility, we should only update the _format key if the configuration uses the new features. Otherwise, we can keep it at "1.0".
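A minimal sketch of that idea (the variable name is an illustrative stand-in, not the merged code):

```julia
# Keep the preferences format at "1.0" unless the new feature is actually used,
# so older MPIPreferences versions can still read the file; `vendor` stands in
# for the new-feature input here.
choose_format(vendor) = vendor === nothing ? "1.0" : "1.1"

choose_format(nothing)  # "1.0"
choose_format("cray")   # "1.1"
```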

JBlaschke and others added 4 commits July 13, 2023, including:

- Only bump format for where the new version is needed (Co-authored-by: Simon Byrne <[email protected]>)
- only require v1.1 if vendor is input (Co-authored-by: Simon Byrne <[email protected]>)

@vchuravy
Member

LGTM!

@JBlaschke
Contributor Author

JBlaschke commented Jul 20, 2023

Ok, so I've cleaned things up a bit. I moved all the preload logic to MPIPreferences. This way, if the vendor logic changes, then we only need to bump MPIPreferences. It also avoids code duplication. I tested on Perlmutter. Things are working nicely.

@simonbyrne @vchuravy feel free to merge.
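For anyone landing here later, a hedged example of what the setup on a Cray system might then look like (the exact keyword names are whatever the merged MPIPreferences docs say; this is only a sketch based on the vendor flag discussed in this PR):

```julia
using MPIPreferences

# Sketch: point MPI.jl at the system MPICH and let the vendor flag added in
# this PR take care of the GTL preload; mpiexec = "srun" matches the
# Perlmutter example earlier in the thread.
MPIPreferences.use_system_binary(; vendor = "cray", mpiexec = "srun")
```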

@simonbyrne
Member

LGTM

Just need to add the docstring to the docs:
https://github.com/JuliaParallel/MPI.jl/actions/runs/5616415928/job/15218697196?pr=716#step:4:21

@JBlaschke
Contributor Author

@simonbyrne Docstring added

@simonbyrne
Member

Can you bump the patch version of MPIPreferences?

@JBlaschke
Contributor Author

This looks good -- @simonbyrne do you also want to bump the MPI.jl patch version?

@simonbyrne simonbyrne merged commit fd2c626 into JuliaParallel:master Jul 25, 2023
40 of 46 checks passed