
Move MPI and CUDA to Julia extensions #75

Merged
merged 3 commits into main from gb/move_to_ext on Apr 29, 2024
Conversation

@Sbozzolo (Member) commented Apr 20, 2024

Every package that depends on ClimaComms depends on CUDA and MPI. As a result, users of downstream packages have to obtain CUDA and MPI even if they cannot, or do not want to, use either of the two.

This PR moves CUDA and MPI to extensions (see the Project.toml sketch below), opening the path to moving CUDA and MPI support in downstream packages to extensions too.
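For context, moving a dependency to an extension amounts to declaring it under `[weakdeps]` and `[extensions]` in Project.toml. A minimal sketch (the extension module names here are illustrative; the UUIDs are the registered ones for CUDA.jl and MPI.jl):

```toml
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"

[extensions]
ClimaCommsCUDAExt = "CUDA"
ClimaCommsMPIExt = "MPI"
```

With this, Pkg loads each extension module automatically once the corresponding weak dependency is loaded into the session.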

The benefits of using extensions are reduced instantiation and load times (e.g., in every GitHub Actions runner). Extensions also establish a pattern for supporting different backends/accelerators in a more modular way, instead of entangling everything together. The downside is that extensions are not feature-complete, and there are things that cannot be done at the moment. One example is defining new types in the main module (without `eval`ing). This is manifest in this PR, where I pushed the definitions of MPISendRecvGraphContext and MPIPersistentSendRecvGraphContext to the extension, as sketched below. A second downside is that extensions are a little brittle right now; for example, there can be errors in precompilation. That said, supporting backends like CUDA is precisely what extensions were designed for.
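To make the type limitation concrete, here is a minimal sketch of the extension layout (not the actual extension code from this PR; only the two type names above come from the PR, and all fields are elided):

```julia
# ext/ClimaCommsMPIExt.jl -- minimal sketch, not the actual code.
module ClimaCommsMPIExt

import MPI
import ClimaComms

# An extension cannot add new types to the parent module (ClimaComms)
# without `eval`ing into it, so the types are defined here instead:
struct MPISendRecvGraphContext
    # fields elided in this sketch
end

struct MPIPersistentSendRecvGraphContext
    # fields elided in this sketch
end

end # module
```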

Closes #74
Closes #4
Closes #69

@Sbozzolo force-pushed the gb/move_to_ext branch 11 times, most recently from 3204aa9 to 3bf2e6f on April 25, 2024 19:18
@charleskawczynski (Member) left a comment:

Looks great!

@Sbozzolo force-pushed the gb/move_to_ext branch 2 times, most recently from c95e994 to 0a816f7 on April 26, 2024 15:56
@Sbozzolo marked this pull request as ready for review on April 26, 2024 16:00
@Sbozzolo (Member, Author) commented Apr 26, 2024

I added a second commit to try to output more informative log/error messages. @charleskawczynski could you have a second look?
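Something along these lines (a hypothetical sketch to illustrate the idea; `require_mpi_extension` and the message text are made up, not the code in this commit):

```julia
import ClimaComms

# Hypothetical helper: fail with an actionable message when the MPI
# extension has not been loaded. Not the actual ClimaComms code.
function require_mpi_extension()
    ext = Base.get_extension(ClimaComms, :ClimaCommsMPIExt)
    isnothing(ext) && error(
        "An MPI context was requested, but the MPI extension is not loaded. " *
        "Load MPI first (e.g., `import MPI`) and try again.",
    )
    return ext
end
```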

I also tried testing this in ClimaCore/ClimaAtmos, but it is a big pain. ClimaComms has tendrils everywhere, and bumping to 0.6 makes updating really tricky because of circular dependencies among the test environments.

To run with ClimaAtmos, I had to check out all the branches and dev this way:

(examples) pkg> dev ../ClimaComms.jl/ ../ClimaUtilities.jl/ ../RRTMGP.jl/ ../ClimaCore.jl/ ../ClimaTimeSteppers.jl/

I also had to comment out ClimaCoreTempestRemap.

I couldn't get this to work on Buildkite, because I need to dev everything at the same time and I couldn't make that work there.

I tried with ClimaAtmos on 2 GPUs on the cluster with:

CLIMACOMMS_DEVICE="CUDA" CLIMACOMMS_CONTEXT="MPI" $MPITRAMPOLINE_MPIEXEC -n 2 julia --project=examples examples/hybrid/driver.jl --config_file config/mpi_configs/mpi_make_restart.yml

And this works (after adjusting one thing for the new remapper).
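(For reference, the env-var-driven selection boils down to something like the following sketch; the types and logic here are illustrative, not the actual ClimaComms implementation:)

```julia
# Illustrative sketch of how an env var like CLIMACOMMS_DEVICE can select
# a backend at runtime; not the actual ClimaComms implementation.
abstract type AbstractDevice end
struct CPUDevice <: AbstractDevice end
struct CUDADevice <: AbstractDevice end

function device_from_env()
    name = get(ENV, "CLIMACOMMS_DEVICE", "CPU")
    name == "CUDA" && return CUDADevice()
    name == "CPU" && return CPUDevice()
    error("Unknown CLIMACOMMS_DEVICE: $name")
end
```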

I couldn't perform a more comprehensive test, but things are looking good from the interactive tests I've performed. So, I think we should release ClimaComms 0.6 and move the various other packages one by one. For ClimaUtilities, ClimaTimeSteppers, RRTMGP, and ClimaCoreTempestRemap, the update is easy and non-breaking, so we should move them first. Then, we can probably move to ClimaCore and release 0.14, and re-update ClimaUtilities, ClimaTimeSteppers, RRTMGP, and ClimaCoreTempestRemap to support ClimaCore 0.14. Finally, we can update ClimaCoupler, ClimaAtmos, and ClimaLand to ClimaCore 0.14 and ClimaComms 0.6.

Commit message: "Mostly so that we can print a log message informing the user we loaded MPI/CUDA"
@Sbozzolo merged commit a22e4ac into main on Apr 29, 2024. 6 checks passed.