Replies: 11 comments 25 replies
-
@iuryt which Linux architecture is your cluster running, out of curiosity?
-
I also just happened across this tool: https://github.com/johnnychen94/jill.py which might be useful to some people. A few notes about determining which binary to download: issuing the
whereas on a Power 9 system I obtain
(more on that later). For ARM, a StackOverflow post might help (and here's a similar post for x86). There are also options for glibc and musl libc; they say that most users should use glibc (more about musl libc here). I don't know what GPG means, so hopefully someone can chime in there. Finally, a screenshot for reference:
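The architecture check alluded to above is presumably `uname -m` (an assumption; the original command and its output were not captured). A minimal sketch of mapping its output to the corresponding Julia download label (the helper name and label strings are illustrative, modeled on the names used on the julialang.org downloads page):

```shell
# Hypothetical helper: map `uname -m` output to a Julia binary
# architecture label. `x86_64` is typical of Intel/AMD clusters,
# `ppc64le` of Power 9 systems such as Satori.
julia_arch_label() {
  case "$1" in
    x86_64)  echo "linux-x86_64" ;;
    aarch64) echo "linux-aarch64" ;;
    ppc64le) echo "linux-ppc64le" ;;
    *)       echo "unknown"; return 1 ;;
  esac
}

# Example: julia_arch_label "$(uname -m)"
julia_arch_label x86_64    # prints "linux-x86_64"
```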
-
I'm not sure... is the issue installing Julia, or installing Oceananigans? How would Docker help?
-
@glwagner GPG is just a signature for the file. It provides a check that the download isn't malicious software masquerading as an official release. @iuryt I think a challenge you have is that the system you are running on has RHEL 6.1, which is quite old. Docker/Singularity could help a bit, but the GPU piece is awkward: even with Docker/Singularity, the GPU piece needs up-to-date underlying OS drivers. Perhaps you could ping your sysadmin folks and find out if they have plans to upgrade? From https://access.redhat.com/support/policy/updates/errata it looks like RHEL 6 last received full support about six years ago.
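To make the signature point concrete: the actual GPG flow verifies a detached signature with `gpg --verify <file>.asc <file>`, which needs the publisher's key and a network download to demonstrate. A simpler integrity check in the same spirit, shown here with a locally created stand-in file (for a real Julia tarball one would compare against the checksum published alongside the release):

```shell
# Sketch of a download-integrity check using sha256sum.
# The file here is a local stand-in, not a real Julia tarball.
echo "pretend this is julia.tar.gz" > julia-demo.tar.gz

# Record the checksum (a release page would publish this for you) ...
sha256sum julia-demo.tar.gz > julia-demo.tar.gz.sha256

# ... and verify the file against it after "downloading".
sha256sum -c julia-demo.tar.gz.sha256 && echo "checksum OK"
```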
-
I could set up Julia 1.6.5 on the local UMassD cluster, but while importing Oceananigans I received this message:
What is weird is that I am currently using
-
Another tip is to use interactive Slurm sessions to be able to use Julia interactively (thus solving issues with compile time, especially if the source code changes and
For this I use the command
which I put an alias for in my
then typing
at the terminal requests an interactive session on one node with 4 GPUs (4 can be increased to the number of GPUs available per node on the given cluster). I can then use tmux to open 4 panes, each with its own environment variable
To change the time requested for the interactive job, change
Once the interactive job has been allocated, additional terminals may be opened on the node by typing
(at least, this works on the clusters I work with --- if a different
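The poster's actual alias was not captured; a hypothetical sketch of what such a request might look like, built as a function so the command string is easy to inspect (`--nodes`, `--gres`, and `--time` are standard Slurm flags; the helper name and defaults are made up):

```shell
# Hypothetical helper: build the salloc command for an interactive
# session on one node with N GPUs and a given walltime.
interactive_gpu_cmd() {
  local ngpus="${1:-4}" walltime="${2:-02:00:00}"
  echo "salloc --nodes=1 --gres=gpu:${ngpus} --time=${walltime}"
}

interactive_gpu_cmd 4 01:00:00
# One might then run:   eval "$(interactive_gpu_cmd 4 01:00:00)"
# and, in each tmux pane, pin one GPU per Julia session, e.g.:
#   export CUDA_VISIBLE_DEVICES=0   # 1, 2, 3 in the other panes
```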
-
I am facing similar challenges to @iuryt in trying to run on Satori or Stampede2 (CPU only). I have seen discussions about using both HPC systems here and there, but they all date back a while. Curious to hear if anyone has been successful at running Oceananigans on either recently!
-
@raphaelouillon and @iuryt this worked (for me):
I seem to have
-
Note: Julia does not ship ppc64le binaries (Satori) and possibly has limited KNL support (Stampede2, I think). For Satori, compiling from source (https://github.com/JuliaLang/julia/releases/download/v1.7.2/julia-1.7.2.tar.gz) seemed to work?
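For reference, the from-source route is roughly: download and unpack the tarball above, then run `make -j"$(nproc)"` in the source root (this can take a long time on Power 9). The build can optionally be customized through a `Make.user` file; a minimal sketch, where the install prefix is an assumption:

```make
# Make.user — optional build configuration, placed in the julia-1.7.2
# source directory before running `make`.

# Install location (an assumption; adjust to taste):
prefix=$(HOME)/opt/julia-1.7.2

# Use prebuilt dependency binaries where available:
USE_BINARYBUILDER=1
```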
-
@iuryt @glwagner @christophernhill, building Julia 1.7.2 from source did it for me (for some reason I had issues building 1.6.6). Thanks all!
-
@glwagner @christophernhill apologies if this is readily addressed in the documentation, but I couldn't find the info: when running on several GPUs, how does the memory usage per GPU change? Does it decrease more or less linearly with the number of GPUs? I am looking at running simulations with 10⁸ grid cells or more, and memory seems to be the limiting factor on GPU (at least on the V100 with 32 GB of memory). This also got me wondering whether anyone has tried running on the M1 Ultra architecture with 128 GB of unified memory (I saw that there was a recent discussion on this). Am I missing something, or would that give the M1 Ultra more memory than an Nvidia A100? I also imagine that it would be much slower, but if memory is the bottleneck, it is still potentially interesting to try.
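A crude back-of-envelope estimate for the scaling question, under loudly labeled assumptions: each grid cell stores some number of Float64 (8-byte) values across all fields (velocities, tracers, tendencies, etc. — the count of ~20 below is a guess, not an Oceananigans number), and a multi-GPU decomposition splits cells roughly evenly, so per-GPU memory decreases near-linearly, up to halo-exchange overhead:

```shell
# Rough per-GPU memory estimate in whole GiB (integer arithmetic).
# Assumptions: 8 bytes per value, nfields values per cell, an even
# split of cells across GPUs, and no halo/temporary overhead.
mem_per_gpu_gib() {
  local cells=$1 nfields=$2 ngpus=$3
  echo $(( cells * nfields * 8 / ngpus / 1024 / 1024 / 1024 ))
}

mem_per_gpu_gib 100000000 20 1   # prints 14 (whole GiB on a single GPU)
mem_per_gpu_gib 100000000 20 4   # prints 3  (whole GiB per GPU across four)
```

Under these assumptions a 10⁸-cell run is not far from a single 32 GB V100's capacity, and splitting across GPUs buys headroom roughly linearly; actual usage will be higher because of halos and temporaries.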
-
Hi,
I know that it might seem too general, but I still think it's worth having a discussion about the "best" ways to run this beautiful model on HPCs.
I have been struggling to do so for some of the following reasons:
If we want to make it easier for people to use, is Docker an interesting option for Oceananigans?
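For what a Docker route could look like, here is a minimal sketch, assuming the official `julia` Docker image and a CPU-only setup (this is not an official Oceananigans image; GPU use would additionally require the NVIDIA container toolkit and, as noted elsewhere in this thread, sufficiently recent host drivers):

```dockerfile
# Minimal sketch: CPU-only Julia with Oceananigans pre-installed,
# based on the official julia image.
FROM julia:1.7
RUN julia -e 'using Pkg; Pkg.add("Oceananigans"); Pkg.precompile()'
CMD ["julia"]
```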