Utilizing Unified Memory for Oceananigans simulations #3592

BrodiePearson · 2024-05-08T20:58:32Z

Would it be helpful/appropriate to add some information about Unified Memory utilization in the Simulation Tips/GPU section? If so, I'd be happy to draft a short new bullet point for the "decrease memory use of your runs" section of the documentation.

I've recently run Oceananigans non-hydrostatic (3D isotropic) simulations utilizing Unified Memory on new NVIDIA hardware (GH200; Grace Hopper Superchips). Unified Memory allows bigger simulations on a single GPU (>1 billion grid points in this case) than the GPU's Device Memory (the on-GPU RAM we typically use), but it comes at the cost of a slower simulation due to interconnect bandwidth. Some preliminary simulations on the GH200 (see below) are 2-3x larger, but run up to 3x slower than expected from extrapolating Device Memory scalings.

As background:

The GH200 architecture has GPUs and CPUs integrated on a single chip, with 96GB of GPU Device Memory and 480GB of Unified Memory, which can be accessed quickly by both the CPUs and the GPU (albeit not as quickly as the GPU's own memory pool).
Having both unified and device memory seems to be the direction NVIDIA is moving (Grace Blackwell GB200, etc.), and is also the structure used by Apple Silicon (although as far as I understand it, CUDA does not work on Apple Silicon), so this functionality may become more generally useful soon.
Although I expect GPU parallelization is the route to the most complex simulations, Unified Memory utilization could help users who only have access to a single GPU or to GPU arrays with low-bandwidth interconnects.
To utilize Unified Memory, CUDA.jl has recently added functionality to change the default memory used for CUDA arrays etc. After lots of trial and error, I found that I just had to find my Julia environment's LocalPreferences.toml file and add the following lines to it:

[CUDA]
default_memory = "unified"
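
For anyone who would rather script this step, here is a minimal sketch that writes the same two lines with Julia's TOML standard library. It writes into a temporary directory so it is safe to run as-is. The assumptions: in a real setup the target directory is the one containing your active Project.toml (something like `dirname(Base.active_project())`), and existing entries in LocalPreferences.toml should be kept rather than overwritten. The helper name `write_unified_preference` is made up for illustration.

```julia
using TOML

# Hypothetical helper: set `default_memory = "unified"` under the [CUDA] table
# of the LocalPreferences.toml in `dir`, keeping any entries already present.
function write_unified_preference(dir::AbstractString)
    path = joinpath(dir, "LocalPreferences.toml")
    prefs = isfile(path) ? TOML.parsefile(path) : Dict{String, Any}()
    cuda_prefs = get!(prefs, "CUDA", Dict{String, Any}())
    cuda_prefs["default_memory"] = "unified"
    open(path, "w") do io
        TOML.print(io, prefs)
    end
    return path
end

# Demo in a throwaway directory (assumption: replace with your environment's directory).
demo_dir = mktempdir()
path = write_unified_preference(demo_dir)
print(read(path, String))
```

As far as I can tell the preference is read when CUDA.jl is loaded, so Julia probably needs a restart after the file changes.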

glwagner · 2024-05-08T21:38:50Z

That is really interesting!

> I found that I just had to find my Julia environment's LocalPreferences.toml file and add the following lines to it:

What specifically does that line do? Is there documentation for CUDA.jl that you can share? Also to clarify, if this is included, then the simulation will run more slowly in cases that don't require unified memory, is that right?

> is also the structure used by Apple Silicon (although as far as I understand it, CUDA does not work on Apple Silicon)

CUDA is specific to NVIDIA hardware. For Apple Silicon, we can use Metal.jl (which is supported by KernelAbstractions). There's been a little work to get this working (#3288), but it doesn't seem to be a priority for anyone. It could be an interesting side or undergraduate project, though. It's probably not too much work, but it does require a little spin-up to become familiar with writing Julia extensions and to figure out the necessary translations in the Oceananigans source code (which are not many).

It would be interesting to test the hydrostatic model in a configuration that doesn't have FFTs and to compare performance between unified memory and explicit parallelization. @simone-silvestri might be interested in that.

12 replies

glwagner May 9, 2024
Maintainer

I put together a repo to illustrate how to do this:

https://github.com/glwagner/ReallyBigSimulations/tree/main

BrodiePearson May 9, 2024
Collaborator Author

@glwagner Great, that repo is a nice concise example (although this_is_big.jl may be a misnomer haha).

Perhaps something to the effect of this could be added to the bullet points in this section of the documentation?

Some hardware includes a unified memory pool shared between both GPU(s) and CPUs, which can be larger than the GPU device memory that we typically use for single-GPU computations. If unified memory is available, you may be able to conduct a larger simulation than is possible with the device memory (for example, grids a bit larger than 1024 x 1024 x 1024 are possible on a Grace Hopper Superchip with 480GB unified memory). To access unified memory for an Oceananigans simulation, you can use a LocalPreferences.toml file as described in this example. It is not recommended to use unified memory for problems that could fit on device memory because the GPU and unified memory typically have a lower-bandwidth interconnect than the GPU and device memory.
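
To put rough numbers on the example above, here is a back-of-envelope sketch of how many full 3D arrays fit in each memory pool, using the 96GB and 480GB figures quoted earlier in this thread. The assumptions are mine: Float64 data, no halo points, and decimal gigabytes (1 GB = 10^9 bytes). The function names are only illustrative.

```julia
# Rough estimate: how many full 3D arrays fit in a given memory pool?
# Assumptions: Float64 data, no halo points, 1 GB = 10^9 bytes.
bytes_per_array(Nx, Ny, Nz; FT=Float64) = Nx * Ny * Nz * sizeof(FT)

arrays_that_fit(pool_GB, Nx, Ny, Nz; FT=Float64) =
    floor(Int, pool_GB * 1e9 / bytes_per_array(Nx, Ny, Nz; FT=FT))

N = 1024
println("one array: ", bytes_per_array(N, N, N) / 2^30, " GiB")
println("96 GB device memory:   ", arrays_that_fit(96, N, N, N), " arrays")
println("480 GB unified memory: ", arrays_that_fit(480, N, N, N), " arrays")
```

Under these assumptions one 1024 x 1024 x 1024 array takes 8 GiB, so about 11 fit in 96GB and about 55 in 480GB. How many arrays a model actually allocates depends on its configuration, so this is only a bound on what is possible.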

Also,

> I'm not sure that would affect us --- I think that we allocate all of our memory intentionally / in a function, and we don't rely on implicit allocation via operations like 2 * b. But it would be a good test that this is indeed true...

That's good to know. I knew that each array is initially allocated in the chosen memory, but I didn't realize that memory is allocated again when arrays are manipulated (in my example, c was initially a Unified Memory array, but overwriting it without specifying memory changed it to default_memory).

glwagner May 9, 2024
Maintainer

> (in my example, c was initially a Unified Memory array, but overwriting it without specifying memory changed it to default_memory).

Yes, but let me clarify. You didn't overwrite the array, you just overwrote the name c. More precisely, two things occur if you write c = 2 * a. First, 2 * a has to be computed and stored somewhere, so memory is allocated, and apparently it used the default memory. The next step, which really is a distinct step, is to make the name c point to that newly allocated memory. You could have also written c .= 2 .* a. But because c = a before this line, the result would actually be stored in the memory that a points to, which may not have been your intent. For example:

julia> a = rand(3, 3)
3×3 Matrix{Float64}:
 0.0646543  0.869536   0.485728
 0.929745   0.100563   0.34596
 0.720914   0.0197468  0.792766

julia> c = a
3×3 Matrix{Float64}:
 0.0646543  0.869536   0.485728
 0.929745   0.100563   0.34596
 0.720914   0.0197468  0.792766

julia> c .= 2 .* a
3×3 Matrix{Float64}:
 0.129309  1.73907    0.971457
 1.85949   0.201125   0.691919
 1.44183   0.0394937  1.58553

julia> a
3×3 Matrix{Float64}:
 0.129309  1.73907    0.971457
 1.85949   0.201125   0.691919
 1.44183   0.0394937  1.58553

Notice that a is now different. That's because c .= 2 .* a is exactly the same as a .= 2 .* a or a .*= 2.

To exert more control over memory allocation, you might try this pattern instead:

julia> a = rand(3, 3)
3×3 Matrix{Float64}:
 0.228287  0.263259  0.852691
 0.787519  0.697647  0.939554
 0.126515  0.684663  0.970198

julia> b = similar(a) # allocate a new, uninitialized array with the same type and size as a (contents are arbitrary until overwritten). deepcopy(a) probably works too
3×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> b .= 2 .* a # broadcast the operation 2 * a over all the elements of a and store in b
3×3 Matrix{Float64}:
 0.456573  0.526517  1.70538
 1.57504   1.39529   1.87911
 0.253029  1.36933   1.9404

Notice the dots (.): they are very important. And then if you try

julia> c = 2 .* a # create a new array by computing 2 * a and assign the name `c` to that array
3×3 Matrix{Float64}:
 0.456573  0.526517  1.70538
 1.57504   1.39529   1.87911
 0.253029  1.36933   1.9404

julia> b === c # b and c point to different underlying memory (even though they are numerically equal)
false
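
The patterns in this comment can be checked in one short script. This is only a CPU sketch with plain Arrays; that the same binding rules carry over to GPU arrays is what the discussion suggests, not something this snippet tests.

```julia
a = rand(3, 3)

# 1. Plain assignment binds a second name to the same array (no copy is made).
c = a
@assert c === a

# 2. In-place broadcast writes into existing memory; since c is a, this changes a.
a_before = copy(a)
c .= 2 .* a
@assert a == 2 .* a_before

# 3. Assignment without the dot on `=` allocates a new array and rebinds the name c.
c = 2 .* a
@assert c !== a

# 4. Preallocate with similar, then fill in place: b keeps its own memory.
b = similar(a)
b_id = objectid(b)
b .= 2 .* a
@assert objectid(b) == b_id   # still the same array object
@assert b == c && b !== c     # equal values, different memory
```

The design point is that `.=` reuses whatever memory the left-hand side already owns, while `=` only moves a name.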

BrodiePearson May 9, 2024
Collaborator Author

Thanks, that makes more sense now and I see what I was doing wrong - those operations, with appropriately placed dots and the similar function, maintain the unified memory location of the final array.

glwagner May 9, 2024
Maintainer

I think all memory allocation passes through this line in fact:

Oceananigans.jl/src/Grids/zeros_and_ones.jl

Line 7 in fb2c670

zeros(FT, ::GPU, N...) = CUDA.zeros(FT, N...)

which takes the architecture as an argument. So adding a memory type to GPU (for example Unified() or Device()) might really be all that's needed.
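
A rough sketch of what that dispatch pattern could look like, runnable on a CPU. Everything here is hypothetical: `MockGPU`, `Device`, `Unified`, and `allocate_zeros` are illustration names rather than Oceananigans or CUDA.jl API, and plain Arrays stand in for GPU arrays.

```julia
# Hypothetical memory tags and architecture type; none of these names are
# Oceananigans or CUDA.jl API. Plain Arrays stand in for GPU arrays.
abstract type AbstractMemory end
struct Device  <: AbstractMemory end
struct Unified <: AbstractMemory end

struct MockGPU{M<:AbstractMemory}
    memory::M
end
MockGPU() = MockGPU(Device())   # device memory stays the default

# One allocation entry point that dispatches on the memory tag. A real version
# would presumably construct a GPU array with the matching memory type here.
allocate_zeros(FT, ::MockGPU{Device},  N...) = (memory = :device,  data = zeros(FT, N...))
allocate_zeros(FT, ::MockGPU{Unified}, N...) = (memory = :unified, data = zeros(FT, N...))

u = allocate_zeros(Float64, MockGPU(Unified()), 4, 4, 2)
d = allocate_zeros(Float32, MockGPU(), 4, 4, 2)
println(u.memory, " ", size(u.data))
println(d.memory, " ", eltype(d.data))
```

Because the memory choice lives in the architecture's type, existing calls that pass a default GPU would keep using device memory, and only users who opt in would get unified allocations.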
