@atomic is slow within AMDGPU.jl #569
This message appears after loading the specific ROCm module combination that exposes 5.3.3 instead of 5.2. According to LUMI support, there should not be any issue with this config, besides it possibly not being fully tested by HPE. I don't think this relates in any way to the slowdown you are experiencing with atomic operations in AMDGPU.jl. |
I will chip in on the issue. I have the same problem (tested on LUMI as well) with AMDGPU and atomics. If it helps, this is my MWE:

using ParallelStencil, StaticArrays, Atomix, BenchmarkTools
# (presumably also `using AMDGPU` for the use_AMD = true branch, which calls AMDGPU.@atomic)
const use_AMD = true
const TA = @static if use_AMD
    @init_parallel_stencil(AMDGPU, Float64, 2)
    macro myatomic(expr)
        return esc(quote
            AMDGPU.@atomic $expr
        end)
    end
    ParallelStencil.ROCArray
else
    @init_parallel_stencil(Threads, Float64, 2)
    macro myatomic(expr)
        return esc(quote
            Atomix.@atomic $expr
        end)
    end
    Array
end
function init_particles(nxcell, x, y, dx, dy, nx, ny)
    ncells = nx * ny
    np = nxcell * ncells
    pcoords = TA([SA[(rand()*0.9 + 0.05)*dx, (rand()*0.9 + 0.05)*dy] for _ in 1:np])
    parent_cell = TA([(0, 0) for _ in 1:np])
    nx2, ny2 = length(x)-1, length(y)-1
    LinInd = LinearIndices((1:nx2, 1:ny2))
    @parallel_indices (i, j) function fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)
        k = LinInd[i, j]
        x0, y0 = x[i], y[j] # lower-left corner of the cell
        for l in (1 + nxcell * (k-1)):(nxcell * k)
            pcoords[l] = @SVector [x0 + pcoords[l][1], y0 + pcoords[l][2]]
            parent_cell[l] = (i, j)
        end
        return nothing
    end
    @parallel (1:nx, 1:ny) fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)
    return pcoords, parent_cell
end
function foo!(F, Fp, buffer, parent_cell)
    np = length(Fp)
    @parallel (1:np) _foo!(F, Fp, buffer, parent_cell)
    return nothing
end

@parallel_indices (ipart) function _foo!(F, Fp, buffer, parent_cell)
    inode, jnode = parent_cell[ipart]
    # iterate over cells around ipart-th node
    for joffset in 0:1
        jvertex = joffset + jnode
        for ioffset in 0:1
            ivertex = ioffset + inode
            @myatomic F[ivertex, jvertex] += ω_i = rand() # a weight function is actually called here
            @myatomic buffer[ivertex, jvertex] += Fp[ipart] * ω_i
        end
    end
    return nothing
end
let
    np = 24
    n = 128
    nx = ny = n-1
    ni = nx, ny
    Lx = Ly = 1.0
    xvi = xv, yv = range(0, Lx, length=n), range(0, Ly, length=n)
    dxi = dx, dy = xv[2] - xv[1], yv[2] - yv[1]
    pcoords, parent_cell = init_particles(np, xvi..., dx, dy, nx, ny)
    F = @zeros(ni.+1...)
    buffer = similar(F)
    Fp = @rand(length(pcoords))
    @btime foo!($(F, Fp, buffer, parent_cell)...)
    # AMDGPU with atomics: 412.973 ms (45 allocations: 2.80 KiB)
    # AMDGPU without atomics: 82.009 μs (45 allocations: 2.80 KiB)
end |
I will note that you are using floating-point atomics while the OP uses integer atomics. There was a recommendation from LUMI to try out unsafe atomics. |
How is this option added to the Julia call? |
With @albert-de-montserrat we worked on an MWE comparing AMDGPU, ROCBackend() and HIP C++ performance on the following micro-kernel, which may be somewhat representative of what happens in @albert-de-montserrat's more complete example 👆

function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    AMDGPU.@atomic target1[i1] += v
    AMDGPU.@atomic target1[i2] += v
    AMDGPU.@atomic target1[i3] += v
    AMDGPU.@atomic target1[i4] += v
    AMDGPU.@atomic target2[i1] += v
    AMDGPU.@atomic target2[i2] += v
    AMDGPU.@atomic target2[i3] += v
    AMDGPU.@atomic target2[i4] += v
    return
end

The HIP C++ version would be

__global__ void atomic_kernel(DAT *_target1, DAT *_target2, DAT *_source, int *_indices, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int i1 = indices(i, 0);
    int i2 = indices(i, 1);
    int i3 = indices(i, 2);
    int i4 = indices(i, 3);
    DAT v = source(i);
    atomicAdd(&target1(i1), v);
    atomicAdd(&target1(i2), v);
    atomicAdd(&target1(i3), v);
    atomicAdd(&target1(i4), v);
    atomicAdd(&target2(i1), v);
    atomicAdd(&target2(i2), v);
    atomicAdd(&target2(i3), v);
    atomicAdd(&target2(i4), v);
}

Running these codes on LUMI (MI250x - ROCm 5.3.3) and a Radeon 7800XT (Navi3 - ROCm 5.6.1) gives timing tables for Float64 on the Radeon 7800 XT, on LUMI, and on LUMI with unsafe FP atomics; similar behaviour is observed for UInt64 on the same three configurations. Files can be found here: PTsolvers/HPCBenchmarks.jl#1 (CUDA version to be added soon) |
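For completeness, a rough sketch of how the Julia micro-kernel above could be launched and timed. This is not taken from the linked benchmark repository; the array names and sizes are made up, and the meaning of gridsize (workgroups vs. total work-items) has differed between AMDGPU.jl versions, so treat the launch configuration as an assumption:

using AMDGPU, BenchmarkTools

n       = 2^20                        # number of source values (divisible by the groupsize)
nverts  = 2^18                        # size of the accumulation targets
source  = ROCArray(rand(Float64, n))
target1 = ROCArray(zeros(Float64, nverts))
target2 = ROCArray(zeros(Float64, nverts))
indices = ROCArray(rand(Int32(1):Int32(nverts), n, 4))

groupsize = 256
gridsize  = cld(n, groupsize)         # assumed to count workgroups here

# launch once to compile, then time kernel + synchronization
@roc groupsize=groupsize gridsize=gridsize amd_atomic_add!(target1, target2, source, indices)
AMDGPU.synchronize()
@btime begin
    @roc groupsize=$groupsize gridsize=$gridsize amd_atomic_add!($target1, $target2, $source, $indices)
    AMDGPU.synchronize()
end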
Okay, so on LUMI the C++ code is already doing something quite different from Julia, even without unsafe atomics. Can you get the LLVM IR from hipcc on LUMI, with something like -emit-llvm -S --offload-device-only? |
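On the Julia side, the corresponding device IR can presumably be inspected with the reflection macro AMDGPU.jl re-exports from GPUCompiler. A sketch, reusing the arrays from the launch sketch above and assuming @device_code_llvm is available as in CUDA.jl:

# Dump the device-side LLVM IR generated for the Julia kernel, to compare
# against the .ll produced by hipcc.
AMDGPU.@device_code_llvm @roc groupsize=256 gridsize=cld(length(source), 256) amd_atomic_add!(target1, target2, source, indices)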
Getting the following warning and compilation failure when adding the suggested flags to https://github.com/PTsolvers/HPCBenchmarks.jl/blob/8b8488231a0d8279233d52866f549d2302e81b82/AMDGPU/atomic.jl#L91
|
That should be okay. I think you might also need |
Still getting the error in
run(`hipcc -emit-llvm -S --offload-device-only -c -xhip -munsafe-fp-atomics -O3 -o $libname --shared -fPIC atomic.cu`)
Libdl.dlopen("./$libname") do lib |
You don't want |
So we are just emitting... So the issue might be that we emit... |
If I remove it and change the
The "main" function of the external cpp code being |
But now you are missing |
Ok, got it now. Was confused. The generated out.ll is attached: out.ll.zip |
Can you also send me the one withtout
Here is the one without unsafe atomics: out_nounsafe.ll.zip |
Oh wow that is even more interesting.
That is easier. That means AMDGPU.jl is emitting atomics with a stronger ordering than needed. Try:
The only thing we don't handle there is |
This is consistent with
AMDGPU uses Atomix.jl under the hood. The |
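As a point of reference for the ordering question, plain Base atomics in Julia make the memory ordering explicit. A minimal CPU-side sketch with a made-up struct; whether and how the array-level @atomic in AMDGPU.jl/Atomix exposes the same ordering argument is exactly what is being discussed here:

# Julia's atomics take an explicit memory ordering; the default is the
# strongest one (:sequentially_consistent), while a plain accumulation only
# needs :monotonic (LLVM's relaxed/monotonic ordering).
mutable struct Counter
    @atomic x::Float64
end

c = Counter(0.0)
@atomic c.x += 1.0              # defaults to :sequentially_consistent
@atomic :monotonic c.x += 1.0   # relaxed ordering, enough for a pure accumulation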
Thanks - will check the benchmarks and report |
Adding the suggested change gives the following timings on LUMI (Float64): |
No effect/change on the Radeon 7800XT |
Is there a way to have this selected automatically for some use cases, or make it default? Any better strategy than the current one? |
I was thinking about this, but not really. The default is correct and needs to be consistent across architectures. Fundamentally the problem here is that we get lowered to a cmpxchg loop and so we might need to look at syncscope as well. |
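To illustrate what "lowered to a cmpxchg loop" means in practice, here is a CPU-side sketch in plain Julia of the compare-and-swap retry loop that a floating-point atomic add turns into when there is no native hardware atomic add. Illustrative only, with made-up names; this is not AMDGPU.jl's actual lowering:

mutable struct Acc
    @atomic x::Float64
end

# Emulate `acc.x += v` with a compare-and-swap loop: read the old value,
# compute the new one, and retry until no other thread modified x in between.
function cas_add!(acc::Acc, v::Float64)
    while true
        old = @atomic :monotonic acc.x
        r = @atomicreplace :monotonic acc.x old => old + v
        r.success && return old + v
    end
end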
Thanks @vchuravy ! This also speeds up my MWE by 400 times :) |
Thanks for the effort. My code speeds up, but is still 3 times slower than on an A100. |
Indeed, I also see a difference in the benchmarks that @luraess shared when running on a P100 (on Piz Daint):
"julia" => Trial(17.575 μs)
"reference" => Trial(9.815 μs)
"julia-ka" => Trial(25.806 μs) |
What are the timings without the flag? |
Those are without the unsafe flag; I don't know what the equivalent flag for CUDA is. I get the same timings with and without... |
Sorry, could we keep the conversation focused here? I assume that:
means there is a performance difference between the A100 and MI250X? |
I think with #604 we can close this. |
I have noticed that with KernelAbstractions the use of @atomic is very slow on AMDGPU compared to CUDA.
I have a test example at https://github.com/CliMA/CGDycore.jl/tree/main/TestAMD with results for a kernel with @atomic and the same kernel with the @atomic macros removed. You will find timing results in InfoCUDAAMD. There is an additional data file with the global indices for the atomic gather operation, which is the input for the testExample. Before running the code, you have to uncomment the appropriate backend.
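For readers without access to that repository, a minimal sketch of the kind of kernel being compared, i.e. an atomic gather where many work-items accumulate into the same output entries. The names and sizes are hypothetical, and it assumes KernelAbstractions re-exports @atomic from Atomix (otherwise add `using Atomix: @atomic`):

using KernelAbstractions

# Atomic gather: several work-items may hit the same out[inds[i]],
# so the accumulation must be atomic.
@kernel function gather_atomic!(out, vals, inds)
    i = @index(Global, Linear)
    @atomic out[inds[i]] += vals[i]
end

# Usage sketch on the CPU backend; swap in ROCBackend()/CUDABackend()
# (and device arrays) to reproduce the GPU comparison.
backend = CPU()
out  = zeros(10)
vals = rand(10_000)
inds = rand(1:10, 10_000)
gather_atomic!(backend, 64)(out, vals, inds; ndrange = length(vals))
KernelAbstractions.synchronize(backend)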
All tests were done on the LUMI supercomputer (https://lumi-supercomputer.eu/).
After login you get the following message:
rocm/5.3.3:
This module is not installed the way it is intended by AMD. Hence some parts of ROCm may
be broken or offer slower than expected performance and there is nothing the LUMI
User Support Team can do about that. As the module is also not an official HPE Cray
module there is no guarantee that it will integrate properly with HPE Cray PE modules
or work correctly in that environment. Note that HPE Cray PE programming environments
up to at least 23.02 have only been tested with ROCm version prior to 5.3 by HPE.