Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@atomic is slow within AMDGPU.jl #569

Closed
OsKnoth opened this issue Dec 12, 2023 · 30 comments
Closed

@atomic is slow within AMDGPU.jl #569

OsKnoth opened this issue Dec 12, 2023 · 30 comments

Comments

@OsKnoth
Copy link

OsKnoth commented Dec 12, 2023

I have noticed that with KernelAbstractions the use of @atomic is very slow on AMDGPU compared to CUDA.

I have a test example at https://github.com/CliMA/CGDycore.jl/tree/main/TestAMD with results for a kernel with @atomic and the same kernel with removed @atomic macros. You will find timing results in InfoCUDAAMD. There is an additional data file with the global indices for the atomic gather operation which is input for the testExample. Before running the code you have to comment out the right backend.

All test are done on the LUMI supercomputer (https://lumi-supercomputer.eu/)

After login you get the following message:
rocm/5.3.3:
This module is not installed the way it is intended by AMD. Hence some parts of ROCm may
be broken or offer slower than expected performance and there is nothing the LUMI
User Support Team can do about that. As the module is also not an official HPE Cray
module there is no guarantee that it will integrate properly with HPE Cray PE modules
or work correctly in that environment. Note that HPE Cray PE programming environments
up to at least 23.02 have only been tested with ROCm version prior to 5.3 by HPE.

@luraess
Copy link
Contributor

luraess commented Dec 12, 2023

After login you get the following message:

This message you are getting after loading the specific ROCm module combination to have 5.3.3 instead of 5.2 exposed. From LUMI support, there should not be any issue with this config, besides it may not be fully tested by HPE. I don't think this relates anyhow to the slowdown you are experiencing using atomic operations in AMDGPU.jl.

@albert-de-montserrat
Copy link

I will chip in the issue. I have the same problem (tested on Lumi as well) with AMDGPU and atomics. If it helps, this is my MWE using ParallelStencil.jl instead of KernelAbstractions.jl

using ParallelStencil, StaticArrays, Atomix, BenchmarkTools
const use_AMD = true

const TA = @static if use_AMD 
    @init_parallel_stencil(AMDGPU, Float64, 2)

    macro myatomic(expr)
        return esc(quote
            AMDGPU.@atomic $expr
        end)
    end

    ParallelStencil.ROCArray
else
    @init_parallel_stencil(Threads, Float64, 2)

    macro myatomic(expr)
        return esc(quote
            Atomix.@atomic $expr
        end)
    end

    Array
end

function init_particles(nxcell, x, y, dx, dy, nx, ny)
    ncells = nx * ny
    np = nxcell * ncells
    pcoords = TA([SA[(rand()*0.9 + 0.05)*dx, (rand()*0.9 + 0.05)*dy] for _ in 1:np])
    parent_cell = TA([(0, 0) for _ in 1:np])
    
    nx2, ny2 = length(x)-1, length(y)-1
    LinInd = LinearIndices((1:nx2, 1:ny2))

    @parallel_indices (i, j) function fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)
        k = LinInd[i, j] 
        x0, y0 = x[i], y[j] # lower-left corner of the cell
        for l in (1 + nxcell * (k-1)):(nxcell * k)
            pcoords[l] = @SVector [x0 + pcoords[l][1], y0 + pcoords[l][2]]
            parent_cell[l] = (i, j)
        end
        return nothing
    end

    @parallel (1:nx, 1:ny) fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)

    return pcoords, parent_cell
end

function foo!(F, Fp, buffer, parent_cell)
    np = length(Fp)
    @parallel (1:np) _foo!(F, Fp, buffer, parent_cell)
    return nothing
end

@parallel_indices (ipart) function _foo!(F, Fp, buffer, parent_cell)
    inode, jnode = parent_cell[ipart]
    # iterate over cells around ipart-th node
    for joffset in 0:1
        jvertex = joffset + jnode
        for ioffset in 0:1
            ivertex = ioffset + inode
            @myatomic F[ivertex, jvertex] += ω_i = rand() # a weight function is actually called here
            @myatomic buffer[ivertex, jvertex] += Fp[ipart] * ω_i 
        end
    end

    return nothing
end

let 
    np = 24
    n = 128
    nx = ny = n-1
    ni = nx , ny
    Lx = Ly = 1.0
    xvi = xv, yv = range(0, Lx, length=n), range(0, Ly, length=n)
    dxi = dx, dy = xv[2] - xv[1], yv[2] - yv[1]
    
    pcoords, parent_cell = init_particles(np, xvi..., dx, dy, nx, ny)

    F = @zeros(ni.+1...)
    buffer = similar(F)
    Fp = @rand(length(pcoords))

    @btime foo!($(F, Fp, buffer, parent_cell)...)
    # AMDGPU with atomics:    412.973 ms (45 allocations: 2.80 KiB)
    # AMDGPU without atomics:  82.009 μs (45 allocations: 2.80 KiB)
end

@vchuravy
Copy link
Member

I will note that you are using floating point atomics and the OP uses integer atomics. There was an recommendation from LUMI to try out unsafe-atomics.

E.g. https://reviews.llvm.org/D91546?id=305522

@OsKnoth
Copy link
Author

OsKnoth commented Dec 14, 2023

How is this option added to the Julia call?

@luraess
Copy link
Contributor

luraess commented Dec 15, 2023

With @albert-de-montserrat we worked on a MWE comparing AMDGPU, ROCBackend() and HIP C++ perf on the following micro kernel which may be somewhat representative of what happens in @albert-de-montserrat more complete example 👆

function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    AMDGPU.@atomic target1[i1] += v
    AMDGPU.@atomic target1[i2] += v
    AMDGPU.@atomic target1[i3] += v
    AMDGPU.@atomic target1[i4] += v
    AMDGPU.@atomic target2[i1] += v
    AMDGPU.@atomic target2[i2] += v
    AMDGPU.@atomic target2[i3] += v
    AMDGPU.@atomic target2[i4] += v
    return
end

the HIP C++ version would be

__global__ void atomic_kernel(DAT *_target1, DAT *_target2, DAT *_source, int *_indices, const int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int i1 = indices(i, 0);
  int i2 = indices(i, 1);
  int i3 = indices(i, 2);
  int i4 = indices(i, 3);
  DAT v = source(i);
  atomicAdd(&target1(i1), v);
  atomicAdd(&target1(i2), v);
  atomicAdd(&target1(i3), v);
  atomicAdd(&target1(i4), v);
  atomicAdd(&target2(i1), v);
  atomicAdd(&target2(i2), v);
  atomicAdd(&target2(i3), v);
  atomicAdd(&target2(i4), v);
}

Running these codes on LUMI (MI250x - ROCm 5.3.3) and a Radeon 7800XT (Navi3 - ROCm 5.6.1) gives following timings:

Float64

Radeon 7800 XT

  "julia" => Trial(43.370 μs)
  "reference" => Trial(38.450 μs)
  "julia-ka" => Trial(45.190 μs)

On LUMI

  "julia" => Trial(345.640 μs)
  "reference" => Trial(49.636 μs)
  "julia-ka" => Trial(350.620 μs)

On LUMI with -xhip -munsafe-fp-atomics (for C++ code only)

  "julia" => Trial(351.731 μs)
  "reference" => Trial(12.635 μs)
  "julia-ka" => Trial(357.743 μs)

Similar behaviour is observed for Float32

UInt64

Radeon 7800

  "julia" => Trial(23.400 μs)
  "reference" => Trial(18.530 μs)
  "julia-ka" => Trial(25.320 μs)

LUMI

  "julia" => Trial(39.395 μs)
  "reference" => Trial(12.714 μs)
  "julia-ka" => Trial(42.992 μs)

LUMI with -xhip -munsafe-fp-atomics

  "julia" => Trial(39.085 μs)
  "reference" => Trial(12.624 μs)
  "julia-ka" => Trial(42.531 μs)

Similar behaviour is observed for UInt32

Files can be found here PTsolvers/HPCBenchmarks.jl#1 (CUDA version added soon)

@vchuravy
Copy link
Member

Okay so on LUMI the C++ code is already doing something quite different from Julia even without unsafe-fp-atomics

Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only ?
Also @device_code dir="ir" for Julia on LUMI would be helpful.

@luraess
Copy link
Contributor

luraess commented Dec 15, 2023

Okay so on LUMI the C++ code is already doing something quite different from Julia even without unsafe-fp-atomics

Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only ? Also @device_code dir="ir" for Julia on LUMI would be helpful.

Getting following warning and compilation failure when adding the suggested flags to https://github.com/PTsolvers/HPCBenchmarks.jl/blob/8b8488231a0d8279233d52866f549d2302e81b82/AMDGPU/atomic.jl#L91

clang-15: warning: -lgcc_s: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lgcc: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lpthread: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lm: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lrt: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,--enable-new-dtags: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,-rpath=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/hip/lib:/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/lib: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lamdhip64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lclang_rt.builtins-x86_64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: argument unused during compilation: '--shared' [-Wunused-command-line-argument]

@vchuravy
Copy link
Member

That should be okay. I think you might also need -c since you don't want to link.

@albert-de-montserrat
Copy link

The --offload-device-only flag still doesn't work for me

clang-14: error: unsupported option '--offload-device-only'

The @device_code for Julia is here https://gist.github.com/albert-de-montserrat/ab4a66b1e5a1b547673041e4c5f190e2

@luraess
Copy link
Contributor

luraess commented Dec 15, 2023

Still getting the error

ERROR: LoadError: could not load library "./atomic.so"
./atomic.so: invalid ELF header

in

run(`hipcc -emit-llvm -S --offload-device-only -c  -xhip -munsafe-fp-atomics -O3 -o $libname --shared -fPIC atomic.cu`)

Libdl.dlopen("./$libname") do lib

@vchuravy
Copy link
Member

You don't want --shared -fIC since your -o should be a .ll

@vchuravy
Copy link
Member

So we are just emitting atomicrmw fadd double addrspace(1)* %73, double %51 seq_cst, align 8, !dbg !374

So the issue might be that we emit seq_cst instead of acq_rel

@luraess
Copy link
Contributor

luraess commented Dec 15, 2023

You don't want --shared -fIC since your -o should be a .ll, hipcc -xhip -munsafe-fp-atomics -O3 -o out.ll atomic.cu

If I remove it and change the -o from .so to .ll I am then getting

ld.lld: error: undefined symbol: main
>>> referenced by start.S:104 (../sysdeps/x86_64/start.S:104)
>>>               /usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/crt1.o:(_start)
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
ERROR: LoadError: failed process: Process(`hipcc -xhip -munsafe-fp-atomics -O3 -o atomic.ll atomic.cu`, ProcessExited(1)) [1]

The "main" function of the external cpp code being extern "C" EXPORT_API void run_benchmark(){...}

@vchuravy
Copy link
Member

But now you are missing -emit-llvm -S --offload-device-only -c?

@luraess
Copy link
Contributor

luraess commented Dec 15, 2023

But now you are missing -emit-llvm -S --offload-device-only -c?

Ok, got it now. Was confused. The generated out.ll is out.ll.zip

@vchuravy
Copy link
Member

Can you also send me the one withtout unsafe-fp-atomics, but this is already promising:

  %41 = load i32, i32 addrspace(1)* %23, align 4, !tbaa !11, !amdgpu.noclobber !5
  %42 = load i32, i32 addrspace(1)* %27, align 4, !tbaa !11, !amdgpu.noclobber !5
  %43 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %37, double %29) #4
  %44 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %40, double %29) #4

@luraess
Copy link
Contributor

luraess commented Dec 15, 2023

Here without unsafe atomics out_nounsafe.ll.zip

@vchuravy
Copy link
Member

Oh wow that is even more interesting.

  %41 = atomicrmw fadd double addrspace(1)* %36, double %29 syncscope("agent-one-as") monotonic, align 8
  %42 = atomicrmw fadd double addrspace(1)* %38, double %29 syncscope("agent-one-as") monotonic, align 8

That is easier. That means AMDGPU.jl is emiting atomics with a stronger ordering than needed.

Try:

using Atomix: @atomic, @atomicswap, @atomicreplace


function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    @atomic :montonic target1[i1] += v
    @atomic :montonic target1[i2] += v
    @atomic :montonic target1[i3] += v
    @atomic :montonic target1[i4] += v
    @atomic :montonic target2[i1] += v
    @atomic :montonic target2[i2] += v
    @atomic :montonic target2[i3] += v
    @atomic :montonic target2[i4] += v
    return
end

The only thing we don't handle there is syncscope so we might need to add that.
(Atomix is the device portable interface for atomic arrays accesses, which is why I recommend it for KA)

@vchuravy
Copy link
Member

This is consistent with Base.@atomic:

help?> @atomic
  @atomic var
  @atomic order ex

  Mark var or ex as being performed atomically, if ex is a supported expression. If no
  order is specified it defaults to :sequentially_consistent.

AMDGPU uses Atomix.jl under the hood.

The unsafe-fp-atomics may be something to explore, but I would bet that biggest issue is the different default choice for atomic order.

@luraess
Copy link
Contributor

luraess commented Dec 18, 2023

Thanks - will check the benchmarks and report

@luraess
Copy link
Contributor

luraess commented Dec 18, 2023

Adding the :monotonic keyword to the @atomic macro in Atomix (used in KA lernel) and AMDGPU, results in following results (NOT using unsafe-fp-atomics in the C++ reference):

On LUMI (Float64)

  "julia" => Trial(53.444 μs)
  "reference" => Trial(49.606 μs)
  "julia-ka" => Trial(58.644 μs)

No effect/change on the Radeon 7800XT

@luraess
Copy link
Contributor

luraess commented Dec 18, 2023

Is there a way to have this selected automatically for some use cases, or make it default? Any better strategy than the current one?

@vchuravy
Copy link
Member

Is there a way to have this selected automatically for some use cases, or make it default? Any better strategy than the current one?

I was thinking about this, but not really. The default is correct and needs to be consistent across architectures.
The docs could probably be improved to call this out.

Fundamentally the problem here is that we get lowered to a cmpxchg loop and so we might need to look at syncscope as well.

@albert-de-montserrat
Copy link

Thanks @vchuravy ! This also speeds up my MWE by 400 times :)

@OsKnoth
Copy link
Author

OsKnoth commented Dec 18, 2023

Thanks for the effort. My code speeds up, but is yet 3 times slower than on an A100.

@albert-de-montserrat
Copy link

Indeed I also a difference in the benchmarks that @luraess shared running on a P100 (on Piz Daint):

  "julia" => Trial(17.575 μs)
  "reference" => Trial(9.815 μs)
  "julia-ka" => Trial(25.806 μs)

@luraess
Copy link
Contributor

luraess commented Dec 19, 2023

Indeed I also a difference in the benchmarks that @luraess shared running on a P100 (on Piz Daint):

  "julia" => Trial(17.575 μs)

  "reference" => Trial(9.815 μs)

  "julia-ka" => Trial(25.806 μs)

What are the timings without the flag?

@albert-de-montserrat
Copy link

albert-de-montserrat commented Dec 19, 2023

Those are without the unsafe flag, I don't know what is the equivalent flag for CUDA. I get the same timings with and without :monotonic here.

@vchuravy
Copy link
Member

Sorry could we keep the conversation focused here?
If there is an issue with atomic operations in CUDA.jl please open an issue there.

I assume that:

My code speeds up, but is yet 3 times slower than on an A100.

Means there is a performance between A100 and MI250X?
It is hard to tell what the cause for this is from afar. The first thing I would try is to write the code in HIP and make sure that the performance different is apparent there also, if it isn't then we can investigate further.

@pxl-th
Copy link
Member

pxl-th commented Feb 29, 2024

I think with #604 we can close this.
Feel free to re-open though, if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants