Implement ShflUp, ShflDown and ShflXor #1924

AuroraPerego · 2023-02-24T15:37:54Z

Following the logic of alpaka::shfl, I've implemented also the other three methods: alpaka::shfl_up, alpaka::shfl_down, alpaka::shfl_xor.
I've also added the tests for those methods.

I have two comments on that:

is there a reason for not having templated the alpaka::shfl method and having defined two different methods, one for std::int_32t and the other one for float? (btw I think that also unsigned, long and double should be added if we don't want to template the methods)
SYCL doesn't support a sub group (aka warp) size different from the width in those methods. For the moment I added an assert to check that, but maybe we should think of a way to determine the scrLane based on the parameters given in input to alpaka::shfl_up, alpaka::shfl_down, alpaka::shfl_xor and then use alpaka::shfl when width != sub_group_size.

Last thing, when running the tests I noticed that, when selecting the CUDA backend, sometimes the kernel is not executed (I put an assert(false) in a kernel and all the tests passed without crashing). I don't know why, maybe that could be discussed in a separated issue.

psychocoderHPC · 2023-03-03T16:10:24Z

is there a reason for not having templated the alpaka::shfl method and having defined two different methods,

I assume older CUDA supported shuffle only for these types and the current version is supporting any type

psychocoderHPC · 2023-03-03T16:17:43Z

SYCL doesn't support a sub group (aka warp) size different from the width in those methods. For the moment I added an assert to check that, but maybe we should think of a way to determine the scrLane based on the parameters given in input ...

IMO As you suggested emulating the shufl_down, ... for all sub groups different than the warp size based in the lane is the best way.

AuroraPerego · 2023-03-04T14:55:02Z

I assume older CUDA supported shuffle only for these types and the current version is supporting any type

Ok, is it ok if I replace the two existing methods with the template?

AuroraPerego · 2023-03-04T14:56:42Z

IMO As you suggested emulating the shufl_down, ... for all sub groups different than the warp size based in the lane is the best way.

I'll try to implement it then

psychocoderHPC · 2023-03-06T09:18:56Z

I assume older CUDA supported shuffle only for these types and the current version is supporting any type

Ok, is it ok if I replace the two existing methods with the template?

I would keep it like it is, it looks like AMD is still only shipping float and int signatures for this function.
ROCm/HIP@04f3e3e
If AMD is providing a generic function with a template feel free to change it.

fwyzard · 2023-03-06T10:27:39Z

Actually, I see the other types on a local ROCm installation:

$ grep -E '(\w+ +)+__shfl\>' -r /opt/rocm/include/
/opt/rocm/include/hip/amd_detail/amd_hip_cooperative_groups.h:    return __shfl(var, lane, WAVEFRONT_SIZE);
/opt/rocm/include/hip/amd_detail/amd_hip_cooperative_groups.h:    return __shfl(var, lane, WAVEFRONT_SIZE);
/opt/rocm/include/hip/amd_detail/amd_hip_cooperative_groups.h:    return __shfl(var, lane, WAVEFRONT_SIZE);
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:int __shfl(int var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:unsigned int __shfl(unsigned int var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:float __shfl(float var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:double __shfl(double var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:long __shfl(long var, int src_lane, int width = warpSize)
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:unsigned long __shfl(unsigned long var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:long long __shfl(long long var, int src_lane, int width = warpSize)
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:unsigned long long __shfl(unsigned long long var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/nvidia_detail/nvidia_hip_runtime_api.h:#define __shfl(...)      __shfl_sync(0xffffffff, __VA_ARGS__)
/opt/rocm/include/rocprim/intrinsics/warp_shuffle.hpp:            return __shfl(v, src_lane, width);

See also https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/include/hip/amd_detail/amd_warp_functions.h .

AuroraPerego · 2023-11-03T16:55:14Z

I've rebased it and I think it is ready for review

fwyzard · 2023-11-03T23:48:11Z

include/alpaka/warp/Traits.hpp

    {
        using ImplementationBase = concepts::ImplementationBase<ConceptWarp, TWarp>;
        return trait::Shfl<ImplementationBase>::shfl(warp, value, srcLane, width ? width : getSize(warp));
    }

-    //! shfl for float vals
+    //! Exchange data between threads within a warp.
+    //! It copy from a lane with lower ID relative to caller.


either

Suggested change

//! It copy from a lane with lower ID relative to caller.

//! It copies from a lane with lower ID relative to caller.

or

Suggested change

//! It copy from a lane with lower ID relative to caller.

//! Copy from a lane with lower ID relative to caller.

fwyzard · 2023-11-04T00:01:16Z

include/alpaka/warp/Traits.hpp

+    //!     __shared__ int32_t values[warpsize];
+    //!     values[threadIdx.x] = value;
+    //!     __syncthreads();
+    //!     return values[(-delta + width*floor(threadIdx.x/width))%width];


This formula seems wrong:

threadIdx.x/width will be an integer between 0 and blockDim.x / width

passing it through floor(...) is unnecessary, and will be the same value

width * floor(threadIdx.x/width) will be a multiple of width

(-delta + width*floor(threadIdx.x/width)) or (width*floor(threadIdx.x/width) - delta) will be a multiple of width, minus delta

then, (...) % width is equivalent to (width - delta) % width

which is the same value independently of threadIdx.x.

fwyzard · 2023-12-07T15:44:28Z

include/alpaka/warp/Traits.hpp

+    //!     __shared__ int32_t values[warpsize];
+    //!     values[threadIdx.x] = value;
+    //!     __syncthreads();
+    //!     return values[width*(threadIdx.x/width) + threadIdx.x%width - delta];


width*(threadIdx.x/width) + threadIdx.x%width - delta is equal to threadIdx.x - delta, so this is simply

Suggested change

//! return values[width*(threadIdx.x/width) + threadIdx.x%width - delta];

//! return values[threadIdx.x - delta];

What was it supposed to be ?

I think it should be something like

//! return (threadIdx.x % width >= delta) ? values[threadIdx.x - delta] : values[threadIdx.x];

fwyzard · 2023-12-07T17:22:38Z

include/alpaka/warp/WarpUniformCudaHipBuiltIn.hpp

                int srcLane,
-                std::int32_t width) -> float
+                std::int32_t width) -> T
            {
 #        if defined(ALPAKA_ACC_GPU_CUDA_ENABLED)
                return __shfl_sync(activemask(warp), val, srcLane, width);


I know that this was already there, but shouldn't this be

return __shfl_sync(0xffffffff, val, srcLane, width);

instead ?

Possible preceded by a __syncwarp() ?

This comes from #1273.
@psychocoderHPC what do you think ?

For example, HIP does

#if CUDA_VERSION >= CUDA_9000 #define __shfl(...) __shfl_sync(0xffffffff, __VA_ARGS__) #define __shfl_up(...) __shfl_up_sync(0xffffffff, __VA_ARGS__) #define __shfl_down(...) __shfl_down_sync(0xffffffff, __VA_ARGS__) #define __shfl_xor(...) __shfl_xor_sync(0xffffffff, __VA_ARGS__) #endif // CUDA_VERSION >= CUDA_9000

To be consistent with other backends may be return __shfl_sync(0xffffffff, val, srcLane, width); but I am not fully sure.

I'll make a PR for this.

fwyzard · 2023-12-09T17:21:57Z

@SimeonEhrig the OSX debug builds are failing with

Error: You are using macOS 11.
We (and Apple) do not provide support for this old version.

Is it expected ?

fwyzard · 2023-12-09T17:56:25Z