Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ShflUp, ShflDown and ShflXor #1924

Merged
merged 3 commits into from
Dec 12, 2023

Conversation

AuroraPerego
Copy link
Contributor

Following the logic of alpaka::shfl, I've implemented also the other three methods: alpaka::shfl_up, alpaka::shfl_down, alpaka::shfl_xor.
I've also added the tests for those methods.

I have two comments on that:

  • is there a reason for not having templated the alpaka::shfl method and having defined two different methods, one for std::int_32t and the other one for float? (btw I think that also unsigned, long and double should be added if we don't want to template the methods)
  • SYCL doesn't support a sub group (aka warp) size different from the width in those methods. For the moment I added an assert to check that, but maybe we should think of a way to determine the scrLane based on the parameters given in input to alpaka::shfl_up, alpaka::shfl_down, alpaka::shfl_xor and then use alpaka::shfl when width != sub_group_size.

Last thing, when running the tests I noticed that, when selecting the CUDA backend, sometimes the kernel is not executed (I put an assert(false) in a kernel and all the tests passed without crashing). I don't know why, maybe that could be discussed in a separated issue.

@psychocoderHPC
Copy link
Member

is there a reason for not having templated the alpaka::shfl method and having defined two different methods,

I assume older CUDA supported shuffle only for these types and the current version is supporting any type

@psychocoderHPC
Copy link
Member

SYCL doesn't support a sub group (aka warp) size different from the width in those methods. For the moment I added an assert to check that, but maybe we should think of a way to determine the scrLane based on the parameters given in input ...

IMO As you suggested emulating the shufl_down, ... for all sub groups different than the warp size based in the lane is the best way.

@AuroraPerego
Copy link
Contributor Author

I assume older CUDA supported shuffle only for these types and the current version is supporting any type

Ok, is it ok if I replace the two existing methods with the template?

@AuroraPerego
Copy link
Contributor Author

IMO As you suggested emulating the shufl_down, ... for all sub groups different than the warp size based in the lane is the best way.

I'll try to implement it then

@psychocoderHPC
Copy link
Member

I assume older CUDA supported shuffle only for these types and the current version is supporting any type

Ok, is it ok if I replace the two existing methods with the template?

I would keep it like it is, it looks like AMD is still only shipping float and int signatures for this function.
ROCm/HIP@04f3e3e
If AMD is providing a generic function with a template feel free to change it.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 6, 2023

Actually, I see the other types on a local ROCm installation:

$ grep -E '(\w+ +)+__shfl\>' -r /opt/rocm/include/
/opt/rocm/include/hip/amd_detail/amd_hip_cooperative_groups.h:    return __shfl(var, lane, WAVEFRONT_SIZE);
/opt/rocm/include/hip/amd_detail/amd_hip_cooperative_groups.h:    return __shfl(var, lane, WAVEFRONT_SIZE);
/opt/rocm/include/hip/amd_detail/amd_hip_cooperative_groups.h:    return __shfl(var, lane, WAVEFRONT_SIZE);
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:int __shfl(int var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:unsigned int __shfl(unsigned int var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:float __shfl(float var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:double __shfl(double var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:long __shfl(long var, int src_lane, int width = warpSize)
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:unsigned long __shfl(unsigned long var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:long long __shfl(long long var, int src_lane, int width = warpSize)
/opt/rocm/include/hip/amd_detail/amd_warp_functions.h:unsigned long long __shfl(unsigned long long var, int src_lane, int width = warpSize) {
/opt/rocm/include/hip/nvidia_detail/nvidia_hip_runtime_api.h:#define __shfl(...)      __shfl_sync(0xffffffff, __VA_ARGS__)
/opt/rocm/include/rocprim/intrinsics/warp_shuffle.hpp:            return __shfl(v, src_lane, width);

See also https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/include/hip/amd_detail/amd_warp_functions.h .

@AuroraPerego
Copy link
Contributor Author

I've rebased it and I think it is ready for review

{
using ImplementationBase = concepts::ImplementationBase<ConceptWarp, TWarp>;
return trait::Shfl<ImplementationBase>::shfl(warp, value, srcLane, width ? width : getSize(warp));
}

//! shfl for float vals
//! Exchange data between threads within a warp.
//! It copy from a lane with lower ID relative to caller.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either

Suggested change
//! It copy from a lane with lower ID relative to caller.
//! It copies from a lane with lower ID relative to caller.

or

Suggested change
//! It copy from a lane with lower ID relative to caller.
//! Copy from a lane with lower ID relative to caller.

//! __shared__ int32_t values[warpsize];
//! values[threadIdx.x] = value;
//! __syncthreads();
//! return values[(-delta + width*floor(threadIdx.x/width))%width];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formula seems wrong:

  • threadIdx.x/width will be an integer between 0 and blockDim.x / width
  • passing it through floor(...) is unnecessary, and will be the same value
  • width * floor(threadIdx.x/width) will be a multiple of width
  • (-delta + width*floor(threadIdx.x/width)) or (width*floor(threadIdx.x/width) - delta) will be a multiple of width, minus delta
  • then, (...) % width is equivalent to (width - delta) % width

which is the same value independently of threadIdx.x.

@AuroraPerego AuroraPerego force-pushed the shfl branch 2 times, most recently from 10feef9 to 3d2370a Compare November 6, 2023 14:08
//! __shared__ int32_t values[warpsize];
//! values[threadIdx.x] = value;
//! __syncthreads();
//! return values[width*(threadIdx.x/width) + threadIdx.x%width - delta];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

width*(threadIdx.x/width) + threadIdx.x%width - delta is equal to threadIdx.x - delta, so this is simply

Suggested change
//! return values[width*(threadIdx.x/width) + threadIdx.x%width - delta];
//! return values[threadIdx.x - delta];

What was it supposed to be ?

Copy link
Contributor

@fwyzard fwyzard Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be something like

    //!     return (threadIdx.x % width >= delta) ? values[threadIdx.x - delta] : values[threadIdx.x];

int srcLane,
std::int32_t width) -> float
std::int32_t width) -> T
{
# if defined(ALPAKA_ACC_GPU_CUDA_ENABLED)
return __shfl_sync(activemask(warp), val, srcLane, width);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this was already there, but shouldn't this be

return __shfl_sync(0xffffffff, val, srcLane, width);

instead ?

Possible preceded by a __syncwarp() ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from #1273.
@psychocoderHPC what do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, HIP does

#if CUDA_VERSION >= CUDA_9000
#define __shfl(...)      __shfl_sync(0xffffffff, __VA_ARGS__)
#define __shfl_up(...)   __shfl_up_sync(0xffffffff, __VA_ARGS__)
#define __shfl_down(...) __shfl_down_sync(0xffffffff, __VA_ARGS__)
#define __shfl_xor(...)  __shfl_xor_sync(0xffffffff, __VA_ARGS__)
#endif // CUDA_VERSION >= CUDA_9000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with other backends may be return __shfl_sync(0xffffffff, val, srcLane, width); but I am not fully sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwyzard
Copy link
Contributor

fwyzard commented Dec 9, 2023

@SimeonEhrig the OSX debug builds are failing with

Error: You are using macOS 11.
We (and Apple) do not provide support for this old version.

Is it expected ?

Comment on lines 145 to 153
std::int32_t offset_int = static_cast<std::int32_t>(offset);
auto const actual_group = warp.m_item_warp.get_sub_group();
auto actual_item_id = static_cast<std::int32_t>(actual_group.get_local_linear_id());
auto const actual_group_id = actual_item_id / width;
auto const actual_src_id = actual_item_id - offset_int;
auto const src = actual_src_id >= actual_group_id * width
? sycl::id<1>{static_cast<std::size_t>(actual_src_id)}
: sycl::id<1>{static_cast<std::size_t>(actual_item_id)};
return sycl::select_from_group(actual_group, value, src);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AuroraPerego do you think it would be worth checking if width is the same as get_sub_group().get_max_local_range(), and in that case call shift_group_right(actual_group, value, offset) directly, instead of select_from_group ?

{
std::int32_t offset_int = static_cast<std::int32_t>(offset);
auto const actual_group = warp.m_item_warp.get_sub_group();
auto actual_item_id = static_cast<std::int32_t>(actual_group.get_local_linear_id());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can also be const, since all the other ones are.

warp::WarpGenericSycl<TDim> const& warp,
T value,
std::int32_t mask,
std::int32_t /*width*/)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CUDA version does make use of width:

__shfl_xor_sync() calculates a source line ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.

Can you implement the same behaviour ?

Comment on lines 119 to 126
// /* If width < srcLane the sub-group needs to be split into assumed subdivisions. The first item of
// each
// subdivision has the assumed index 0. The srcLane index is relative to the subdivisions.

Example: If we assume a sub-group size of 32 and a width of 16 we will receive two subdivisions:
The first starts at sub-group index 0 and the second at sub-group index 16. For srcLane = 4 the
first subdivision will access the value at sub-group index 4 and the second at sub-group index 20. */
// Example: If we assume a sub-group size of 32 and a width of 16 we will receive two subdivisions:
// The first starts at sub-group index 0 and the second at sub-group index 16. For srcLane = 4 the
// first subdivision will access the value at sub-group index 4 and the second at sub-group
// index 20. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the double comment ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

auto const actual_src_id = static_cast<std::size_t>(srcLane + actual_group_id * width);
auto const src = sycl::id<1>{actual_src_id};
std::uint32_t const w = static_cast<std::uint32_t>(width);
unsigned int const start_index = actual_group.get_local_linear_id() / w * w;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unsigned int const start_index = actual_group.get_local_linear_id() / w * w;
std::uint32_t const start_index = actual_group.get_local_linear_id() / w * w;

for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed also the others

@AuroraPerego AuroraPerego force-pushed the shfl branch 2 times, most recently from 2babf41 to 7b92bcd Compare December 12, 2023 09:18
std::int32_t width)
{
auto const actual_group = warp.m_item_warp.get_sub_group();
std::uint32_t const w = static_cast<std::uint32_t>(width);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you simply use width directly ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it gives a warning when doing operations with the thread id that is unsigned:
conversion to 'uint32_t' {aka 'unsigned int'} from 'int32_t' {aka 'int'} may change the sign of the result [-Wsign-conversion]

@fwyzard
Copy link
Contributor

fwyzard commented Dec 12, 2023

I think you should update the license at the top of include/alpaka/warp/WarpGenericSycl.hpp:

 /* Copyright 2023 Jan Stephan, Luca Ferragina, Andrea Bocci, Aurora Perego
  * SPDX-License-Identifier: MPL-2.0
+ *
+ * The implementations of Shfl::shfl(), ShflUp::shfl_up(), ShflDown::shfl_down() and ShflXor::shfl_xor() are derived from Intel DPCT.
+ * Copyright (C) Intel Corporation
+ * SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+ * See https://llvm.org/LICENSE.txt for license information.
  */

@AuroraPerego AuroraPerego force-pushed the shfl branch 4 times, most recently from 63be4dd to 063dfe1 Compare December 12, 2023 10:24
fwyzard
fwyzard previously approved these changes Dec 12, 2023
@fwyzard
Copy link
Contributor

fwyzard commented Dec 12, 2023

@psychocoderHPC can you merge this PR ?
I assume the Xcode failures on OSX are unrelated.

@psychocoderHPC
Copy link
Member

I removed the XCode test from the require test list and will have a short look to this PR.

@psychocoderHPC psychocoderHPC merged commit 6312741 into alpaka-group:develop Dec 12, 2023
21 of 23 checks passed
@AuroraPerego AuroraPerego deleted the shfl branch April 23, 2024 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants