SYCL: using SYCL group algorithm API instead of old style for sub group shift utilities #635

Merged
10 commits from guangyey/shuffle_down merged into main on Aug 2, 2024

Conversation

Contributor

@guangyey commented Jul 23, 2024

Change 1:
SYCL is shifting to the SYCL group algorithm API to unify sub-group and work-group APIs, e.g. reduce_over_group, group_barrier, and shift_group here. The old style (a separate member function on each class) is being deprecated.

Old (deprecated)           SYCL 2020
sg.shuffle_down(x, 1)      sycl::shift_group_left(sg, x, 1)
sg.shuffle_up(x, 1)        sycl::shift_group_right(sg, x, 1)
sg.shuffle(x, id)          sycl::select_from_group(sg, x, id)
sg.shuffle_xor(x, mask)    sycl::permute_group_by_xor(sg, x, mask)

FYI: Please don't use sg.shuffle members anymore since they are deprecated.
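
For illustration, here is a minimal sketch of what the migration looks like inside a kernel body (the names shift_left_example, sg, x, and delta are hypothetical, not code from this PR):

#include <sycl/sycl.hpp>

// Sketch only: the deprecated member-function style versus the SYCL 2020
// group algorithm free function.
template <typename T>
T shift_left_example(sycl::sub_group sg, T x, unsigned delta) {
  // Old, deprecated style:
  //   return sg.shuffle_down(x, delta);
  // SYCL 2020 group algorithm:
  return sycl::shift_group_left(sg, x, delta);
}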

Change 2:
sycl::shift_group_xxx is more restrictive than sg.shuffle: it requires the shifted object to be C++ trivially copyable. We implemented a private pair type instead of std::pair in this commit.
FYI: CUDA uses thrust::pair in kernels.
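
A minimal sketch of the idea behind the private pair (the real implementation is in src/comm/XPUPair.h; the sketch namespace and names below are illustrative only): an aggregate pair of trivially copyable members is itself trivially copyable, so it satisfies the sycl::shift_group_left requirement.

#include <type_traits>

namespace sketch {
// Plain aggregate: no user-declared special member functions, so it stays
// trivially copyable whenever T1 and T2 are.
template <typename T1, typename T2>
struct pair {
  T1 first;
  T2 second;
};
} // namespace sketch

static_assert(
    std::is_trivially_copyable_v<sketch::pair<unsigned char, unsigned char>>,
    "an aggregate pair of trivially copyable members is trivially copyable");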

@fengyuan14 changed the title from "subgroup shuffle memeber are dpreacated, use sycl function instead" to "SYCL: using SYCL group algorithm API instead of old style for sub group utilities" on Jul 24, 2024
@fengyuan14 changed the title from "SYCL: using SYCL group algorithm API instead of old style for sub group utilities" to "SYCL: using SYCL group algorithm API instead of old style for sub group shift utilities" on Jul 24, 2024
@fengyuan14
Contributor

What's the bundle required?

@fengyuan14 marked this pull request as draft July 24, 2024 00:46
@guangyey
Contributor Author

What's the bundle required?

The reason is that std::pair is not supported by sycl::shift_group_left. @xytintel is going to implement a custom pair structure, a counterpart to thrust::pair, that is supported by sycl::shift_group_left, and use the custom pair in our SYCL kernel code.

@fengyuan14
Contributor

/opt/intel/oneapi/pytorch-gpu-dev-0.5/include/sycl/group_algorithm.hpp:541:1: note: candidate template ignored: requirement 'std::is_trivially_copyable_v<std::pair<unsigned char, unsigned char>> || detail::is_vec<std::pair<unsigned char, unsigned char>>::value' was not satisfied [with Group = sub_group, T = std::pair<unsigned char, unsigned char>]

I cannot fully understand the failure. According to the SYCL spec, std::pair is supposed to be trivially copyable.

@fengyuan14
Contributor

Verified on C++17: std::pair is not a trivially copyable type.

(dev) fengyuan@fy-9900:~/workspace/test$ cat test.cpp
#include <iostream>
#include <utility>
#include <type_traits>

int main() {
  std::cout << std::is_trivially_copyable<std::pair<unsigned char, unsigned char>>::value << std::endl;
  return 0;
}
(dev) fengyuan@fy-9900:~/workspace/test$ g++ -std=c++17 test.cpp  -o test
(dev) fengyuan@fy-9900:~/workspace/test$ ./test
0

@fengyuan14
Contributor

/opt/intel/oneapi/pytorch-gpu-dev-0.5/include/sycl/group_algorithm.hpp:541:1: note: candidate template ignored: requirement 'std::is_trivially_copyable_v<std::pair<unsigned char, unsigned char>> || detail::is_vec<std::pair<unsigned char, unsigned char>>::value' was not satisfied [with Group = sub_group, T = std::pair<unsigned char, unsigned char>]

I cannot fully understand the failure. According to the SYCL spec, std::pair is supposed to be trivially copyable.

The spec says the std structures it lists form an additional list of types that are device copyable in SYCL, on top of trivially copyable types.

@fengyuan14
Contributor

Verified,

(dev) fengyuan@fy-9900:~/workspace/test$ cat test.cpp
#include <iostream>
#include <utility>
#include <type_traits>
#include <tuple>
#include <sycl/sycl.hpp>

int main() {
  std::cout << std::is_trivially_copyable<std::pair<unsigned char, unsigned char>>::value << std::endl;
  std::cout << sycl::is_device_copyable<std::pair<unsigned char, unsigned char>>::value << std::endl;
  return 0;
}
(dev) fengyuan@fy-9900:~/workspace/test$ icpx -fsycl -std=c++17 test.cpp  -o test
(dev) fengyuan@fy-9900:~/workspace/test$ ./test
0
1

@jbrodman

Hi - I think several cases where you're migrating to shifts or permutes could also be greatly simplified by using SYCL 2020 reduce_over_group at either sub-group or work-group scope.
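
For example, here is a hedged sketch (not code from this PR; sub_group_sum, sg, and val are hypothetical names) of how a hand-rolled shuffle reduction collapses into a single call:

#include <sycl/sycl.hpp>

// Sketch only: a sub-group sum using the SYCL 2020 group algorithm instead of
// a manual shift_group_left tree reduction.
inline float sub_group_sum(sycl::sub_group sg, float val) {
  // Manual version, for comparison:
  //   for (unsigned d = sg.get_local_linear_range() / 2; d > 0; d /= 2)
  //     val += sycl::shift_group_left(sg, val, d);
  return sycl::reduce_over_group(sg, val, sycl::plus<float>());
}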

@jbrodman

Additionally, any calls to nd_item::barrier should move to sycl::group_barrier.
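
A minimal sketch of that change (the argument name item is hypothetical):

#include <sycl/sycl.hpp>

// Sketch only: migrating from the nd_item member barrier to the free function.
inline void sync_work_group(sycl::nd_item<1> item) {
  // Old style:
  //   item.barrier(sycl::access::fence_space::local_space);
  // SYCL 2020 group algorithm:
  sycl::group_barrier(item.get_group());
}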

@fengyuan14
Contributor

fengyuan14 commented Jul 27, 2024

Additionally, any calls to nd_item::barrier should move to sycl::group_barrier.

@jbrodman Thanks for the reminder.

Yes, it's in our plan. We reviewed all the APIs that should move to the group algorithm API in IPEX two years ago, but at that time these APIs, like group_barrier and reduce_over_group, had some performance issues, so we didn't adopt them. For example, the memory fence implied by group_barrier was not as performant as sycl::nd_item::barrier.

We should indeed follow the SYCL spec, move to the new APIs, and work with the implementation team to optimize performance.

@fengyuan14
Contributor

@xytintel Ready to preci?

@xytintel
Contributor

@xytintel Ready to preci?

I think yes

@xytintel marked this pull request as ready for review July 30, 2024 07:10
src/comm/XPUPair.h (outdated)

// specializations for tuple_size
template <>
struct tuple_size<tuple<>> {

Contributor Author

I am not sure whether tuple and tuple_size are valid structs here, since they are only forward-declared.

@jbrodman

We just pushed a fix into IGC that should solve the barrier performance issue. It would be a huge benefit for the SYCL compiler and runtime to dogfood things like the group algorithms so we can make sure they're performing properly.

Comment on lines 49 to 54
template <typename T1, typename T2>
inline void swap(T1& a, T2& b) {
T1 temp = a;
a = b;
b = temp;
}

Why is this reimplementing swap instead of using std::swap? It doesn't work correctly if T1 or T2 has a custom swap specialization.

Comment on lines 75 to 78
inline void swap(pair& p) {
swap(first, p.first);
swap(second, p, second);
}

Should use std::swap; also, there is a typo: "p, second".

Suggested change
inline void swap(pair& p) {
swap(first, p.first);
swap(second, p, second);
}
inline void swap(pair& p) {
using std::swap;
swap(first, p.first);
swap(second, p.second);
}


template <unsigned int N, typename T1, typename T2>
inline typename tuple_element<N, pair<T1, T2>>::type& get(pair<T1, T2>& p) {
return detail::pair_get<N, pair<T1, T2>>()(p);

Unnecessarily complicated.

Suggested change
return detail::pair_get<N, pair<T1, T2>>()(p);
if constexpr (N == 0) return p.first; else return p.second;

Contributor

Agree with your proposal. I will remove the unnecessary code.

@fengyuan14
Contributor

We just pushed a fix into IGC that should solve the barrier performance issue. It would be a huge benefit for the SYCL compiler and runtime to dogfood things like the group algorithms so we can make sure they're performing properly.

We will change the APIs gradually. Thanks.

@fengyuan14
Contributor

@rolandschulz Any more comments?

@rolandschulz left a comment

no other comments

pair(const std::pair<U1, U2>& p) : first(p.first), second(p.second) {}

inline void swap(pair& p) {
std::swap(first, p.first);

This still doesn't work with custom swap (e.g. if you have pair<pair<...>...>). Correct use needs to enable ADL. For details see e.g. https://stackoverflow.com/questions/28130671/how-does-using-stdswap-enable-argument-dependent-lookup-adl.

Nit: This member function seems unnecessary. The implementation could go directly into the free-function.
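
A sketch of the ADL-enabled pattern being suggested (hypothetical sketch namespace, not the final XPUPair.h code):

#include <utility>

namespace sketch {

template <typename T1, typename T2>
struct pair {
  T1 first;
  T2 second;
};

// Free function only, no member swap. `using std::swap` plus an unqualified
// call lets ADL pick up custom swap overloads for T1 and T2 (e.g. nested
// sketch::pair members).
template <typename T1, typename T2>
void swap(pair<T1, T2>& x, pair<T1, T2>& y) {
  using std::swap;
  swap(x.first, y.first);
  swap(x.second, y.second);
}

} // namespace sketch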

Contributor

@xytintel Please address.

Contributor

Removed


template <typename T1, typename T2>
inline void swap(pair<T1, T2>& x, pair<T1, T2>& y) {
return x.swap(y);

you removed it but didn't move the implementation.

Contributor

Done

@fengyuan14 added this pull request to the merge queue Aug 2, 2024
Merged via the queue into main with commit 9ea0728 Aug 2, 2024
2 checks passed
@fengyuan14 deleted the guangyey/shuffle_down branch August 2, 2024 01:54
dvrogozh added a commit to dvrogozh/pytorch that referenced this pull request Aug 14, 2024
Changes:
* Added a hack to fix 2035 in oneDNN
* Commented out oneapi specific location for libOpenCL.so
* Added hacks in torch-xpu-ops to WA dpc++ and intel/llvm behavior differences

Above are hacks which need proper resolutions.

See: oneapi-src/oneDNN#2035
Requires: intel/torch-xpu-ops#635
Signed-off-by: Dmitry Rogozhkin <[email protected]>