SYCL: using SYCL group algorithm API instead of old style for sub group shift utilities #635

Merged
10 commits from guangyey/shuffle_down merged into main on Aug 2, 2024

Conversation

Contributor

@guangyey commented Jul 23, 2024

Change 1:
SYCL is shifting to the SYCL group algorithm API to unify sub-group and work-group APIs, e.g. reduce_over_group, group_barrier, and shift_group here. The old style (a separate member function on each class) is being deprecated.

Old (deprecated)           SYCL 2020
sg.shuffle_down(x, 1)      sycl::shift_group_left(sg, x, 1)
sg.shuffle_up(x, 1)        sycl::shift_group_right(sg, x, 1)
sg.shuffle(x, id)          sycl::select_from_group(sg, x, id)
sg.shuffle_xor(x, mask)    sycl::permute_group_by_xor(sg, x, mask)

FYI: Please don't use sg.shuffle members anymore since they are deprecated.
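
For illustration, here is a minimal sketch of what the migration looks like inside a kernel body (the names shift_left_example, sg, x, and delta are hypothetical, not code from this PR):

#include <sycl/sycl.hpp>

// Sketch only: the deprecated member-function style versus the SYCL 2020
// group algorithm free function.
template <typename T>
T shift_left_example(sycl::sub_group sg, T x, unsigned delta) {
  // Old, deprecated style:
  //   return sg.shuffle_down(x, delta);
  // SYCL 2020 group algorithm:
  return sycl::shift_group_left(sg, x, delta);
}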

Change 2:
sycl::shift_group_xxx is more restrictive than sg.shuffle: it requires the shifted object to be C++ trivially copyable. We implemented a private pair type instead of std::pair in this commit.
FYI: CUDA uses thrust::pair in kernels.
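
A minimal sketch of the idea behind the private pair (the real implementation is in src/comm/XPUPair.h; the sketch namespace and names below are illustrative only): an aggregate pair of trivially copyable members is itself trivially copyable, so it satisfies the sycl::shift_group_left requirement.

#include <type_traits>

namespace sketch {
// Plain aggregate: no user-declared special member functions, so it stays
// trivially copyable whenever T1 and T2 are.
template <typename T1, typename T2>
struct pair {
  T1 first;
  T2 second;
};
} // namespace sketch

static_assert(
    std::is_trivially_copyable_v<sketch::pair<unsigned char, unsigned char>>,
    "an aggregate pair of trivially copyable members is trivially copyable");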

@fengyuan14 changed the title from "subgroup shuffle memeber are dpreacated, use sycl function instead" to "SYCL: using SYCL group algorithm API instead of old style for sub group utilities" on Jul 24, 2024
@fengyuan14 changed the title from "SYCL: using SYCL group algorithm API instead of old style for sub group utilities" to "SYCL: using SYCL group algorithm API instead of old style for sub group shift utilities" on Jul 24, 2024
@fengyuan14
Contributor

What's the bundle required?

@fengyuan14 marked this pull request as draft July 24, 2024 00:46
@guangyey
Contributor Author

What's the bundle required?

The reason is that std::pair is not supported by sycl::shift_group_left. @xytintel is going to implement a custom pair structure, a counterpart to thrust::pair, that is supported by sycl::shift_group_left, and use the custom pair in our SYCL kernel code.

@fengyuan14
Contributor

/opt/intel/oneapi/pytorch-gpu-dev-0.5/include/sycl/group_algorithm.hpp:541:1: note: candidate template ignored: requirement 'std::is_trivially_copyable_v<std::pair<unsigned char, unsigned char>> || detail::is_vec<std::pair<unsigned char, unsigned char>>::value' was not satisfied [with Group = sub_group, T = std::pair<unsigned char, unsigned char>]

I cannot fully understand the failure. According to the SYCL spec, std::pair is supposed to be trivially copyable.

@fengyuan14
Contributor

Verified on C++17: std::pair is not a trivially copyable type.

(dev) fengyuan@fy-9900:~/workspace/test$ cat test.cpp
#include <iostream>
#include <utility>
#include <type_traits>

int main() {
  std::cout << std::is_trivially_copyable<std::pair<unsigned char, unsigned char>>::value << std::endl;
  return 0;
}
(dev) fengyuan@fy-9900:~/workspace/test$ g++ -std=c++17 test.cpp  -o test
(dev) fengyuan@fy-9900:~/workspace/test$ ./test
0

@fengyuan14
Contributor

/opt/intel/oneapi/pytorch-gpu-dev-0.5/include/sycl/group_algorithm.hpp:541:1: note: candidate template ignored: requirement 'std::is_trivially_copyable_v<std::pair<unsigned char, unsigned char>> || detail::is_vec<std::pair<unsigned char, unsigned char>>::value' was not satisfied [with Group = sub_group, T = std::pair<unsigned char, unsigned char>]

I cannot fully understand the failure. According to the SYCL spec, std::pair is supposed to be trivially copyable.

The spec says the std structures it lists form an additional list of types that are device copyable in SYCL, on top of trivially copyable types.

@fengyuan14
Contributor

Verified,

(dev) fengyuan@fy-9900:~/workspace/test$ cat test.cpp
#include <iostream>
#include <utility>
#include <type_traits>
#include <tuple>
#include <sycl/sycl.hpp>

int main() {
  std::cout << std::is_trivially_copyable<std::pair<unsigned char, unsigned char>>::value << std::endl;
  std::cout << sycl::is_device_copyable<std::pair<unsigned char, unsigned char>>::value << std::endl;
  return 0;
}
(dev) fengyuan@fy-9900:~/workspace/test$ icpx -fsycl -std=c++17 test.cpp  -o test
(dev) fengyuan@fy-9900:~/workspace/test$ ./test
0
1

@jbrodman

Hi - I think several cases where you're migrating to shifts or permutes could also be greatly simplified by using SYCL 2020 reduce_over_group at either sub-group or work-group scope.
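
For example, here is a hedged sketch (not code from this PR; sub_group_sum, sg, and val are hypothetical names) of how a hand-rolled shuffle reduction collapses into a single call:

#include <sycl/sycl.hpp>

// Sketch only: a sub-group sum using the SYCL 2020 group algorithm instead of
// a manual shift_group_left tree reduction.
inline float sub_group_sum(sycl::sub_group sg, float val) {
  // Manual version, for comparison:
  //   for (unsigned d = sg.get_local_linear_range() / 2; d > 0; d /= 2)
  //     val += sycl::shift_group_left(sg, val, d);
  return sycl::reduce_over_group(sg, val, sycl::plus<float>());
}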

@jbrodman

Additionally, any calls to nd_item::barrier should move to sycl::group_barrier.
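
A minimal sketch of that change (the argument name item is hypothetical):

#include <sycl/sycl.hpp>

// Sketch only: migrating from the nd_item member barrier to the free function.
inline void sync_work_group(sycl::nd_item<1> item) {
  // Old style:
  //   item.barrier(sycl::access::fence_space::local_space);
  // SYCL 2020 group algorithm:
  sycl::group_barrier(item.get_group());
}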

@fengyuan14
Contributor

fengyuan14 commented Jul 27, 2024

Additionally, any calls to nd_item::barrier should move to sycl::group_barrier.

@jbrodman Thanks for the reminder.

Yes, it's in our plan. We reviewed all the APIs that should move to the group algorithm API in IPEX two years ago, but at that time these APIs, like group_barrier and reduce_over_group, had some performance issues, so we didn't adopt them. For example, the memory fence implied by group_barrier was not as performant as sycl::nd_item::barrier.

We should indeed follow the SYCL spec, move to the new APIs, and work with the implementation team to optimize performance.

@fengyuan14
Contributor

@xytintel Ready to preci?

@xytintel
Contributor

@xytintel Ready to preci?

I think yes

@xytintel marked this pull request as ready for review July 30, 2024 07:10
src/comm/XPUPair.h (outdated)

// specializations for tuple_size
template <>
struct tuple_size<tuple<>> {

Contributor Author

I am not sure whether tuple and tuple_size are valid structs here, since they are only forward-declared.

@jbrodman

We just pushed a fix into IGC that should solve the barrier performance issue. It would be a huge benefit for the SYCL compiler and runtime to dogfood things like the group algorithms so we can make sure they're performing properly.

Comment on lines 49 to 54
template <typename T1, typename T2>
inline void swap(T1& a, T2& b) {
T1 temp = a;
a = b;
b = temp;
}

Why is this reimplementing swap instead of using std::swap? It doesn't work correctly if T1 or T2 has a custom swap specialization.

Comment on lines 75 to 78
inline void swap(pair& p) {
swap(first, p.first);
swap(second, p, second);
}

Should use std::swap; also, there is a typo: "p, second".

Suggested change
inline void swap(pair& p) {
swap(first, p.first);
swap(second, p, second);
}
inline void swap(pair& p) {
using std::swap;
swap(first, p.first);
swap(second, p.second);
}


template <unsigned int N, typename T1, typename T2>
inline typename tuple_element<N, pair<T1, T2>>::type& get(pair<T1, T2>& p) {
return detail::pair_get<N, pair<T1, T2>>()(p);

Unnecessarily complicated.

Suggested change
return detail::pair_get<N, pair<T1, T2>>()(p);
if constexpr (N == 0) return p.first; else return p.second;

Contributor

Agree with your proposal. I will remove the unnecessary code.

@fengyuan14
Contributor

We just pushed a fix into IGC that should solve the barrier performance issue. It would be a huge benefit for the SYCL compiler and runtime to dogfood things like the group algorithms so we can make sure they're performing properly.

We will change the APIs gradually. Thanks.

@fengyuan14
Contributor

@rolandschulz Any more comments?

@rolandschulz left a comment

no other comments

pair(const std::pair<U1, U2>& p) : first(p.first), second(p.second) {}

inline void swap(pair& p) {
std::swap(first, p.first);

This still doesn't work with custom swap (e.g. if you have pair<pair<...>...>). Correct use needs to enable ADL. For details see e.g. https://stackoverflow.com/questions/28130671/how-does-using-stdswap-enable-argument-dependent-lookup-adl.

Nit: This member function seems unnecessary. The implementation could go directly into the free-function.
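
A sketch of the ADL-enabled pattern being suggested (hypothetical sketch namespace, not the final XPUPair.h code):

#include <utility>

namespace sketch {

template <typename T1, typename T2>
struct pair {
  T1 first;
  T2 second;
};

// Free function only, no member swap. `using std::swap` plus an unqualified
// call lets ADL pick up custom swap overloads for T1 and T2 (e.g. nested
// sketch::pair members).
template <typename T1, typename T2>
void swap(pair<T1, T2>& x, pair<T1, T2>& y) {
  using std::swap;
  swap(x.first, y.first);
  swap(x.second, y.second);
}

} // namespace sketch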

Contributor

@xytintel Please address.

Contributor

Removed


template <typename T1, typename T2>
inline void swap(pair<T1, T2>& x, pair<T1, T2>& y) {
return x.swap(y);

you removed it but didn't move the implementation.

Contributor

Done

@fengyuan14 added this pull request to the merge queue Aug 2, 2024
Merged via the queue into main with commit 9ea0728 Aug 2, 2024
2 checks passed
@fengyuan14 deleted the guangyey/shuffle_down branch August 2, 2024 01:54
dvrogozh added a commit to dvrogozh/pytorch that referenced this pull request Aug 14, 2024
Changes:
* Added a hack to fix 2035 in oneDNN
* Commented out oneapi specific location for libOpenCL.so
* Added hacks in torch-xpu-ops to WA dpc++ and intel/llvm behavior differences

Above are hacks which need proper resolutions.

See: oneapi-src/oneDNN#2035
Requires: intel/torch-xpu-ops#635
Signed-off-by: Dmitry Rogozhkin <[email protected]>