Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCLCompat] Optimize/(fix?) permute_sub_group_by_xor if logical_sub_group_size == 32 #16646

Open
wants to merge 3 commits into
base: sycl
Choose a base branch
from

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Jan 15, 2025

syclcompat::permute_sub_group_by_xor was reported to flakily fail on L0. Closer inspection revealed that the implementation of permute_sub_group_by_xor is incorrect for cases where logical_sub_group_size != 32, which is one of the test cases. This implies that the test itself is wrong.

In this PR we first optimize the part of the implementation that is valid assuming that Intel spirv builtins are correct (which is also the only case realistically a user will program): case logical_sub_group_size == 32, in order to:

  • Ensure the only useful case is working via the correct optimized route.
  • Check that this improvement doesn't break the suspicious test.

A follow on PR can fix the other cases where logical_sub_group_size != 32: this is better to do later, since

  • the only use case I know of for this is to implement non-uniform group algorithms that we already have implemented (e.g. see [SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671) and any user is advised to use such algorithms instead of reimplementing them themselves.
  • This must I think require a complete reworking of the test and would otherwise delay the more important change here.

@JackAKirk JackAKirk requested a review from a team as a code owner January 15, 2025 16:31
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Jan 15, 2025

syclomatic translates __shfl_xor_sync() to permute_sub_group_by_xor (
__shfl_xor_sync() is defined as (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-description):

"
__shfl_xor_sync() calculates a source line ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
"

However as per its own description

https://github.com/intel/llvm/blob/sycl/sycl/include/syclcompat/util.hpp#L291

permute_sub_group_by_xor is implemented according to a different definition unless logical_sub_group_size == 32.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant