[SYCLCompat] Optimize/(fix?) permute_sub_group_by_xor if `logical_sub_group_size == 32` #16646

JackAKirk · 2025-01-15T16:30:59Z

syclcompat::permute_sub_group_by_xor was reported to fail intermittently on Level Zero (L0). Closer inspection revealed that the implementation of permute_sub_group_by_xor is incorrect when logical_sub_group_size != 32, and one of the test cases exercises exactly that case. This implies that the test itself is wrong.

In this PR we first optimize the part of the implementation that is valid, assuming the Intel SPIR-V builtins are correct: the case logical_sub_group_size == 32, which is also the only case a user will realistically program. The aim is to:

Ensure the only useful case is working via the correct optimized route.
Check that this improvement doesn't break the suspicious test.

A follow-on PR can fix the other cases where logical_sub_group_size != 32. This is better done later, since:

The only use case I know of for this is to implement non-uniform group algorithms that we already have implemented (e.g. see [SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671), and any user is advised to use such algorithms instead of reimplementing them.
I think it will require a complete reworking of the test, which would otherwise delay the more important change here.

Signed-off-by: JackAKirk <[email protected]>

JackAKirk · 2025-01-15T16:38:16Z

SYCLomatic translates __shfl_xor_sync() to permute_sub_group_by_xor.
__shfl_xor_sync() is defined as follows (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-description):

"
__shfl_xor_sync() calculates a source line ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
"

However, as per its own description

https://github.com/intel/llvm/blob/sycl/sycl/include/syclcompat/util.hpp#L291

permute_sub_group_by_xor is implemented according to a different definition unless logical_sub_group_size == 32.
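
For reference, the __shfl_xor_sync() semantics quoted above can be sketched as a host-side C++ model. This is only my reading of the quoted text, not code from syclcompat or CUDA: the names xor_source_lane, shuffle_xor and kWarpSize are hypothetical, and the sketch assumes that width is a power of two no larger than 32 and that laneMask is below 32.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical constant for this sketch: the CUDA warp size.
constexpr unsigned kWarpSize = 32;

// Returns the lane whose value `lane` receives, following the quoted
// description. Assumes width is a power of two in [1, 32] and lane_mask < 32.
inline unsigned xor_source_lane(unsigned lane, unsigned lane_mask,
                                unsigned width) {
  assert(lane < kWarpSize && lane_mask < kWarpSize);
  assert(width >= 1 && width <= kWarpSize && (width & (width - 1)) == 0);
  const unsigned src = lane ^ lane_mask;
  // Last lane of the caller's own group of `width` consecutive lanes.
  const unsigned last_in_group = (lane / width) * width + (width - 1);
  // The caller's own group and earlier groups are readable; a source lane
  // in a later group makes the caller keep its own value.
  return src <= last_in_group ? src : lane;
}

// Applies the modelled shuffle to one value per lane of a whole warp.
template <typename T>
std::array<T, kWarpSize> shuffle_xor(const std::array<T, kWarpSize>& var,
                                     unsigned lane_mask, unsigned width) {
  std::array<T, kWarpSize> out{};
  for (unsigned lane = 0; lane < kWarpSize; ++lane)
    out[lane] = var[xor_source_lane(lane, lane_mask, width)];
  return out;
}
```

With width == 8 and laneMask == 8, lane 10 reads from lane 2 (an earlier group), whereas lane 2 would have to read from lane 10 (a later group) and so keeps its own value. With width == 32 the model reduces to a plain lane ^ laneMask lookup.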

Signed-off-by: JackAKirk <[email protected]>

Optimize/(fix?) permute_sub_group_by_xor

1b09219

Signed-off-by: JackAKirk <[email protected]>

JackAKirk requested a review from a team as a code owner January 15, 2025 16:31

JackAKirk temporarily deployed to WindowsCILock January 15, 2025 16:32 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock January 15, 2025 17:13 — with GitHub Actions Inactive

Split test into two test cases for easier debugging.

bf13d41

Signed-off-by: JackAKirk <[email protected]>

JackAKirk had a problem deploying to WindowsCILock January 16, 2025 11:30 — with GitHub Actions Error

Add missing host_dev_data_u

affc058

Signed-off-by: JackAKirk <[email protected]>

JackAKirk temporarily deployed to WindowsCILock January 16, 2025 12:11 — with GitHub Actions Inactive

JackAKirk requested a deployment to WindowsCILock January 16, 2025 12:39 — with GitHub Actions In progress
