Enable SM90 via sycl-cuda-compat #276

FMarno · 2025-03-24T17:04:31Z

Builds on #266 to enable a number of examples to run on NVIDIA Hopper with minimal changes to the code. This is achieved through the use of the sycl-cuda-compat flag.
This is a big step towards easy compatibility with upstream CUTLASS.
Shoutout to @Naghasan for developing the sycl-cuda-compat feature for dpcpp.

rolandschulz · 2025-04-01T15:42:00Z

tools/util/include/cutlass/util/reference/device/gett.hpp

+    decltype(A),
+    decltype(B),
+    decltype(C),
+    decltype(D),


This should be D, A, B, C.

It's quite ugly and error-prone that type deduction doesn't work here.
I'm wondering whether it is worth adding a macro so that we can pass a type representing the function and make type deduction for the arguments work.
It could look like this: https://godbolt.org/z/8cxcM15ba
This can be done quite a bit more nicely with C++20: https://godbolt.org/z/7fK1M5he3. But I don't know whether it is reasonable to require C++20 for CUTLASS_ENABLE_SYCL.

FMarno · Apr 3, 2025

The function parameters and the function template parameters are not in the same order, so the order A, B, C, D is correct. I agree, though, that this is a sharp edge of syclcompat.
I think this would have to be handled at the syclcompat level, since the API of syclcompat::experimental::launch expects a function as a parameter.

template <class ATensor, class BTensor, class CTensor, class DTensor,
          class ElementAccumulator, class ElementEpilogue>
CUTLASS_GLOBAL void gett_kernel(
    DTensor D,
    ATensor const A,
    BTensor const B,
    CTensor const C,
    ElementEpilogue alpha,
    ElementEpilogue beta,
    ElementAccumulator acc_init) {
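To make the ordering pitfall concrete, here is a minimal sketch in plain C++17 with no SYCL dependency. `launch_by_pointer` and `toy_kernel` are hypothetical stand-ins for a launcher that takes the kernel as a template argument and for `gett_kernel`; they are not the syclcompat API. The point it tries to show is that explicit template arguments follow the template parameter order, not the function parameter order, and that listing them the other way round can still compile when the types are implicitly convertible.

```cpp
#include <cassert>

// Hypothetical stand-in for a launcher that receives the kernel as a
// template argument; this is NOT the real syclcompat API.
template <auto Kernel, class... Args>
auto launch_by_pointer(Args... args) {
  return Kernel(args...);
}

// Like gett_kernel, the template parameters (A first, D second) are in a
// different order from the function parameters (d first, a second).
template <class AType, class DType>
double toy_kernel(DType d, AType a) {
  return static_cast<double>(d) + static_cast<double>(a);
}

// Template arguments listed in template-parameter order: <A, D>.
inline double call_in_template_order() {
  int d = 1;
  double a = 2.5;
  return launch_by_pointer<toy_kernel<decltype(a), decltype(d)>>(d, a);
}

// Template arguments listed in function-parameter order: <D, A>. This still
// compiles, but `a` is implicitly converted to int and loses its fraction.
inline double call_in_function_order() {
  int d = 1;
  double a = 2.5;
  return launch_by_pointer<toy_kernel<decltype(d), decltype(a)>>(d, a);
}

// Small self-check of the two call styles.
inline void demo_ordering() {
  assert(call_in_template_order() == 3.5);  // intended result: 1 + 2.5
  assert(call_in_function_order() == 3.0);  // 2.5 was truncated to 2
}
```

With tensor types that do not convert to each other, the second form would fail to compile instead, which is the friendlier outcome.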

rolandschulz · Apr 3, 2025

Yes, I overlooked that the template arguments are in a different order.

I agree that if we want to use syclcompat for launch, we would need to ask them to add this.

Do we gain anything from using syclcompat::launch here rather than directly calling default_queue().submit(...)? The only thing I see syclcompat::launch doing is calling transform_nd_range, but for 3D that doesn't do anything.

FMarno · Apr 3, 2025

In this case, the only benefit is handling the conversion of sycl_grid and sycl_block into sycl::nd_range. The code you're suggesting would look like this:

const syclcompat::dim3 sycl_grid(dimGrid.x, dimGrid.y, dimGrid.z);
const syclcompat::dim3 sycl_block(dimBlock.x, dimBlock.y, dimBlock.z);
syclcompat::get_default_queue().parallel_for(
    sycl::nd_range<3>{sycl_grid * sycl_block, sycl_block},
    [=](sycl::nd_item<3>) {
      [[clang::always_inline]] gett_kernel(D, A, B, C, alpha, beta, ElementAccumulator(0));
    });

For cases that use syclcompat::experimental::launch, it also handles properties like SLM size, cluster parameters, subgroup size etc.
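As an illustration of the grid/block conversion being discussed, the sketch below computes the global and local ranges of an nd_range from CUDA-style grid and block sizes in plain C++. `Dim3`, `Range3`, `NdRange3`, and `make_nd_range` are hypothetical stand-ins so that the example builds without a SYCL toolchain. The reversed index order (x mapped to the last range dimension) is an assumption about how syclcompat::dim3 converts to sycl::range<3> and should be checked against the syclcompat headers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical CUDA-style dimension triple (x is the fastest-moving index).
struct Dim3 {
  std::size_t x = 1, y = 1, z = 1;
};

// Element-wise product: grid size (in blocks) times block size (in threads).
inline Dim3 operator*(const Dim3& a, const Dim3& b) {
  return Dim3{a.x * b.x, a.y * b.y, a.z * b.z};
}

// Stand-in for sycl::range<3>.
using Range3 = std::array<std::size_t, 3>;

// ASSUMPTION: x maps to the last range dimension, so the order is reversed.
inline Range3 to_range(const Dim3& d) { return Range3{d.z, d.y, d.x}; }

// Stand-in for sycl::nd_range<3>: total work-items plus work-group size.
struct NdRange3 {
  Range3 global;
  Range3 local;
};

inline NdRange3 make_nd_range(const Dim3& grid, const Dim3& block) {
  return NdRange3{to_range(grid * block), to_range(block)};
}
```

For example, a grid of (4, 2, 1) blocks with (128, 1, 1) threads each gives a global range of {1, 2, 512} and a local range of {1, 1, 128} under the stated assumption.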

rolandschulz · Apr 3, 2025

I guess we should try to solve it in syclcompat (if at all). Do you want to suggest it, or do you want me to open an issue?

FMarno · Apr 3, 2025

I'll open an issue on the intel/llvm repo and bring it up internally within Codeplay. I'll also send you a link to the issue in case you want to pass it on to anyone.

I've created that issue here: intel/llvm#17832

FMarno · Apr 7, 2025

@rolandschulz After some internal discussion at Codeplay, we're not sure it can be solved in C++17 without modifying the syclcompat API to accept a functor class as the template argument, which could be considered straying from the purpose of syclcompat.
Is it OK if I resolve this for now so the code can be merged?
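For context, a functor-based launch of the kind described in this comment could look roughly like the following plain C++17 sketch. `launch_functor`, `SumKernelFunctor`, and `sum_kernel` are hypothetical names used only for illustration; nothing here is part of the existing syclcompat API. Because the functor's call operator is itself a template, its parameters are deduced from the arguments and no decltype list has to be written by hand.

```cpp
#include <cassert>

// Free-function kernel whose template parameter order (A, D) differs from
// its function parameter order (d, a), as with gett_kernel.
template <class AType, class DType>
double sum_kernel(DType d, AType a) {
  return static_cast<double>(d) + static_cast<double>(a);
}

// Hypothetical functor wrapper: the call operator's template parameters are
// deduced from the call arguments, and the mapping onto sum_kernel's
// template parameter order is written once, in one place.
struct SumKernelFunctor {
  template <class DType, class AType>
  double operator()(DType d, AType a) const {
    return sum_kernel<AType, DType>(d, a);
  }
};

// Hypothetical launcher taking a functor *type* rather than a function
// pointer as its template argument.
template <class Functor, class... Args>
auto launch_functor(Args... args) {
  return Functor{}(args...);
}

// The caller names only the functor; argument types are deduced.
inline double call_with_deduction() {
  int d = 1;
  double a = 2.5;
  return launch_functor<SumKernelFunctor>(d, a);
}
```

The trade-off is one wrapper struct per kernel, which is the kind of API change the comment above is hesitant about.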

examples/52_hopper_gather_scatter_fusion/gather_kernel.cuh

aacostadiaz

Great!!!

rolandschulz · 2025-04-12T18:10:18Z

tools/util/include/cutlass/util/reference/device/gemm_complex.h

+  syclcompat::dim3 sycl_grid(grid.x, grid.y, grid.z);
+  syclcompat::dim3 sycl_block(block.x, block.y, block.z);


What's the point of these two lines? Couldn't you use dimBlock/dimGrid directly?

FMarno · Apr 15, 2025

There is no conversion between dim3 and syclcompat::dim3.
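Since neither type converts to the other, one option is a small helper that copies the three components explicitly. The sketch below is a generic version in plain C++; `CudaLikeDim3` and `CompatLikeDim3` are hypothetical stand-ins for CUDA's dim3 and syclcompat::dim3 so that it builds without either toolkit, and `convert_dim3` is an illustrative name, not an existing utility.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3.
struct CudaLikeDim3 {
  unsigned x = 1, y = 1, z = 1;
};

// Hypothetical stand-in for syclcompat::dim3: constructible from three
// components, but with no conversion from CudaLikeDim3.
struct CompatLikeDim3 {
  unsigned x, y, z;
  CompatLikeDim3(unsigned x_, unsigned y_ = 1, unsigned z_ = 1)
      : x(x_), y(y_), z(z_) {}
};

// Copies x, y, z from any source with those members into any target that
// can be constructed from three components.
template <class To, class From>
To convert_dim3(const From& d) {
  return To(d.x, d.y, d.z);
}
```

A call such as `convert_dim3<CompatLikeDim3>(grid)` would then replace the two hand-written constructor lines at each launch site.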

FMarno · 2025-05-14T16:34:30Z

tools/util/include/cutlass/util/mixed_dtype_utils.hpp

 #endif
+  CUDA_CHECK(cudaStreamSynchronize(stream));


Changes in these files might be bad.

FMarno · 2025-05-14T16:35:55Z

TODO: see if the launch syntax from here can simplify some areas
https://github.com/codeplaysoftware/cutlass-sycl/pull/305/files

FMarno force-pushed the finlay/enable_sm90_sycl-cuda-compat branch 4 times, most recently from 003b9d9 to c327f8e on April 1, 2025 08:51

FMarno marked this pull request as ready for review April 1, 2025 08:51

rolandschulz reviewed Apr 1, 2025

FMarno mentioned this pull request Apr 3, 2025

[COMPAT] Launch template argument deduction is easy to get wrong intel/llvm#17832

Open

aacostadiaz reviewed Apr 4, 2025

examples/52_hopper_gather_scatter_fusion/gather_kernel.cuh (resolved)

aacostadiaz approved these changes Apr 4, 2025

aacostadiaz mentioned this pull request Apr 4, 2025

Enable cuda compatibilty mode with SYCL #185

Closed

joeatodd mentioned this pull request Apr 8, 2025

Organise SYCL Examples #297

Merged

rolandschulz reviewed Apr 12, 2025

FMarno mentioned this pull request Apr 16, 2025

RFC: test out new syntax for launch with type deduction #305

Open

FMarno added 7 commits May 14, 2025 17:32

Changes to enable sycl-cuda-compat for sm90a

b9aac74

swap __global__ for CUTLASS_GLOBAL

6fa9d06

fix float4 not in cutlass namespace

83d700b

fix benchmark linking for nvidia

70cbd08

update nvidia test arch

04962b4

fix gett args

895f9ed

Revert "fix gett args"

ecd0a1f

This reverts commit aa1e8b2.

FMarno force-pushed the finlay/enable_sm90_sycl-cuda-compat branch from 4aae5fd to ecd0a1f on May 14, 2025 16:33

FMarno commented May 14, 2025

FMarno added 2 commits May 14, 2025 17:45

fix bad merge

dfc1d54

Merge remote-tracking branch 'origin/sycl-develop' into finlay/enable_sm90_sycl-cuda-compat

1d90638

FMarno force-pushed the finlay/enable_sm90_sycl-cuda-compat branch from ed65128 to 1d90638 on May 14, 2025 16:46

Ruyk self-assigned this May 22, 2025
