-
Notifications
You must be signed in to change notification settings - Fork 29
Enable SM90 via sycl-cuda-compat #276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl-develop
Are you sure you want to change the base?
Enable SM90 via sycl-cuda-compat #276
Conversation
2368a71
to
a2254d3
Compare
003b9d9
to
c327f8e
Compare
decltype(A), | ||
decltype(B), | ||
decltype(C), | ||
decltype(D), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be D, A, B, C
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's quite ugly and error prone that type deduction doesn't work here.
I'm wondering whether it is worth to add a macro so that we can pass a type representing function and make type deduction for the arguments work.
Could look like: https://godbolt.org/z/8cxcM15ba
This can be done quite a bit nicer with c++20: https://godbolt.org/z/7fK1M5he3. But I don't know whether it is reasonable to require C++20 for CUTLASS_ENABLE_SYCL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function parameter and the function template parameters are not in the same order, so the order A, B, C, D is correct. I agree though, this is a sharp edge of syclcompat.
I think this would have to be handled at the syclcompat level, since api of syclcompat::experimental::launch
expects a function as a parameter.
template <
class ATensor,
class BTensor,
class CTensor,
class DTensor,
class ElementAccumulator,
class ElementEpilogue>
CUTLASS_GLOBAL
void
gett_kernel(
DTensor D,
ATensor const A,
BTensor const B,
CTensor const C,
ElementEpilogue alpha, ElementEpilogue beta,
ElementAccumulator acc_init)
{
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I overlooked that the template arguments are in a different order.
I agree that if we want to use syclcompat for launch we would need to ask them to add this.
Do we gain anything from using syclcompat::launch
here rather than directly calling default_queue().submit(...)
? Only thing I see syclcompat::launch
does is call transform_nd_range
. But for 3d it doesn't do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the only benefit is handling the conversion of sycl_grid
and sycl_block
into sycl::nd_range
. The code you're suggestions would look like this:
const syclcompat::dim3 sycl_grid(dimGrid.x, dimGrid.y, dimGrid.z);
const syclcompat::dim3 sycl_block(dimBlock.x, dimBlock.y, dimBlock.z);
syclcompat::get_default_queue().parallel_for(sycl::nd_range<3>{sycl_grid * sycl_block, sycl_block},
[=](sycl::nd_item<3>) {
[[clang::always_inline]] gett_kernel(D, A, B, C, alpha, beta, ElementAccumulator(0));
});
For cases that use syclcompat::experimental::launch
, it also handles properties like SLM size, cluster parameters, subgroup size etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should try to solve it in syclcompat (if at all). Do you want to suggest it or do you want me to open an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open an issue on the intel/llvm repo and bring it up internally within Codeplay. I'll also send you a link to the issue if you want to pass it on to anyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created that here #17832
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rolandschulz After a bit of Codeplay internal discussion, we're not sure it can be solved in C++17 without modifying the syclcompat api to accept a functor class as the template argument, which could be considered veering from the point of syclcompat.
Is it ok if I resolve this for now so the code can be merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!!!
syclcompat::dim3 sycl_grid(grid.x, grid.y, grid.z); | ||
syclcompat::dim3 sycl_block(block.x, block.y, block.z); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of these two lines? Couldn't you use dimBlock/dimGrid directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no conversion between dim3
and syclcompat::dim3
Builds on #266 to enable a number of examples running on nvidia hopper with minimal changes to the code. This is achieved through the use of the sycl-cuda-compat flag.
This is a big step towards easy compatibility with the upstream CUTLASS.
Shoutout to @Naghasan for developing the sycl-cuda-compat feature for dpcpp.