sycl : Implemented reorder Q4_K mmvq #13109
base: master
Conversation
Signed-off-by: Alberto Cabrera <[email protected]>
```diff
@@ -3636,22 +3664,65 @@ static void reorder_qw(char *data_device, const int ncols, const int nrows,
     sycl::free(tmp_buf, *stream);
 }

-static void reorder_qw(ggml_tensor * src0, dpct::queue_ptr stream) {
+static void reorder_qw_q4_k(char * data_device, size_t size, size_t offset, dpct::queue_ptr stream) {
```
Question: is there a specific reason `data_device` is declared as a `char *` instead of a `uint8_t *`, especially considering it's later cast to `uint8_t *` as `qs_ptr` anyway?
- Could you share the GPU type used for the test results above?
- Have you tested the PR with the local unit tests?
- Could you check the detailed output of a Q4_K LLM? I suspect the output may differ from the legacy code.
```cpp
// Dispatch becomes obscure with the reorder: when the reorder optimization
// is enabled, MMVQ takes precedence over DMMV, so the current if-else
// implementation requires disabling DMMV if both conditions are met.
|| (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
```
I have the same comment and concern: this code impacts the code path below, which could lead to wrong results. I suggest this PR only optimize the `mmvq()` function. You could open another PR to optimize by changing the code path, for example via this condition.
```diff
@@ -2968,14 +2994,17 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
         // KQ + KQV multi-batch
         ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
     } else if (use_dequantize_mul_mat_vec) {
-        ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
+        // save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
+        constexpr bool convert_src1_to_q8_1 = false;
```
Could you follow the solution of PR #13003? It fixed the base issue of reorder Q4_0.

@sgeor255 We need to promote the SYCL backend in related cases. :)
```cpp
const int is = 2 * il;
const int n = 4;

item_ct1.barrier(sycl::access::fence_space::local_space);
```
Let's move this barrier outside this function, before the call site. Within the context of this function alone, it is not clear why the barrier is required.
```cpp
const ggml_half2 * dm_ptr = reinterpret_cast<const ggml_half2 *>(base + dm_offset);

const float dall = dm_ptr->x();
const float dmin = dm_ptr->y();
```
```diff
-const ggml_half2 * dm_ptr = reinterpret_cast<const ggml_half2 *>(base + dm_offset);
-const float dall = dm_ptr->x();
-const float dmin = dm_ptr->y();
+auto dm_values = *reinterpret_cast<const ggml_half2 *>(base + dm_offset);
+const float dall = dm_values.x();
+const float dmin = dm_values.y();
```
You can fetch the data you need in one read and avoid making twice the number of trips to memory
```cpp
static void dequantize_block_q4_K_reorder(const void * __restrict__ vx, dst_t * __restrict__ yy, uint8_t * scales_local,
                                          const sycl::nd_item<3> & item_ct1, int64_t nb) {
```
I think we can remove the `__restrict__` keyword here. I believe it's a result of translation from the CUDA backend via dpct. `__restrict__` often forces the nvcc compiler to use the L1 read-only cache, hence the usage there, but on our side I don't think it serves any purpose, and we can remove it. Thoughts?
Also, just another comment to cover removing all other usages of `__restrict__` as well, instead of highlighting each one.
It seems `__restrict__` is respected by dpcpp and used to determine whether to apply optimizations, so I think I will keep it.
```diff
@@ -406,6 +416,35 @@ static void dequantize_block_q4_K(const void * __restrict__ vx, dst_t * __restri
 #endif
 }

+template <typename dst_t>
+static void dequantize_block_q4_K_reorder(const void * __restrict__ vx, dst_t * __restrict__ yy, uint8_t * scales_local,
+                                          const sycl::nd_item<3> & item_ct1, int64_t nb) {
```
Also, I think we can start using 1D kernel launches as we start adding kernels manually. 3-dimensional kernels are not a deliberate choice the backend makes, but rather a result of the direct translation of `dim3` in the CUDA backend. A 3D launch does not serve any purpose here.
Similarly for all other usages of 3D `nd_item` as well.
```cpp
                                 const float dmin, uint8_t * __restrict__ scales_local,
                                 const sycl::nd_item<3> & item_ct1, int il, int ir) {
    const int is = 2 * il;
    const int n = 4;
```
```diff
-const int n = 4;
+constexpr int n = 4;
```
```cpp
auto * scales_ptr = qs_ptr + QK_K / 2 * nblocks;
auto * dm_ptr = (sycl::half2 *) (scales_ptr + K_SCALE_SIZE * nblocks);

stream->parallel_for(nblocks, [=](auto i) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
```
```diff
-stream->parallel_for(nblocks, [=](auto i) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
+stream->parallel_for(nblocks, [=](auto i) {
```
IGC is the best judge for selecting the subgroup size for kernels where we do not need a particular sub group size.
```cpp
default:
    GGML_SYCL_DEBUG("reorder_qw() called with unsupported type");
```
`assert` or `std::runtime_error`? I believe either would be much better than silently doing nothing and producing incorrect output, IMO. Do we expect this function to be called in cases where we do not support the reorder?
This PR enables the reorder optimization for the Q4_K layout, similarly to #12858. This branch is based off of @Alcpz's, and until that is merged the easiest way to review it is to look at the diff for 8cbe2c9.
Some performance numbers on Lunar Lake below:

`GGML_SYCL_DISABLE_OPT=0`
build: 52b1622 (5099)

`GGML_SYCL_DISABLE_OPT=1`
build: 52b1622 (5099)