
sycl : Implemented reorder Q4_K mmvq #13109


Open
wants to merge 4 commits into master

Conversation

sgeor255 (Contributor):

This PR enables the reorder optimization for the Q4_K layout, similarly to #12858. This branch is based on @Alcpz's, so until that is merged the easiest way to review it is to look at the diff for 8cbe2c9.

Some performance numbers on Lunar Lake are below:

  • Q4_K reorder with GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | pp512 | 1586.19 ± 69.35 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | tg128 | 41.23 ± 0.43 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | pp512 | 550.65 ± 1.35 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | tg128 | 17.67 ± 1.05 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | pp512 | 616.41 ± 12.21 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | tg128 | 28.57 ± 0.32 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | pp512 | 508.14 ± 1.50 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | tg128 | 13.75 ± 0.12 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | pp512 | 827.73 ± 26.59 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | tg128 | 21.45 ± 0.17 |

build: 52b1622 (5099)

  • Q4_K reorder with GGML_SYCL_DISABLE_OPT=1
| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | pp512 | 1576.79 ± 80.93 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 8 | none | tg128 | 36.27 ± 0.43 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | pp512 | 551.82 ± 1.63 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 8 | none | tg128 | 12.24 ± 1.19 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | pp512 | 586.64 ± 1.65 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 8 | none | tg128 | 24.04 ± 0.41 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | pp512 | 509.51 ± 0.87 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | tg128 | 10.18 ± 0.04 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | pp512 | 825.29 ± 26.93 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 8 | none | tg128 | 17.83 ± 0.05 |

build: 52b1622 (5099)

  • TODO
    • Performance on BMG and ARC

@Alcpz changed the title from "sycl : Implemented reorder Q4_0 mmvq" to "sycl : Implemented reorder Q4_K mmvq" on Apr 25, 2025
@github-actions bot added the ggml and SYCL labels on Apr 25, 2025
@@ -3636,22 +3664,65 @@ static void reorder_qw(char *data_device, const int ncols, const int nrows,
sycl::free(tmp_buf, *stream);
}

static void reorder_qw(ggml_tensor * src0, dpct::queue_ptr stream) {
static void reorder_qw_q4_k(char * data_device, size_t size, size_t offset, dpct::queue_ptr stream) {
Collaborator:

Question: Is there a specific reason data_device is declared as a char* instead of a uint8_t*, especially considering it's later cast to uint8_t* as qs_ptr anyway?

@NeoZhangJianyu (Collaborator) left a comment:

  1. Could you share the GPU type used for the test results above?
  2. Have you tested the PR with the local UT?
  3. Could you check the detailed output of a Q4_K LLM?
     I guess the output may be different from that of the legacy code.

// Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization
// is enabled takes precedence over DMMV, the current if-else implementation
// requires disabling DMMV if both conditions are met
|| (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) {
Collaborator:

I have the same comment and concern:
this change will impact the code path below, which could lead to wrong results.

I suggest this PR only optimize the mmvq() function.
You could open another PR that optimizes the code path, for example by changing this condition.
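To make the precedence described in the hunk's comment concrete, here is a minimal, self-contained sketch; the flags and the pick_path helper are hypothetical stand-ins for the real llama.cpp dispatch conditions, not the actual code:

```cpp
// Illustrative only: models "MMVQ takes precedence over DMMV when the reorder
// optimization is enabled", so DMMV has to step aside when both would apply.
#include <cstdio>

enum class mul_mat_path { DMMV, MMVQ, OTHER };

static mul_mat_path pick_path(bool use_dmmv, bool use_mmvq, bool reorder, bool reorder_mmvq_ok) {
    const bool prefer_reorder_mmvq = reorder && reorder_mmvq_ok;
    if (use_dmmv && !prefer_reorder_mmvq) {   // DMMV only wins if the reordered MMVQ does not apply
        return mul_mat_path::DMMV;
    }
    if (use_mmvq || prefer_reorder_mmvq) {    // the extra "|| (reorder && supported)" clause
        return mul_mat_path::MMVQ;
    }
    return mul_mat_path::OTHER;
}

int main() {
    // Both heuristics hold and the tensor is reordered: MMVQ is chosen.
    std::printf("path = %d\n", static_cast<int>(pick_path(true, true, true, true)));
    return 0;
}
```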

@@ -2968,14 +2994,17 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
// KQ + KQV multi-batch
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
} else if (use_dequantize_mul_mat_vec) {
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
constexpr bool convert_src1_to_q8_1 = false;
Collaborator:

Could you follow the solution of PR #13003?
It fixed the base issue of the Q4_0 reorder.

@NeoZhangJianyu (Collaborator):

@sgeor255
Here is a discussion about Q4_K. #13120 (reply in thread)
Could you test that model with this PR?
If the result is good, could you reply with your test result?

We need to promote the SYCL backend in related cases. :)

const int is = 2 * il;
const int n = 4;

item_ct1.barrier(sycl::access::fence_space::local_space);
Contributor:

Let's move this barrier outside this function, to just before the callsite.
Within this function there is no context that explains why the barrier is required.
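A minimal sketch of the pattern being suggested, with hypothetical names rather than the PR's actual kernels: the helper only fills local memory, and the barrier is issued at the callsite, where its purpose is visible:

```cpp
#include <sycl/sycl.hpp>

// Cooperatively fills local memory; deliberately does not hide a barrier.
static void load_scales_to_local(const uint8_t * src, uint8_t * scales_local,
                                 const sycl::nd_item<1> & it, size_t n) {
    for (size_t i = it.get_local_id(0); i < n; i += it.get_local_range(0)) {
        scales_local[i] = src[i];
    }
}

int main() {
    sycl::queue q;
    constexpr size_t n_scales = 12, wg = 32;
    uint8_t * scales = sycl::malloc_shared<uint8_t>(n_scales, q);
    uint8_t * out    = sycl::malloc_shared<uint8_t>(n_scales, q);
    for (size_t i = 0; i < n_scales; ++i) scales[i] = static_cast<uint8_t>(i);

    q.submit([&](sycl::handler & cgh) {
        sycl::local_accessor<uint8_t, 1> scales_local(sycl::range<1>(n_scales), cgh);
        cgh.parallel_for(sycl::nd_range<1>(sycl::range<1>(wg), sycl::range<1>(wg)),
                         [=](sycl::nd_item<1> it) {
            load_scales_to_local(scales, &scales_local[0], it, n_scales);
            // Barrier at the callsite: all work-items must see the filled local
            // buffer before any of them reads it below.
            it.barrier(sycl::access::fence_space::local_space);
            const size_t lid = it.get_local_id(0);
            if (lid < n_scales) {
                out[lid] = scales_local[lid];
            }
        });
    }).wait();

    sycl::free(scales, q);
    sycl::free(out, q);
    return 0;
}
```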

Comment on lines +436 to +439
const ggml_half2 * dm_ptr = reinterpret_cast<const ggml_half2 *>(base + dm_offset);

const float dall = dm_ptr->x();
const float dmin = dm_ptr->y();
@AD2605 (Contributor) commented on Apr 28, 2025:

Suggested change:
-   const ggml_half2 * dm_ptr = reinterpret_cast<const ggml_half2 *>(base + dm_offset);
-   const float dall = dm_ptr->x();
-   const float dmin = dm_ptr->y();
+   auto dm_values = *reinterpret_cast<const ggml_half2 *>(base + dm_offset);
+   const float dall = dm_values.x();
+   const float dmin = dm_values.y();

You can fetch the data you need in one read and avoid making twice the number of trips to memory

Comment on lines +420 to +421
static void dequantize_block_q4_K_reorder(const void * __restrict__ vx, dst_t * __restrict__ yy, uint8_t * scales_local,
const sycl::nd_item<3> & item_ct1, int64_t nb) {
Contributor:

I think we can remove the __restrict__ keyword here. I believe it's a result of translation from the CUDA backend via dpct. __restrict__ often forces the nvcc compiler to use the L1 read-only cache, hence its usage there, but on our side I don't think it serves any purpose, so we can remove it.
Thoughts?

Contributor:

Also, just one more comment to cover removing all other usages of __restrict__ as well, instead of highlighting each one.

Contributor (Author):

It seems __restrict__ is respected by dpcpp and used to determine whether to apply optimizations, so I think I will keep it.

@@ -406,6 +416,35 @@ static void dequantize_block_q4_K(const void * __restrict__ vx, dst_t * __restri
#endif
}

template <typename dst_t>
static void dequantize_block_q4_K_reorder(const void * __restrict__ vx, dst_t * __restrict__ yy, uint8_t * scales_local,
const sycl::nd_item<3> & item_ct1, int64_t nb) {
Contributor:

Also, I think we can start using 1D kernel launches as we start adding kernels manually. 3-dimensional kernels are not a deliberate choice by the backend but rather the result of a direct translation of dim3 from the CUDA backend.
A 3D launch does not serve any purpose here.

Contributor:

Similarly for all other usages of 3D nd_item as well
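A minimal sketch of what the suggestion amounts to, using a hypothetical trivial kernel rather than the PR's dequantize kernel: the dpct-style 3D launch and the equivalent 1D launch map work-items identically, but the 1D form has simpler indexing:

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::property::queue::in_order()};
    constexpr size_t nblocks = 1024, wg = 64;
    float * data = sycl::malloc_shared<float>(nblocks * wg, q);

    // dpct-translated style: CUDA dim3 becomes a 3D nd_range with two unit
    // dimensions, and the "real" index lives in dimension 2.
    q.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, nblocks * wg), sycl::range<3>(1, 1, wg)),
                   [=](sycl::nd_item<3> it) {
                       data[it.get_global_id(2)] = 1.0f;
                   });

    // Equivalent 1D launch: same total work-items and work-group size.
    q.parallel_for(sycl::nd_range<1>(sycl::range<1>(nblocks * wg), sycl::range<1>(wg)),
                   [=](sycl::nd_item<1> it) {
                       data[it.get_global_id(0)] += 1.0f;
                   });

    q.wait();
    sycl::free(data, q);
    return 0;
}
```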

const float dmin, uint8_t * __restrict__ scales_local,
const sycl::nd_item<3> & item_ct1, int il, int ir) {
const int is = 2 * il;
const int n = 4;
Contributor:

Suggested change:
-   const int n = 4;
+   constexpr int n = 4;

auto * scales_ptr = qs_ptr + QK_K / 2 * nblocks;
auto * dm_ptr = (sycl::half2 *) (scales_ptr + K_SCALE_SIZE * nblocks);

stream->parallel_for(nblocks, [=](auto i) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
Contributor:

Suggested change:
-   stream->parallel_for(nblocks, [=](auto i) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
+   stream->parallel_for(nblocks, [=](auto i) {

IGC is the best judge for selecting the subgroup size for kernels where we do not need a particular sub group size.
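Separately, the pointer arithmetic in the hunk above implies a struct-of-arrays layout for the reordered buffer: all packed quants first, then all scale bytes, then all (d, dmin) pairs. A minimal sketch of that offset computation, assuming the standard ggml constants QK_K = 256 and K_SCALE_SIZE = 12 (the struct and function names below are illustrative only):

```cpp
// Illustrative sketch only: computes the region offsets suggested by the
// scales_ptr / dm_ptr arithmetic in the diff, for a buffer of nblocks Q4_K blocks.
#include <cstddef>
#include <cstdio>

constexpr size_t QK_K         = 256;  // values per Q4_K block (standard ggml constant)
constexpr size_t K_SCALE_SIZE = 12;   // scale bytes per Q4_K block (standard ggml constant)

struct q4_k_reordered_offsets {
    size_t qs;      // start of packed 4-bit quants, QK_K / 2 bytes per block
    size_t scales;  // start of packed scales, K_SCALE_SIZE bytes per block
    size_t dm;      // start of packed (d, dmin) half2 pairs, one per block
};

static q4_k_reordered_offsets layout_offsets(size_t nblocks) {
    q4_k_reordered_offsets o;
    o.qs     = 0;
    o.scales = o.qs + QK_K / 2 * nblocks;
    o.dm     = o.scales + K_SCALE_SIZE * nblocks;
    return o;
}

int main() {
    const auto o = layout_offsets(4096);
    std::printf("qs @ %zu, scales @ %zu, dm @ %zu\n", o.qs, o.scales, o.dm);
    return 0;
}
```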

Comment on lines +3711 to +3712
default:
GGML_SYCL_DEBUG("reorder_qw() called with unsupported type");
Contributor:

assert or std::runtime_error? I believe either would be much better than silently doing nothing and producing incorrect output, IMO.
Do we expect this function to be called in cases where we do not support the reorder?
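A minimal sketch of the kind of hard failure being suggested, using illustrative types and names (the real code would presumably use ggml's own type enum and its assert/abort helpers):

```cpp
#include <stdexcept>
#include <string>

enum class reorder_type { Q4_0, Q4_K, UNSUPPORTED_EXAMPLE };

static void reorder_qw_sketch(reorder_type t) {
    switch (t) {
        case reorder_type::Q4_0:
        case reorder_type::Q4_K:
            // ... perform the reorder for supported types ...
            break;
        default:
            // Fail loudly: an unsupported call becomes a visible bug instead of a
            // silent no-op that later shows up as wrong model output.
            throw std::runtime_error("reorder_qw() called with unsupported type " +
                                     std::to_string(static_cast<int>(t)));
    }
}

int main() {
    reorder_qw_sketch(reorder_type::Q4_K);                   // supported: no-op here
    // reorder_qw_sketch(reorder_type::UNSUPPORTED_EXAMPLE); // would throw
    return 0;
}
```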

Labels
ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)
5 participants