SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824

majing921201 · 2024-10-23T07:50:18Z

For platform compatible, we didn't use device max work group size to launch kernel, and switch to query specific max work group size for kernel by SYCL API. following is our code example

  auto kid = ::sycl::get_kernel_id<KernelClass>();
  auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
      ctx, {dev}, {kid});
  ::sycl::kernel k = kbundle.get_kernel(kid);
  int max_work_group_size =  k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);

We found this usage takes much host overhead in application. we measured one kernel CPU performance here, each API name in table maps example code:

API	get_kernel_id	get_kernel_bundle	get_kernel	get_info
time (us)	0.434	42.481	4.241	1.125

We also file internal jira to track this issue. Can you help evaluate this slow performance.

The text was updated successfully, but these errors were encountered:

fengyuan14 · 2024-10-23T08:16:07Z

intel/torch-xpu-ops#1016

AlexeySachkov · 2024-10-23T08:55:57Z

Hi @majing921201,

We also file internal jira to track this issue. Can you help evaluate this slow performance.

Is your complaint that get_kernel_bundle<executable> is slow, or that your overall program has slowed down? Do you use JIT, or AOT (i.e. do you use -fsycl-targets and what do you pass there?)? Do you pass that kernel bundle you get later into handler::use_kernel_bundle to make sure that it is being re-used by SYCL RT?

To add some background here which will likely be enough for a high-level explanation of this (but not enough to say what exactly happens in your case):

get_kernel_bundle<executable> performs necessary actions to bring device image with kernels you specified up to executable state and that may involve invoking JIT compiler if you are not using AOT. But even with AOT, it is likely that we still have to call some low-level APIs like compile/build program to be able to query the information SYCL RT was asked about.

get_kernel_bundle<executable> is implicitly used under the hood of queue::submit. We do store final executable device image into in-memory cache, so I expect that if you queried a kernel bundle explicitly, then queue::submit should be quicker, because it doesn't need to repeat the said operation anymore, but instead could just grab a result from in-memory cache. However, I'm not familiar enough with SYCL RT to say for sure and there is always risk of some bugs. In any case I expect handler::use_kernel_bundle to be the most performant option in this case, because it should avoid both in-memory cache lookup and repeating device image processing/handling.

majing921201 · 2024-10-23T09:04:32Z

Is your complaint that get_kernel_bundle is slow, or that your overall program has slowed down? Do you use JIT, or AOT (i.e. do you use -fsycl-targets and what do you pass there?)? Do you pass that kernel bundle you get later into handler::use_kernel_bundle to make sure that it is being re-used by SYCL RT?

We used aot with 'pvc' as target. And we didn't pass kernel boudle to handler::use_kernel_bundle, Our current routine usage follows the guide in an internal jira discussion.

majing921201 mentioned this issue Oct 23, 2024

Performance: Nonzero: Worse host overhead compared with IPEX intel/torch-xpu-ops#969

Open

fengyuan14 mentioned this issue Oct 23, 2024

Performance: Host overhead: Severe host overhead in sycl::get_kernel_bundle. intel/torch-xpu-ops#1016

Open

majing921201 changed the title ~~Low performance: Query max work group size for specific kernel by SYCL API~~ Performance: Host overhead: Severe host overhead in sycl::get_kernel_bundle Oct 23, 2024

majing921201 changed the title ~~Performance: Host overhead: Severe host overhead in sycl::get_kernel_bundle~~ SYCL runtime: Severe host overhead in sycl::get_kernel_bundle Oct 23, 2024

AlexeySachkov added the performance Performance related issues label Oct 23, 2024

AlexeySachkov added the confirmed label Oct 23, 2024

majing921201 mentioned this issue Oct 29, 2024

Performance: Reduction: Worse host overhead compared with IPEX intel/torch-xpu-ops#970

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824

SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824

majing921201 commented Oct 23, 2024 •

edited

Loading

fengyuan14 commented Oct 23, 2024

AlexeySachkov commented Oct 23, 2024

majing921201 commented Oct 23, 2024 •

edited

Loading

SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824

SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824

Comments

majing921201 commented Oct 23, 2024 • edited Loading

fengyuan14 commented Oct 23, 2024

AlexeySachkov commented Oct 23, 2024

majing921201 commented Oct 23, 2024 • edited Loading

majing921201 commented Oct 23, 2024 •

edited

Loading

majing921201 commented Oct 23, 2024 •

edited

Loading