SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824

Open
majing921201 opened this issue Oct 23, 2024 · 3 comments
Labels
confirmed performance Performance related issues

Comments

@majing921201

majing921201 commented Oct 23, 2024

For platform compatibility, we do not launch kernels with the device's maximum work-group size; instead we query the kernel-specific maximum work-group size through the SYCL API. Our code looks like this:

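  // ctx and dev are the sycl::context and sycl::device the kernel will run on.
  auto kid = ::sycl::get_kernel_id<KernelClass>();
  auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
      ctx, {dev}, {kid});
  ::sycl::kernel k = kbundle.get_kernel(kid);
  // Kernel-specific limit for this device (the query returns size_t).
  size_t max_work_group_size =
      k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);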

We found that this usage adds significant host overhead in our application. We measured the CPU time of each call for one kernel; each API name in the table corresponds to a call in the example code:

API        get_kernel_id  get_kernel_bundle  get_kernel  get_info
time (us)  0.434          42.481             4.241       1.125
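
For anyone who wants to reproduce per-call numbers like these, here is a minimal sketch of a host-side timing helper. The thread does not say how the figures above were collected, so this is only one plausible approach; the name `average_call_time_us` and the iteration count are hypothetical, and `fn` stands in for whichever API call is being timed (for example a lambda wrapping `get_kernel_bundle`).

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the issue): returns the average wall-clock
// cost of one call to `fn`, in microseconds, over `iterations` calls.
template <typename Fn>
double average_call_time_us(Fn&& fn, std::size_t iterations) {
  if (iterations == 0) {
    return 0.0;  // nothing to measure
  }
  using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
  const auto start = clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = clock::now();
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Averaging over many iterations hides first-call effects such as JIT compilation or cache population, so it may be worth timing the first call separately from the steady state.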

We have also filed an internal Jira ticket to track this issue. Could you help evaluate this performance problem?

@fengyuan14

intel/torch-xpu-ops#1016

@majing921201 majing921201 changed the title Low performance: Query max work group size for specific kernel by SYCL API Performance: Host overhead: Severe host overhead in sycl::get_kernel_bundle Oct 23, 2024
@majing921201 majing921201 changed the title Performance: Host overhead: Severe host overhead in sycl::get_kernel_bundle SYCL runtime: Severe host overhead in sycl::get_kernel_bundle Oct 23, 2024
@AlexeySachkov
Contributor

Hi @majing921201,

> We also file internal jira to track this issue. Can you help evaluate this slow performance.

Is your complaint that get_kernel_bundle<executable> is slow, or that your overall program has slowed down? Do you use JIT, or AOT (i.e. do you use -fsycl-targets and what do you pass there?)? Do you pass that kernel bundle you get later into handler::use_kernel_bundle to make sure that it is being re-used by SYCL RT?

To add some background here which will likely be enough for a high-level explanation of this (but not enough to say what exactly happens in your case):

get_kernel_bundle<executable> performs the actions needed to bring the device image containing the kernels you specified into the executable state, which may involve invoking the JIT compiler if you are not using AOT. Even with AOT, we likely still have to call some low-level APIs, such as compiling or building the program, to be able to query the information the SYCL RT was asked for.

get_kernel_bundle<executable> is implicitly used under the hood of queue::submit. We do store the final executable device image in an in-memory cache, so I expect that if you query a kernel bundle explicitly, queue::submit should then be quicker, because it no longer needs to repeat that operation and can just grab the result from the in-memory cache. However, I'm not familiar enough with the SYCL RT to say for sure, and there is always a risk of bugs. In any case, I expect handler::use_kernel_bundle to be the most performant option here, because it should avoid both the in-memory cache lookup and the repeated device image processing.
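
Separately from the runtime's own in-memory cache, an application could avoid paying the query cost on every launch by remembering the work-group size it got the first time. The following is a minimal sketch of that idea in standard C++, not something proposed in this thread: `WorkGroupSizeCache`, `ExampleKernel`, and the integer `device_index` key are all hypothetical, and `query` stands in for the get_kernel_id / get_kernel_bundle / get_kernel / get_info sequence. Real code would more likely key on the `sycl::device` itself.

```cpp
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical kernel name type, used only to illustrate the template key.
struct ExampleKernel {};

// Hypothetical application-side cache: one table per kernel type, with one
// entry per device index. `query` runs only on the first lookup for a device.
template <typename KernelClass>
class WorkGroupSizeCache {
 public:
  template <typename QueryFn>
  static std::size_t get(int device_index, QueryFn&& query) {
    // The lock is held while `query` runs, so concurrent first lookups are
    // serialized instead of issuing the expensive query twice.
    std::lock_guard<std::mutex> lock(mutex());
    auto& cache = entries();
    auto it = cache.find(device_index);
    if (it == cache.end()) {
      it = cache.emplace(device_index, query(device_index)).first;
    }
    return it->second;
  }

 private:
  // Function-local statics give each KernelClass its own mutex and table.
  static std::mutex& mutex() {
    static std::mutex m;
    return m;
  }
  static std::map<int, std::size_t>& entries() {
    static std::map<int, std::size_t> e;
    return e;
  }
};
```

A call site would look like `WorkGroupSizeCache<ExampleKernel>::get(0, query_fn)`, where `query_fn` performs the real SYCL query. This assumes the value does not change for a given kernel and device during the process lifetime.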

@AlexeySachkov AlexeySachkov added the performance Performance related issues label Oct 23, 2024
@majing921201
Author

majing921201 commented Oct 23, 2024

> Is your complaint that get_kernel_bundle<executable> is slow, or that your overall program has slowed down? Do you use JIT, or AOT (i.e. do you use -fsycl-targets and what do you pass there?)? Do you pass that kernel bundle you get later into handler::use_kernel_bundle to make sure that it is being re-used by SYCL RT?

We use AOT with 'pvc' as the target, and we do not pass the kernel bundle to handler::use_kernel_bundle. Our current usage follows the guidance from an internal Jira discussion.
