This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

optimize opencl kernel building time (still has problem in win10) #34

Open: wants to merge 1 commit into base: master

@liyuming1978 liyuming1978 commented May 31, 2018

Currently, loading the SSD model takes about 5 s on Linux and about 10 s on Windows.
I have optimized the kernel build on Windows; Linux still needs set_exepath() to be added.
I have not tested on Linux, only on Windows (my ISV needs a Windows version), to speed up the loading time.
The status on Windows is still bad. If the kernels are always built,
clCreateBuffer costs 2 s, clGetPlatformIDs 2 s, and clBuildProgram 7 s.

After the optimization, the clBuildProgram cost is gone, but clCreateBuffer takes 9 s...
I use Skylake GT3e and Kaby Lake GT2; the result is the same on both under Win10. The driver is https://downloadmirror.intel.com/27803/a08/win64_24.20.100.6094.exe

The two models are at https://github.com/liyuming1978/openvino_example/tree/master/windows/facedemo/facedemo/model

@liyuming1978
Author

On Win10, I find that CPU usage is high during inference (or while waiting on an infer request); clWaitForEvents seems to be a busy wait.

@liyuming1978
Author

liyuming1978 commented Jun 12, 2018

For the long clCreateBuffer time: got it, there are too many clCreateBuffer calls (>30000). The per-call speed is the same as with CL_MEM_USE_HOST_PTR, but there are too many calls. I will try drop 7.0, since it has memory optimization.

clGetPlatformIDs is indeed slow. I have also tried my own code and clcaffe, with the same result (200+ ms), but it seems that Windows is slow on the first OpenCL call.
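
To illustrate the call-count point above: one way to reduce the number of allocations is to pack many small buffers into one large allocation and address each piece by offset. The sketch below only computes such a layout in plain C++; it does not call OpenCL, and it is not taken from the patch or from the library. The names `PackedLayout`, `align_up`, and `pack_buffers` are hypothetical, and the alignment value is an assumption (an OpenCL program would query it from the device).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one packed allocation: the total size plus
// the byte offset at which each logical buffer starts.
struct PackedLayout {
    std::size_t total_bytes = 0;
    std::vector<std::size_t> offsets;
};

// Round `value` up to the next multiple of `alignment` (alignment > 0).
inline std::size_t align_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Lay out all buffers back to back, each starting on an aligned offset,
// so that a single allocation can hold every one of them.
inline PackedLayout pack_buffers(const std::vector<std::size_t>& sizes,
                                 std::size_t alignment) {
    PackedLayout layout;
    layout.offsets.reserve(sizes.size());
    std::size_t cursor = 0;
    for (std::size_t size : sizes) {
        cursor = align_up(cursor, alignment);
        layout.offsets.push_back(cursor);
        cursor += size;
    }
    layout.total_bytes = cursor;
    return layout;
}
```

With a layout like this, the number of allocations no longer grows with the number of groups; whether that helps in practice depends on how the kernels consume the weights and bias.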

@liyuming1978
Copy link
Author

liyuming1978 commented Jun 22, 2018

A conclusion:

  1. Long clCreateBuffer and clReleaseBuffer time for MobileNet:
    OpenVINO calls clCreateBuffer for each group. If the group count is 512, it creates 512 * 2 buffers (weights and bias), which is too many calls.

  2. Long clBuildProgram time (must change to a saved binary).

  3. High CPU cost (it is time to optimize the OpenCL driver or GPU driver):
    queue.finish, event.wait, and enqueueMapBuffer are all busy waits, which cause high CPU cost (>75%!). Using set_ocl_callback to call setEvent also causes high CPU cost in the Intel driver, so I can only change base_event::wait_impl() to use
    while (!is_set_impl()) { ::Sleep(1);} in place of _event.wait().
    The app CPU cost now decreases from 50% to 10%, but the system CPU cost is still high (>17%). I guess the system CPU cost comes from the busy wait in the OpenCL driver.
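
The sleep-and-poll loop described in item 3 can be sketched in portable C++ as follows. This is a minimal illustration, not the patched clDNN code: `polled_event` is a hypothetical stand-in for the event class, and `std::this_thread::sleep_for` is used here in place of the Win32 `::Sleep(1)` call.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for an event object; not the library's class.
class polled_event {
public:
    // Called by whoever completes the work (for example, a callback).
    void set() { _set.store(true, std::memory_order_release); }

    bool is_set() const { return _set.load(std::memory_order_acquire); }

    // Sleep between checks instead of spinning, so the waiting thread
    // gives up the CPU while the event is still pending.
    void wait(std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
        while (!is_set()) {
            std::this_thread::sleep_for(interval);
        }
    }

    // Same loop with a deadline. Returns true if the event was set in time.
    bool wait_for(std::chrono::milliseconds timeout,
                  std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        while (!is_set()) {
            if (std::chrono::steady_clock::now() >= deadline) {
                return false;
            }
            std::this_thread::sleep_for(interval);
        }
        return true;
    }

private:
    std::atomic<bool> _set{false};
};
```

The trade-off is latency: completion is noticed only at the next poll, so each wait can take up to one sleep interval (plus scheduler granularity) longer than a busy wait would.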

@MichalMrozek

Hello @liyuming1978, I am from the OpenCL driver team.
Our code works as you think: it busy-waits for some time, then switches to a non-busy wait.
This is configurable with the following registry keys:

DECLARE_DEBUG_VARIABLE(int32_t, OverrideEnableKmdNotify, -1, "-1: dont override, 0: disable, 1: enable")
DECLARE_DEBUG_VARIABLE(int32_t, OverrideKmdNotifyDelayMicroseconds, -1, "-1: dont override, 0: infinite timeout, >0: timeout in microseconds")
DECLARE_DEBUG_VARIABLE(int32_t, OverrideEnableQuickKmdSleep, -1, "-1: dont override, 0: disable, 1: enable. It works only when Kmd Notify is enabled.")
DECLARE_DEBUG_VARIABLE(int32_t, OverrideQuickKmdSleepDelayMicroseconds, -1, "-1: dont override, 0: infinite timeout, >0: timeout in microseconds")
DECLARE_DEBUG_VARIABLE(int32_t, OverrideEnableQuickKmdSleepForSporadicWaits, -1, "-1: dont override, 0: disable, 1: enable. It works only when QuickKmdSleep is enabled.")
DECLARE_DEBUG_VARIABLE(int32_t, OverrideDelayQuickKmdSleepForSporadicWaitsMicroseconds, -1, "-1: dont override, >0: timeout in microseconds")

It is also platform dependent, so different SKUs have different timers.
It looks like in your case the wait is not long enough to reach the non-busy waiting loop.
If you are interested in building the OpenCL driver, I can explain in more depth how to set those fields.

But be warned: this may decrease performance, due to completion latencies in non-busy mode.

For the long compilation time, we have an experimental feature in the driver called cl_cache.
It is not production quality yet (use with caution), but if you are willing to try it, it is very easy to do so.
Just create a cl_cache directory in the directory where you execute the app.
It should be automatically populated with binaries, which will be reused in later runs.
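
For comparison, the "saved binary" idea can also be done on the application side. The sketch below shows only the file-cache part in plain C++: derive a stable file name from the kernel source and build options, then store or load the binary bytes. It is an assumption-laden illustration, not how the driver's cl_cache works internally; `fnv1a`, `cache_file_name`, `store_binary`, and `load_binary` are hypothetical helpers. In an OpenCL program the bytes would typically come from `clGetProgramInfo` with `CL_PROGRAM_BINARIES` and be fed back through `clCreateProgramWithBinary`.

```cpp
#include <cstdint>
#include <fstream>
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// 64-bit FNV-1a hash. Unlike std::hash, the result is the same on every run,
// which is what a cache key that lives on disk needs.
inline std::uint64_t fnv1a(const std::string& text) {
    std::uint64_t hash = 14695981039346656037ULL;
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 1099511628211ULL;
    }
    return hash;
}

// Hypothetical key scheme: hash the source and the build options together,
// separated by a NUL byte so that ("ab", "c") and ("a", "bc") differ.
inline std::string cache_file_name(const std::string& source,
                                   const std::string& options) {
    std::string key = source;
    key.push_back('\0');
    key += options;
    std::ostringstream name;
    name << std::hex << std::setw(16) << std::setfill('0') << fnv1a(key) << ".bin";
    return name.str();
}

// Write the compiled binary to `path`. Returns false on I/O failure.
inline bool store_binary(const std::string& path, const std::vector<char>& bytes) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}

// Read a cached binary into `bytes`. Returns false if the file is missing.
inline bool load_binary(const std::string& path, std::vector<char>& bytes) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;
    }
    bytes.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}
```

A real cache key would probably also need to include the device and driver version, since a binary built for one GPU or driver may not load on another.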

@liyuming1978
Author

@MichalMrozek cl_cache works! That is good for my patch.

For DECLARE_DEBUG_VARIABLE..., does it need a rebuild of the OpenCL driver, or can I just set the registry keys? I use a quick, small model to get the best performance. I also notice that it only happens on Windows, not on Ubuntu.

@MichalMrozek

Windows is a bit different.
There, non-busy wait is used only when you are on battery. When the driver is on AC power, it never goes to non-busy wait (due to performance constraints); there is a huge penalty on Windows for going to non-busy wait.

Unfortunately, Windows builds are not available from the open source code, so you will not be able to try those flags out, and the registry flags do not work with official drivers because they are disabled there.

@liyuming1978
Author

liyuming1978 commented Jun 22, 2018

@MichalMrozek how do I enable cl_cache on Ubuntu? I use a NUC on Windows, so it is always on AC. My email is [email protected] :)

@MichalMrozek

cl_cache works the same way on Linux and Windows.
For Windows and non-busy waits, I suggest using throttle hints: create the command queue with CL_QUEUE_THROTTLE_LOW_KHR (from the cl_khr_throttle_hints extension). This informs the driver that power needs to be saved here, and we will always initiate a non-busy wait in the driver.
