Start enabling the "generic" OpenCL vendor #2019
base: main
Conversation
No functional change, but this puts the base classes first.
We're going to want these kernels for the generic OpenCL support anyway.
NFC but shows where the fault lines are.
XXX need to go back and promote INTEL_REF to REF
@@ -146,6 +146,7 @@ set(DNNL_ENABLE_PRIMITIVE_GPU_ISA "ALL" CACHE STRING
     at build time. Regardless of value chosen, reference OpenCL-based
     implementations will always be available. Valid values:
     - ALL (the default). Includes all ISA to be enabled.
+    - NONE. Includes no ISAs, just the generic kernels.
(random spot)
Thank you for taking the effort to enable a generic vendor for the OpenCL runtime. When it comes to external contributions that impact some aspects of the library's architecture, a formal proposal (RFC) is required. Please make yourself familiar with the contribution guidelines and the RFC process.
While going through the changes in this pull request I found a few things that have to be addressed. Your proposal is expected to cover them.
- According to the GPU code organization structure, the generic OpenCL kernels should reside in `gpu/generic/ocl`.
- The OpenCL primitive implementations that you marked as generic may be generic at this point, but there is absolutely no guarantee that they will stay that way. The right way to make them generic is to move them to `gpu/generic/ocl` and enable them for the Intel and Generic vendors.
- The library architecture requires that one engine per runtime (CPU and GPU) must be enabled. Currently, there is only one OpenCL engine, and it is Intel-specific. When `DNNL_GPU_VENDOR` is `GENERIC` and `DNNL_GPU_RUNTIME` is `OCL` there are no engines that could be used, therefore a new engine for the generic vendor and OpenCL runtime has to be introduced.
- There shouldn't be any vendor-specific compile-time checks in a vendor-specific space. All code that resides in `gpu/intel` assumes that `DNNL_GPU_VENDOR` is `INTEL`.
- When `DNNL_GPU_VENDOR` is `INTEL` and `DNNL_GPU_RUNTIME` is `SYCL`, the `xpu/ocl` code must be enabled (the new condition that you introduced seems to break this rule).
- I think the `DNNL_ENABLE_PRIMITIVE_GPU_ISA` option should be enabled (and defined) only for the Intel vendor and should be ignored otherwise.
- USM is not part of the OpenCL standard, and therefore we cannot assume that all OpenCL vendors support it. The USM utility functions must be conditionally enabled for the vendors that have the support (see the sketch after this list).
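To make the USM point concrete, here is a minimal sketch of a runtime capability check, assuming only the standard OpenCL headers. The helper name is invented and is not part of the library; a real build would more likely pair such a check with a compile-time vendor guard driven by the DNNL_GPU_VENDOR option.

```cpp
// Sketch only: gate USM usage at runtime by checking the device's extension
// list for Intel's USM extension. The function name is hypothetical.
#include <CL/cl.h>

#include <string>
#include <vector>

bool device_supports_intel_usm(cl_device_id dev) {
    size_t sz = 0;
    if (clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, nullptr, &sz) != CL_SUCCESS)
        return false;
    std::vector<char> ext(sz);
    if (clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sz, ext.data(), nullptr)
            != CL_SUCCESS)
        return false;
    // USM is vendor-specific; only use the USM helpers when the extension
    // is actually reported by the device.
    return std::string(ext.data()).find("cl_intel_unified_shared_memory")
            != std::string::npos;
}
```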
As for enabling OpenCL kernels for CPU, we don't have any plans to do that, and the library doesn't have an architecture for it. If you are interested in that, you may come up with an architecture design and publish a proposal.
And the last thing: we plan to enable the generic vendor for the SYCL runtime in the coming months. As an option, you could wait and see what that design looks like to get a better understanding of what it should look like for OpenCL.
@@ -281,13 +282,13 @@ endif()

 set(DNNL_GPU_VENDOR "NONE" CACHE STRING
     "When DNNL_GPU_RUNTIME is not NONE DNNL_GPU_VENDOR specifies target GPU
-    vendor for GPU engines. Can be INTEL (default), NVIDIA or AMD.")
+    vendor for GPU engines. Can be INTEL (default), GENERIC, NVIDIA or AMD.")

 if(NOT DNNL_GPU_RUNTIME STREQUAL "NONE" AND DNNL_GPU_VENDOR STREQUAL "NONE")
Random spot to continue the discussion from #2044:
> One thing I noted was that the list of potential kernels for a primitive is fixed at compilation time, based purely on the supported engines.

This isn't quite accurate; the list is of potential primitive implementations. As things currently stand, each implementation contains its own dispatcher, which can implement additional logic, including calling different OpenCL kernels or even different primitive implementations.
> I could easily imagine that iGPU and dGPU would want different sort orders entirely

We have not seen that need in practice. Generally, performance differences like this end up being handled by the dispatching logic within the primitive implementation.
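As an illustration of dispatching logic living inside a primitive implementation, here is a toy sketch that picks a kernel variant from device properties. The variant names, threshold, and choice of queries are invented for illustration and are not taken from the library.

```cpp
// Toy per-implementation dispatch: choose a kernel variant based on whether
// the device looks like an integrated part (shared host memory, few CUs).
#include <CL/cl.h>

enum class sort_variant { bitonic_local, radix_global };

sort_variant pick_sort_variant(cl_device_id dev) {
    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(unified),
            &unified, nullptr);
    cl_uint compute_units = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units),
            &compute_units, nullptr);
    // Hypothetical heuristic: small integrated parts take the local-memory
    // bitonic path, large discrete parts take the global radix path.
    return (unified == CL_TRUE || compute_units < 32)
            ? sort_variant::bitonic_local
            : sort_variant::radix_global;
}
```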
> I admit I don't love depending on a particular CLC compiler.

In general, most of the attributes we rely on enable generic programming in C, such as function overloading and automatic type inference. This was chosen to enable better type safety, as the alternative is using macros everywhere.
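For readers unfamiliar with those attributes, the sketch below shows the kind of construct being referred to: clang's `overloadable` attribute, accepted by clang-based OpenCL C compilers, allows type-based overloads in place of macros (which is also where the dependence on a particular CLC compiler comes from). The kernel source here is invented for illustration and is not one of the library's kernels.

```cpp
// Illustrative OpenCL C source embedded as a C++ string literal.
static const char *kernel_src = R"CLC(
// Overloads resolved by argument type; without the attribute, plain C would
// force a macro-based approach instead.
float __attribute__((overloadable)) load_val(__global const float *p, int i) {
    return p[i];
}
float __attribute__((overloadable)) load_val(__global const int *p, int i) {
    return convert_float(p[i]);
}

__kernel void scale_f32(__global const float *src, __global float *dst,
        float alpha) {
    int i = get_global_id(0);
    dst[i] = alpha * load_val(src, i);
}
)CLC";
```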
> What tensions are you finding, and how are you trying to address them?

There are a lot of details to this question, and I will attempt to give a good summary here. To begin with, we currently use two dispatching models within primitive implementations. The first is based on hard-coded heuristics (example). The second uses a performance model and an implementation list, where the best-scoring implementation under the model is used. All of the OpenCL kernels currently rely on the hard-coded heuristics. The biggest issue is that these dispatchers are fragile: any change (be it kernel, compiler, runtime, etc.) can cause the heuristics/model to become inaccurate. We have considered adding runtime tuning to avoid this, for example #1764, but concluded it is either too computationally expensive or requires a new API that cannot be used by customers. For the model-based implementations, we generally need a tool to fit the model to observed performance, as quickly and accurately predicting performance from the hardware SKU is virtually impossible. As such, updates are relatively easy to handle; we just need to retrain the model (although this does induce code churn on the kernel/model database). On the other hand, hard-coded heuristics do not scale and require developer intervention.
Coming back to your goal of sharing OpenCL implementations: to make such sharing feasible, we would need dispatching models to be easily generated after implementation changes. That requires one of the following: switching the current OpenCL kernels to performance-based models, inventing a method to fit the (currently hard-coded) heuristics (at which point we are effectively using machine learning to optimize machine learning), or introducing a new methodology. All known methods likely require performance information from the target devices, yet such model information is unavailable in the generic context, so a method to handle this is required, be it using a default model or somehow training a custom model. I expect this to be a lot of work (just forking the existing implementations would be easier), so I guess this reduces to the question of whether the improved code sharing is worth the effort.
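The contrast between the two dispatch styles can be sketched with invented types; this is not oneDNN's interface, only an illustration of hard-coded heuristics versus selecting the best-scoring candidate under a fitted performance model.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct problem_desc { std::size_t m, n, k; };

struct impl_candidate {
    const char *name;
    // Fitted model: predicts a cost (e.g., estimated time) for a problem.
    std::function<double(const problem_desc &)> predicted_cost;
};

// Style 1: hard-coded heuristic. Simple, but fragile under kernel, compiler,
// or hardware changes, and it requires developer intervention to update.
const char *dispatch_by_heuristic(const problem_desc &p) {
    if (p.m * p.n < 4096) return "small_tile_kernel";
    return "large_tile_kernel";
}

// Style 2: score every candidate under a performance model and take the
// best. Updating means retraining the model rather than editing branches.
const char *dispatch_by_model(
        const std::vector<impl_candidate> &impls, const problem_desc &p) {
    auto best = std::min_element(impls.begin(), impls.end(),
            [&](const impl_candidate &a, const impl_candidate &b) {
                return a.predicted_cost(p) < b.predicted_cost(p);
            });
    return best == impls.end() ? nullptr : best->name;
}
```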
Thank you for the background; that's very helpful and certainly gives me some things to ponder. But one brief thing I do want to touch on:

> so I guess this reduces to the question of whether the improved code sharing is worth the effort.
My sense is that there are likely multiple layers at which sharing might make sense. To draw an analogy from experience in Mesa, some classes of hardware commonly lack the same API features, so (for example) once one driver emulates geometry shaders via compute shaders, that emulation tends to get reused on other hardware with the same limitation. I think there's already some level of reusability in the parameterization of the existing CL kernels for, e.g., GRF size that would map reasonably to other hardware. I think there is likely some level of reuse that makes sense for automatic dispatch tuning, even if that would presently be an open-ended research project.
But at a more basic level, the acts of enumerating devices, creating contexts and queues, fencing resources, and dispatching work do not fundamentally change across OpenCL device types. And, more importantly, OpenCL already provides interop APIs for managing those operations across devices and platforms. I feel like at least that much code sharing is self-evidently worth the effort: even if different sets of CL devices end up with widely divergent kernels and schedulers, they will at least be able to communicate with each other.
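For reference, the vendor-neutral plumbing being described amounts to roughly the following sketch, assuming an OpenCL 2.0+ platform (for clCreateCommandQueueWithProperties; older devices would use clCreateCommandQueue). Error handling and lifetime management are trimmed.

```cpp
// Vendor-neutral: enumerate platforms and GPU devices, create a context and
// an in-order command queue on the first GPU found.
#include <CL/cl.h>

#include <vector>

cl_command_queue make_queue_for_first_gpu() {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, nullptr, &nplat);
    std::vector<cl_platform_id> plats(nplat);
    clGetPlatformIDs(nplat, plats.data(), nullptr);

    for (cl_platform_id plat : plats) {
        cl_device_id dev;
        cl_uint ndev = 0;
        if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, &ndev)
                        != CL_SUCCESS
                || ndev == 0)
            continue;
        cl_int err = CL_SUCCESS;
        cl_context ctx
                = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
        if (err != CL_SUCCESS) continue;
        cl_command_queue queue
                = clCreateCommandQueueWithProperties(ctx, dev, nullptr, &err);
        if (err == CL_SUCCESS) return queue; // a real engine would also keep ctx
        clReleaseContext(ctx);
    }
    return nullptr;
}
```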
This adds a GENERIC vendor to the set of choices for OpenCL, and treats it as the Intel OpenCL driver minus the gen-specific jit kernels. It still effectively requires the NEO driver at the moment, but the goal is to run on any OpenCL device that supports the necessary OpenCL version and extensions. It's not working yet but it's not far off:
The segfaults are where I killed the test because it ran for over an hour without completing. Some random observations and thoughts about where I'd like to go with this:
src/xpu
Feedback welcomed.