diff --git a/rfcs/20240325-reorganize-gpu/README.md b/rfcs/20240325-reorganize-gpu/README.md new file mode 100644 index 00000000000..8bf75ebd67b --- /dev/null +++ b/rfcs/20240325-reorganize-gpu/README.md @@ -0,0 +1,804 @@ +# Refactor GPU Architecture + +## Background + +The current design of the library architecture responsible for GPU was +built with Intel in mind and therefore is Intel centric. Because of that, +the basic GPU abstractions (e.g. `compute_engine`, etc) are tightly tied +to OpenCL and nGEN and other Intel specifics such as information about +device architecture, stepping, etc. Such design had been working fine +up until support for NVIDIA (and later AMD) GPUs was introduced. Currently, +the NVIDIA and AMD specific abstractions are built on top of the basic GPU +abstractions and therefore have dependencies on OpenCL and nGEN even though +there is no need in them. Furthermore, oneDNN now has generic SYCL +kernels that can be used on a variety of different architectures +that are supported by the SYCL ecosystem. The SYCL kernels also use +abstractions that are built on top of the basic GPU abstractions and +therefore have the same issue. + +This RFC proposes a new organization of GPU kernels and abstractions +to clearly separate independent functionality, get rid of unnecessary +dependencies and have a flexible enough architecture to make adding support +for new vendors easier. + +## Proposal + +Reorganize the GPU kernels and abstractions according the the following +schema: Vendor / Technology + +With this schema the GPU directory will have subdirectories that +correspond to the vendors: `intel`, `nvidia`, `amd`, `generic`, etc. +Each of the subdirectories may have technology specific sub-subdirectories: +`sycl`, `ocl`, `jit`, etc. + +Pros: +* The schema provides enough flexibility to enable new vendors and extend +the already supported ones +* Clustering the functionality and abstractions around vendors is +convenient as a vendor can share the same functionality/abstractions across +different technologies (e.g. the compute layer for Intel vendor) +* Currently, functionality and abstractions are fully or partially +clustered around vendors therefore the new schema should not cause a lot of +confusion among the developers, it should also positively affect the +implementation cost +* The schema also provides sufficient configurability, e.g +the generic vendor can be enabled along with nvidia or amd, or it can be +enabled individually + +```bash +├── common/ # Common files for the library (e.g. API, basic abstractions, etc). +├── xpu/ # xpu stands for Heterogeneous Runtime. All vendor agnostic code for runtimes resides here. +│ ├── sycl/ # Vendor agnostic code for SYCL runtime. +│ ├── ocl/ # Vendor agnostic code for OpenCL runtime. +│ └── ... +├── cpu/ # CPU code. +│ ├── sycl/ # CPU specific SYCL code. +│ └── ... +└── gpu/ # GPU code. + ├── intel/ # Intel vendor specific code. + │ ├── compute/ # Compute layer abstractions. + │ ├── ocl/ # OpenCL code. + │ ├── jit/ # JIT (nGEN) code. + │ ├── sycl/ # SYCL code. + │ │ ├── l0/ # Level zero backend specific code. + │ │ ├── ocl/ # OpenCL backend specific code. + │ │ └── ... + │ └── ... + ├── nvidia/ # NVIDIA vendor specific code. + ├── amd/ # AMD vendor specific code. + └── generic/ # Generic (vendor agnostic) GPU kernels. + ├── sycl/ # SYCL kernels. + └── ... +``` + +### Vendor Macros + +oneDNN can be built for a number of GPU vendors, such as Intel, Nvidia and AMD. +Adding support for generic GPU vendor is in progress. In order to be able to dispatch +between GPU vendor specific code at compile time the corresponding macros will be added. + +```cpp +// dnnl_config.hpp.in + +/// No vendor (corresponding runtime is disabled) +#define DNNL_VENDOR_NONE 0u + +/// Intel vendor +#define DNNL_VENDOR_INTEL 1u + +/// NVIDIA vendor +#define DNNL_VENDOR_NVIDIA 2u + +/// AMD vendor +#define DNNL_VENDOR_AMD 4u + +// oneDNN GPU vendor +#cmakedefine DNNL_GPU_VENDOR DNNL_VENDOR_${DNNL_GPU_VENDOR} + +// ... +``` + +### Namespaces and Prefixes + +All vendor-specific code should be enclosed in a namespace that has the vendor's name. +Based on the directory structure described above the following namespaces will be +introduced: +* `dnnl::impl` + * `dnnl::impl::xpu::sycl` + * `dnnl::impl::xpu::ocl` +* `dnnl::impl::cpu` + * `dnnl::impl::cpu::sycl` +* `dnnl::impl::gpu` + * `dnnl::impl::gpu::intel` + * `dnnl::impl::gpu::intel::compute` + * `dnnl::impl::gpu::intel::ocl` + * `dnnl::impl::gpu::intel::jit` + * `dnnl::impl::gpu::nvidia` + * `dnnl::impl::gpu::amd` + * `dnnl::impl::gpu::generic` + * `dnnl::impl::gpu::generic::sycl` + +Given that the namespaces already prevent name collisions adding the prefixes +such as `sycl_`, `ocl_`, etc is redundant and therefore the suggestion is to +drop the prefixes. For example, `dnnl::impl::gpu::ocl::ocl_gpu_engine_t` will +be converted to `dnnl::impl::gpu::intel::ocl::engine_t`. + +### Implementation List + +Currently there are separate implementation lists for each vendor, which +introduces unnecessary redundancy and increases maintenance cost. The proposal +is to unify the implementation lists into a single one. The CPU primitives already +use the approach. + +There will be introduced a set of new macros to instantiate vendor specific and +generic implementations. + +The macros for INTEL, NVIDIA and AMD vendors assume that all implementations +within a single vendor can be enabled at once regardless of the kernel language. + +The macros for the GENERIC vendor can be either truly generic or kernel language specific: +* Truly generic implementation is the one that is not tied to any vendor and kernel language, e.g. + an implementation of the concat primitive based on reorders +* Runtime specific implementation is the one that is tied to a particular kernel language, + e.g. SYCL generic implementations (written in generic SYCL) + + The concat, sum and reorder primitives require specialized versions of some of the the + macros because their `pd_t::create` functions have unique signatures. + +The following macros are the helper ones and not expected to be used directly: +* GPU_INSTANCE - primary macro to instantiate a GPU implementation +* GPU_CONCAT_INSTANCE - a specialization for the concat primitive +* GPU_SUM_INSTANCE - a specialization for the sum primitive +* GPU_REORDER_INSTANCE - a specialization for the reorder primitive + +The following are the _primary_ vendor specific macros that are used to instantiate vendor specific and generic implementations. +* GPU_INSTANCE_INTEL - instantiate implementations for the Intel vendor +* GPU_INSTANCE_NVIDIA - instantiate implementations for the Nvidia vendor +* GPU_INSTANCE_AMD - instantiate implementations for the AMD vendor +* GPU_INSTANCE_GENERIC_SYCL - instantiate implementations written in SYCL for the generic vendor (currently enabled only for Nvidia vendor) +* GPU_INSTANCE_GENERIC - instantiate implementations that a truly generic. Truly generic + implementation is the one that is not tied to any vendor and runtime, e.g. an implementation + of the concat primitive based on reorders + +The following are the specializations of the vendor specific macros for the concat primitive: +* GPU_CONCAT_INSTANCE_INTEL +* GPU_CONCAT_INSTANCE_NVIDIA +* GPU_CONCAT_INSTANCE_AMD +* GPU_CONCAT_INSTANCE_GENERIC_SYCL +* GPU_CONCAT_INSTANCE_GENERIC + +The following are the specializations of the vendor specific macros for the sum primitive: +* GPU_SUM_INSTANCE_INTEL +* GPU_SUM_INSTANCE_NVIDIA +* GPU_SUM_INSTANCE_AMD +* GPU_SUM_INSTANCE_GENERIC_SYCL +* GPU_SUM_INSTANCE_GENERIC + +The following are the specializations of the vendor specific macros for the reorder primitive: +* GPU_REORDER_INSTANCE_INTEL +* GPU_REORDER_INSTANCE_NVIDIA +* GPU_REORDER_INSTANCE_AMD +* GPU_REORDER_INSTANCE_GENERIC_SYCL +* GPU_REORDER_INSTANCE_GENERIC + +### Affected Basic Abstractions + +The new schema will require moving a lot of parts of the library around +and while most of the changes are probably just an implementation detail +there are a few major changes that are worth describing in this RFC. + +#### Engine + +The engine abstraction gets affected by the changes the most. + +There is a `compute_engine_t` abstraction that serves as a base class for +`sycl_engine_base_t` and `ocl_gpu_engine_t` classes. The problem is that each +vendor specific SYCL engine has to be inherited from `sycl_engine_base_t`and +therefore it cannot be derived from `compute_engine_t`. The solution to the problem +is to decouple the SYCL specific but vendor agnostic part of the `sycl_engine_base_t` +class and move it over to a new class. The `ocl_gpu_engine_t` class doesn't have +this particular problem but it makes sense to also decouple the OpenCL specific but +vendor agnostic part of it and move it over to the new class. + +The new class is defined as follows: +```cpp +// Location: src/common +namespace dnnl::impl { + +struct engine_impl_t { + engine_impl_t() = delete; + // Constructs an impl::engine_impl_t instance using the provided engine kind, runtime kind and index. + engine_impl_t(engine_kind_t kind, runtime_kind_t runtime_kind, size_t index); + virtual ~engine_impl_t() = default; + + // Returns engine kind. + engine_kind_t kind() const; + // Returns runtime kind. + runtime_kind_t runtime_kind() const; + // Returns index. + size_t index() const; + + // Returns engine ID. + virtual engine_id_t engine_id() const; + +#ifdef ONEDNN_BUILD_GRAPH + // Returns allocator. + void *get_allocator() const; + // Sets allocator. + void set_allocator(graph::allocator_t *alloc); +#endif + // Initializes the engine implementation. + virtual status_t init(); + // Creates a stream implementation. + virtual status_t create_stream_impl(impl::stream_impl_t **stream_impl, unsigned flags) const; + // Returns a buffer alignment. + virtual int get_buffer_alignment() const; + +private: + engine_kind_t kind; + runtime_kind_t runtime_kind; + size_t index; + +#ifdef ONEDNN_BUILD_GRAPH + graph::allocator_t allocator; +#endif +}; + +} // namespace dnnl::impl +``` + +```cpp +// Location: src/xpu/ocl +namespace dnnl::impl::xpu::ocl { + +struct engine_impl_t : public dnnl::impl::engine_impl_t { + engine_impl_t() = delete; + // Constructs an xpu::ocl::engine_impl_t instance using the provided OpenCL device, OpenCL context and index + engine_impl_t(cl_device_id device, cl_context context, size_t index); + ~engine_impl_t() override = default; + + // Initializes the engine implementation. + status_t init() override; + + // Creates a memory storage. + status_t create_memory_storage(memory_storage_t **storage, engine_t *engine, + unsigned flags, size_t size, void *handle) const; + + // Creates a memory storage. + status_t create_stream_impl( + impl::stream_impl_t **stream_impl, unsigned flags) const override; + + // Returns OpenCL device. + cl_device_id device() const; + // Returns OpenCL context. + cl_context context() const; + // Returns OpenCL platform. + cl_platform_id platform(); + // Returns engine ID. + engine_id_t engine_id() const override; + // Returns device name. + const std::string &name() const; + // Returns runtome version. + const runtime_version_t &runtime_version() const; + // Returns buffer alignment. + int get_buffer_alignment() const override; + +private: + xpu::ocl::wrapper_t device; + xpu::ocl::wrapper_t context; + cl_platform_id platform; + bool is_user_context; + + std::string name; + runtime_version_t runtime_version; +}; + +} // namespace dnnl::impl::xpu::ocl +``` + +```cpp +// Location: src/xpu/sycl +namespace dnnl::impl::xpu::sycl { + +struct engine_impl_t : public dnnl::impl::engine_impl_t { + engine_impl_t() = delete; + // Constructs an xpu::sycl::engine_impl_t instance using the provided SYCL device, SYCL context and index. + engine_impl_t(engine_kind_t kind, const ::sycl::device &device, + const ::sycl::context &context, size_t index); + ~engine_impl_t() override = default; + + // Initializes the engine implementation. + status_t init() override; + + // Creates a memory storage. + status_t create_memory_storage(memory_storage_t **storage, engine_t *engine, + unsigned flags, size_t size, void *handle) const; + + // Creates stream implementation. + status_t create_stream_impl( + impl::stream_impl_t **stream_impl, unsigned flags) const override; + + // Returns SYCL device. + const ::sycl::device &device() const; + // Returns SYCL context. + const ::sycl::context &context() const; + // Returns backend. + backend_t backend() const; + // Returns engine ID. + engine_id_t engine_id() const override; + // Returns device name. + const std::string &name() const; + // Returns runtime version. + const runtime_version_t &runtime_version() const; + // Returns buffer alignment. + int get_buffer_alignment() const override; + // Returns true if system memory allocations are supported. + bool mayiuse_system_memory_allocators() const; + +private: + ::sycl::device device; + ::sycl::context context; + + backend_t backend; + + std::string name; + runtime_version_t runtime_version; +}; + +} // namespace dnnl::impl::xpu::sycl +``` + +When it comes to the generic SYCL kernels there are 3 scenarios that should be +supported: +* Only generic kernels are enabled +* NVIDIA and generic kernels are enabled +* AMD and generic kernels are enabled + +According to the oneDNN programming model only 1 GPU engine can be enabled at a time. +In the second and third scenarios the NVIDIA and AMD specific engines will be used +for the generic SYCL kernels. In order to support the first scenario a separate, new +engine is required. As a result the generic SYCL kernels will be used with different +engines. To make it work there will be added a new `dnnl::impl::gpu::engine_t` class +that will hold `dnnl::impl::engine_impl_t` that will point to `dnnl::impl::sycl::engine_impl_t` +in this particular case. The `dnnl::impl::gpu::engine_t` will also provide +an interface to return a pointer to `dnnl::impl::engine_impl_t` so that the generic +SYCL kernels can query all SYCL specific information they may need (e.g. supported +sub-group sizes). + +The base `engine_t` class contains a pointer to the engine implementation and interface +to initialize it: +```cpp +// Location: src/common +namespace dnnl::impl { +struct engine_t : dnnl::impl::c_compatible { + // Each engine creates an engine implementation and passes it down to the base engine class. + engine(dnnl::impl::engine_impl_t *impl) : impl(impl), ... {} + + // Returns a pointer to the engine implementation. + const dnnl::impl::engine_impl_t *impl() const; + +protected: + // Each engine has to call `init_impl()` to initialize the implementation. + dnnl::impl::status_t init_impl(); +private: + // Points to a particular engine implementation. + std::unique_ptr impl; +}; +``` + +The `dnnl::impl::gpu::engine_t` is defined as follows: +```cpp +// Location: src/gpu +namespace dnnl::impl::gpu { + +struct engine_t : public dnnl::impl::engine_t { + // Interfaces to get implementation lists. + const impl_list_item_t *get_reorder_implementation_list( + const memory_desc_t *src_md, const memory_desc_t *dst_md) const override; + const impl_list_item_t *get_concat_implementation_list() const override; + const impl_list_item_t *get_sum_implementation_list() const override; + const impl_list_item_t *get_implementation_list(const op_desc_t *desc) const override; + + // Returns an alignment that should be used for memory allocations on the GPU. + int get_buffer_alignment() const; + + // Return a service stream. + status_t get_service_stream(impl::stream_t *&stream) override; + +private: + std::unique_ptr service_stream; +}; + +} // namespace dnnl::impl::gpu +``` + +The following vendor specific engine classes will replace the currently +implemented ones. + +This class takes over responsibility of `compute_engine_t` class. +```cpp +// Location: src/gpu/intel +namespace dnnl::impl::gpu::intel { + +struct engine_t : public dnnl::impl::gpu::engine_t {}; + +} // namespace dnnl::impl::gpu::intel +``` + +This class takes over responsibility of `sycl_engine_base_t` class. +```cpp +// Location: src/gpu/intel/sycl +namespace dnnl::impl::gpu::intel::sycl { + +struct engine_t : public dnnl::impl::gpu::intel::engine_t { +protected: + // Convenience interface to simplify access to `impl` within the class. + const dnnl::impl::xpu::sycl::engine_impl_t *impl() const; +}; + +} // namespace dnnl::impl::gpu::intel::sycl +``` + +This class takes over responsibility of `ocl_gpu_engine_t` class. +```cpp +// Location: src/gpu/intel/ocl +namespace dnnl::impl::gpu::intel::ocl { + +struct engine_t : public dnnl::impl::gpu::intel::engine_t { +protected: + // Convenience interface to simplify access to `impl` within the class. + const dnnl::impl::xpu::ocl::engine_impl_t *impl() const; +}; + +} // namespace dnnl::impl::gpu::intel::ocl +``` + +This class takes over responsibility of `sycl_cuda_engine_t` and `sycl_hip_engine_t` classes. +```cpp +// Location: src/gpu/nvidia +namespace dnnl::impl::gpu::nvidia { + +struct engine_t : public dnnl::impl::gpu::engine_t { +protected: + // Convenience interface to simplify access to `impl` within the class. + const dnnl::impl::xpu::sycl::engine_impl_t *impl() const; +}; + +} // namespace dnnl::impl::gpu::nvidia + + +// Location: src/gpu/amd +namespace dnnl::impl::gpu::amd { + +struct engine_t : public dnnl::impl::gpu::engine_t { +protected: + // Convenience interface to simplify access to `impl` within the class. + const dnnl::impl::xpu::sycl::engine_impl_t *impl() const; +}; + +} // namespace dnnl::impl::gpu::amd +``` + +This class will be used when only generic SYCL kernels are enabled. +```cpp +// Location: src/gpu/generic +namespace dnnl::impl::gpu::generic { + +struct engine_t : public dnnl::impl::gpu::engine_t { +protected: + // Convenience interface to simplify access to `impl` within the class. + const dnnl::impl::xpu::sycl::engine_impl_t *impl() const; +}; + +} // namespace dnnl::impl::gpu::generic + +``` + +The present inheritance chains for SYCL and OpenCL GPU engines are +the following: +* SYCL Intel and generic: `engine_t` -> `compute_engine_t` -> `sycl_engine_base_t` -> `sycl_gpu_engine_t` +* SYCL NVIDIA: `engine_t` -> `compute_engine_t` -> `sycl_engine_base_t` -> `sycl_cuda_engine_t` +* SYCL AMD: `engine_t` -> `compute_engine_t` -> `sycl_engine_base_t` -> `sycl_hip_engine_t` +* OpenCL (only Intel): `engine_t` -> `compute_engine_t` -> `ocl_gpu_engine_t` + +The new inheritance chains for SYCL and OpenCL GPU engines are +the following: +* SYCL Generic: `engine_t` -> `gpu::engine_t` -> `gpu::generic::engine_t` +* SYCL Intel: `engine_t` -> `gpu::engine_t` -> `gpu::intel::engine_t` -> `gpu::intel::sycl::engine_t` +* SYCL NVIDIA: `engine_t` -> `gpu::engine_t` -> `gpu::nvidia::engine_t` +* SYCL AMD: `engine_t` -> `gpu::engine_t` -> `gpu::amd::engine_t` +* OpenCL: `engine_t` -> `gpu::engine_t` -> `gpu::intel::engine_t` -> `gpu::intel::ocl::engine_t` + +Reminder: the `gpu::engine_t` holds an `engine_impl_t` pointer that points to either +`sycl::engine_impl_t` or `ocl::engine_impl_t`. + +#### Stream + +Similar to the engine, the following stream abstractions will be introduced. + +```cpp +// Location: src/common +namespace dnnl::impl { + +struct stream_impl_t { + stream_impl_t() = delete; + // Constructs an impl::stream_impl_t instance using the provided stream flags. + stream_impl_t(unsigned flags); +#if DNNL_CPU_RUNTIME == DNNL_RUNTIME_THREADPOOL + // Constructs an impl::stream_impl_t instance using the provided threadpool. + stream_impl_t(threadpool_interop::threadpool_iface *threadpool); +#endif + virtual ~stream_impl_t() = default; + + // Returns stream flags. + unsigned flags() const; + // Returns true if the profiling flag is set. + bool is_profiling_enabled() const; + +#if DNNL_CPU_RUNTIME == DNNL_RUNTIME_THREADPOOL + // Returns threadpool. + status_t get_threadpool(threadpool_interop::threadpool_iface **threadpool) const; +#endif + +private: + unsigned flags; +#if DNNL_CPU_RUNTIME == DNNL_RUNTIME_THREADPOOL + threadpool_interop::threadpool_iface *threadpool; +#endif +}; + +} // namespace dnnl::impl +``` +```cpp +// Location: src/xpu/ocl +namespace dnnl::impl::xpu::ocl { + +struct stream_impl_t : public dnnl::impl::stream_impl_t { + stream_impl_t() = delete; + // Constructs an xpu::ocl::stream_impl_t instance using the provided stream flags. + stream_impl_t(unsigned flags); + // Constructs an xpu::ocl::stream_impl_t instance using the provided OpenCL queue and stream flags. + stream_impl_t(cl_command_queue queue, unsigned flags); + ~stream_impl_t() override; + + // Sets OpenCL queue. + status_t set_queue(cl_command_queue queue); + + // Returns OpenCL queue. + cl_command_queue queue(); + + // Wait on the queue. + status_t wait(); + + // Copies data from src to dst. + status_t copy(impl::stream_t *stream, const memory_storage_t &src, + const memory_storage_t &dst, size_t size, const xpu::event_t &deps, + xpu::event_t &out_dep, + xpu::stream_profiler_t *stream_profiler = nullptr); + + // Fills dst with pattern. + status_t fill(impl::stream_t *stream, const memory_storage_t &dst, + uint8_t pattern, size_t size, const xpu::event_t &deps, + xpu::event_t &out_dep, + xpu::stream_profiler_t *stream_profiler = nullptr); + + // Returns oneDNN OpenCL context. + const xpu::ocl::context_t &ocl_ctx() const; + // Returns oneDNN OpenCL context. + xpu::ocl::context_t &ocl_ctx(); + // Returns oneDNN XPU context. + xpu::context_t &ctx(); + // Returns oneDNN XPU context. + const xpu::context_t &ctx() const; + + // Returns output OpenCL event. + const xpu::ocl::wrapper_t &get_output_event() const; + + // Initializes flags. + static status_t init_flags(unsigned *flags, cl_command_queue queue); + +private: + cl_command_queue queue; + mutable utils::thread_local_storage_t ctx; +}; + +} // namespace dnnl::impl::xpu::ocl +``` +```cpp +// Location: src/xpu/sycl +namespace dnnl::impl::xpu::sycl { + +struct stream_impl_t : public dnnl::impl::stream_impl_t { + stream_impl_t() = delete; + // Constructs an xpu::sycl::stream_impl_t instance using the provided stream flags. + stream_impl_t(unsigned flags); + // Constructs an xpu::sycl::stream_impl_t instance using the provided SYCL queue and stream flags. + stream_impl_t(const ::sycl::queue &queue, unsigned flags); + ~stream_impl_t() override = default; + + // Sets SYCL queue. + status_t set_queue(::sycl::queue queue); + + // Returns SYCL queue. + ::sycl::queue *queue(); + + // Wait on the queue. + status_t wait(); + + // Copies data from src to dst. + status_t copy(impl::stream_t *stream, const memory_storage_t &src, const memory_storage_t &dst, + size_t size, const xpu::event_t &deps, xpu::event_t &out_dep, + xpu::stream_profiler_t *stream_profiler = nullptr); + // Fills dst with pattern. + status_t fill(const memory_storage_t &dst, uint8_t pattern, size_t size, const xpu::event_t &deps, + xpu::event_t &out_dep, xpu::stream_profiler_t *stream_profiler = nullptr); + + // Returns oneDNN SYCL context. + const xpu::sycl::context_t &sycl_ctx() const; + // Returns oneDNN SYCL context. + xpu::sycl::context_t &sycl_ctx(); + // Returns oneDNN XPU context. + xpu::context_t &ctx(); + // Returns oneDNN XPU context. + const xpu::context_t &ctx() const; + + // Returns output SYCL event. + ::sycl::event get_output_event(); + + // Registers dependency. + void register_deps(::sycl::handler &cgh) const; + + // Initializes flags. + static status_t init_flags(unsigned *flags, ::sycl::queue &queue); + + // Returns a dummy accessor. + template <::sycl::access_mode mode> + ::sycl::accessor get_dummy_accessor( + ::sycl::handler &cgh); + +private: + std::unique_ptr<::sycl::queue> queue; + mutable utils::thread_local_storage_t ctx; + // XXX: this is a temporary solution to make sycl_memory_arg_t + // default constructible. + xpu::sycl::buffer_u8_t dummy_buffer = xpu::sycl::buffer_u8_t(1); +}; + +} // namespace dnnl::impl::xpu::sycl +``` +The `dnnl::impl::stream_t` is modified to contain a pointer to `impl::stream_impl_t`: +```cpp +// Location: src/common +namespace dnnl::impl { + +struct stream_t : public impl::c_compatible { + // Constructs an impl::stream_t instance using the provided engine and stream implementation. + // This constructor substitutes the existing one. + stream_t(dnnl::impl::engine_t *engine, dnnl::impl::stream_impl_t *impl); + + // Returns a non-const pointer to the stream implementation. + impl::stream_impl_t *impl() { return impl.get(); } +protected: + std::unique_ptr impl; +}; + +} // namespace dnnl::impl +``` + +The `dnnl::impl::gpu::stream_t` is defined as follows: + +```cpp +// Location: src/gpu +namespace dnnl::impl::gpu { + +struct stream_t : public dnnl::impl::stream_t { + // Copies data from src to dst. + virtual status_t copy(const memory_storage_t &src, + const memory_storage_t &dst, size_t size, const xpu::event_t &dep, + xpu::event_t &out_dep) + = 0; + // Fills dst with pattern. + virtual status_t fill(const memory_storage_t &dst, uint8_t pattern, + size_t size, const xpu::event_t &deps, xpu::event_t &out_dep) + = 0; + + // Returns oneDNN XPU context. + virtual xpu::context_t &ctx() = 0; + // Returns oneDNN XPU context. + virtual const xpu::context_t &ctx() const = 0; + + // Returns stream profiler. + virtual const xpu::stream_profiler_t &profiler() const; + // Returns stream profiler. + xpu::stream_profiler_t &profiler(); + + // Returns frequency. + virtual double get_freq(const xpu::event_t &event) const { return 0.0; } + +protected: + std::unique_ptr profiler; +}; + +} // namespace dnnl::impl::gpu +``` + +The following vendor specific stream classes will replace the currently +implemented ones. The vendor specific streams are responsible for initializing +the stream implementation. + +This class takes over responsibility of `compute_stream_t` class. +```cpp +// Location: src/gpu/intel +namespace dnnl::impl::gpu::intel { + +struct stream_t : public dnnl::impl::gpu::stream_t {}; + +} // namespace dnnl::impl::gpu::intel +``` + +This class takes over responsibility of `sycl_stream_t` class. +```cpp +// Location: src/gpu/intel/sycl +namespace dnnl::impl::gpu::intel::sycl { + +struct stream_t : public dnnl::impl::gpu::intel::stream_t {}; + +} // namespace dnnl::impl::gpu::intel::sycl +``` + +This class takes over responsibility of `ocl_stream_t` class. +```cpp +// Location: src/gpu/intel/ocl +namespace dnnl::impl::gpu::intel::ocl { + +struct stream_t : public dnnl::impl::gpu::intel::stream_t {}; + +} // namespace dnnl::impl::gpu::intel::ocl +``` + +This class takes over responsibility of `sycl_cuda_stream_t` and `sycl_hip_stream_t` classes. +```cpp +// Location: src/gpu/nvidia +namespace dnnl::impl::gpu::nvidia { + +struct stream_t : public dnnl::impl::gpu::stream_t {}; + +} // namespace dnnl::impl::gpu::nvidia + + +// Location: src/gpu/amd +namespace dnnl::impl::gpu::amd { + +struct stream_t : public dnnl::impl::gpu::stream_t {}; + +} // namespace dnnl::impl::gpu::amd +``` + +This class will be used when only generic SYCL kernels are enabled. +```cpp +// Location: src/gpu/generic +namespace dnnl::impl::gpu::generic { + +struct stream_t : public dnnl::impl::gpu::stream_t {}; + +} // namespace dnnl::impl::gpu::generic + +``` + +The current inheritance chains for SYCL and OpenCL GPU streams are the following: +* SYCL Intel and genetic: `stream_t` -> `compute_stream_t` -> `sycl_stream_t` +* SYCL NVIDIA: `stream_t` -> `compute_stream_t` -> `sycl_stream_t` -> `sycl_cuda_stream_t` +* SYCL AMD: `stream_t` -> `compute_stream_t` -> `sycl_stream_t` -> `sycl_hip_stream_t` +* OpenCL: `stream_t` -> `compute_stream_t` -> `ocl_stream_t` + +The new inheritance chains for SYCL and OpenCL GPU streams are +the following: +* SYCL Generic: `stream_t` -> `gpu::stream_t` -> `gpu::generic::stream_t` +* SYCL Intel: `stream_t` -> `gpu::stream_t` -> `gpu::intel::stream_t` -> `gpu::intel::sycl::stream_t` +* SYCL NVIDIA: `stream_t` -> `gpu::stream_t` -> `gpu::nvidia::stream_t` +* SYCL AMD: `stream_t` -> `gpu::stream_t` -> `gpu::amd::stream_t` +* OpenCL: `stream_t` -> `gpu::stream_t` -> `gpu::intel::stream_t` -> `gpu::intel::ocl::stream_t` + +Reminder: the `gpu::stream_t` holds an `stream_impl_t` pointer that points to either +`sycl::stream_impl_t` or `ocl::stream_impl_t`. +