Beginning
There are a whole lot of features in the Cekirdekler API, so one may not need 90% of them at first. To help GPGPU beginners pick the best feature to begin with, here is a list of needs tied to the features that address them:
I have a kernel to try, just for learning OpenCL and comparing its performance against naive single-thread solutions
- Just start with a "hello world" and adjust it until it suits your compute requirements.
- Be sure to have an OpenCL-capable device such as a GPU or CPU, and pick it carefully with `AcceleratorType.CPU` (or `AcceleratorType.GPU`), or its newer equivalent `Hardware.ClPlatforms.all().gpus()[0]`, as the device parameter.
- If you don't know anything about OpenCL, just start with `AcceleratorType.CPU`, or the newer way of picking it, `Hardware.ClPlatforms.all().cpus()`, first.
- If you have more than one device, pick only one using explicit device selection, then pass it as a parameter to the hello-world ClNumberCruncher constructor, as in the sketch below.
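A minimal hello-world sketch along these lines; the kernel body and range values are illustrative, so adjust them to your own compute requirements:

```csharp
using Cekirdekler;
using Cekirdekler.ClArrays;

// explicit device type selection; AcceleratorType.CPU works the same way
ClNumberCruncher cr = new ClNumberCruncher(AcceleratorType.GPU, @"
    __kernel void hello(__global char * arr)
    {
        arr[get_global_id(0)] = 1;
    }");

ClArray<byte> array = new ClArray<byte>(1000);

// computeId = 1, kernel name, global range = 1000, local range = 100
array.compute(cr, 1, "hello", 1000, 100);
```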
I have a kernel that is embarrassingly parallel on both data and workitems by nature
- "Hello world" again. But this time you can enable features like
event driven pipelining
with a single true/false parameter in same compute() method, to make it faster by partitioning kernel into grains and separating read,write,compute parts of them if the device benefits from overlapped data movement+compute.
-
driver controlled pipelining
can also be activated with another true/false parameter near "enabling pipeline" parameter of compute() method. This leaves all overlapping behavior of grains to drivers of device.
- This time also multiple GPUs are better since there is load balancer working automatically. All devices share accordingly with their performances. Most performant device gets most percentage of work. This minimizes compute() latency for time-critical applications.
Multi GPUs can work with event/driver based pipelining too, as long as grain size and total size fits the requirements.
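The trailing parameter names and order below are assumptions derived from the description above, not the exact signature; check the actual compute() overloads:

```csharp
// assumed overload: compute(cruncher, computeId, kernels, globalRange, localRange,
//                           globalOffset, pipelineEnabled, numberOfGrains, pipelineType)
array.compute(cr, 1, "vectorAdd", 1024 * 1024, 256,
              0,                      // global range offset
              true,                   // enable pipelining: work is split into grains and
                                      // their read/write/compute parts are overlapped
              16,                     // number of grains
              Cores.PIPELINE_EVENT);  // or Cores.PIPELINE_DRIVER to leave overlap to drivers
```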
I need to build a custom pipeline with overlapped stages so I can squeeze the last bits of performance out of my GPU. Also, my kernels are not separable across multiple GPUs because they have atomic functions and some rely on workgroup-id values.
- There is a `SingleGPUPipeline` namespace in the `Pipeline` namespace, which is in the `Cekirdekler` namespace.
- Create buffers for all stages, create all stages, then create the pipeline with this namespace, roughly as sketched below.
- Works with only a single device.
- Increases throughput but also increases "keyboard to pixel" latency, so it is better suited to batch computing.
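A conceptual sketch only: `Stage` and `Pipeline` below are hypothetical placeholder types, since the exact class names inside `Cekirdekler.Pipeline.SingleGPUPipeline` are not listed here:

```csharp
using Cekirdekler.Pipeline.SingleGPUPipeline;

// buffers for every stage (this feature also accepts plain C# arrays as stage buffers)
float[] input = new float[1024];
float[] intermediate = new float[1024];
float[] output = new float[1024];

// hypothetical placeholder types and constructors:
var stage1 = new Stage(kernelString, "blurKernel", input, intermediate);
var stage2 = new Stage(kernelString, "sharpenKernel", intermediate, output);
var pipeline = new Pipeline(stage1, stage2);

// overlapped execution on one device: while stage2 computes batch N,
// stage1 computes batch N+1; throughput rises, keyboard-to-pixel latency rises too
pipeline.run();
```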
I have several kernels that can't be partitioned into grains nor split across multiple GPUs, yet I need all my GPUs working at the same time. How do I keep the GPUs busy so the total work is done quicker?
- Either use the single GPU pipeline again, once per GPU, and add a data moving mechanism yourself to connect the multiple pipelines, or read the next option below.
- If you have multiple devices, use the `Pipeline` namespace's `device to device pipeline` feature to have data flow from the host to one GPU, then to another GPU, and finally back to the host. This at least lets one use all GPUs when there are multiple kernels that feed data to the next kernel and are not separable in terms of workitems. A hedged sketch follows below.
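The stage/pipeline type and method names in this sketch are assumptions modeled on the description above, not the exact API:

```csharp
// all names below are assumptions sketching the described flow
var gpus = Hardware.ClPlatforms.all().gpus();

var stage1 = new ClPipelineStage();            // builder type name assumed
stage1.addDevices(gpus[0]);                    // first GPU runs the first kernel
stage1.addKernels(kernelString, "producerKernel");

var stage2 = new ClPipelineStage();
stage2.addDevices(gpus[1]);                    // second GPU runs the next kernel
stage2.addKernels(kernelString, "consumerKernel");

stage2.prependStage(stage1);                   // linking method name assumed
var pipeline = stage2.makePipeline();          // pipeline creation assumed

// each push moves a batch: host -> gpu0 -> gpu1 -> host,
// so both GPUs stay busy on different batches at the same time
pipeline.pushData();
```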
I don't want the API to synchronize data between host and device at every single kernel for me. I want all compute() calls handled at once, and I need to deliver my own queue scheme for a specific task. I have only a single device and I don't need driver/event pipelining.
- Enable `enqueue mode` from the ClNumberCruncher instance.
- (optional) Enable `no compute mode` from the ClNumberCruncher instance ---> this is async
- (optional) Do any compute() (for only data upload and data download) ---> this is async
- (optional) Disable `no compute mode` from the ClNumberCruncher instance ---> this is async
- Do any number of compute() (for data upload, compute, data download) ---> this is async
- (optional) Enable `async enqueue mode` from the ClNumberCruncher instance ---> this is async
- (optional) Do any compute() (computes on another queue, in parallel) ---> this is async
- (optional) Disable `async enqueue mode` from the ClNumberCruncher instance ---> this is async
- Do some host side C# work ---> inherently async to GPU
- Disable `enqueue mode` from the ClNumberCruncher instance (this blocks until everything enqueued has finished). A property-level sketch follows below.
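In property form, the sequence might look like this; the property names (`enqueueMode`, `noComputeMode`, and the async-enqueue toggle) are assumptions derived from the feature names above:

```csharp
cr.enqueueMode = true;                       // property name assumed; compute() stops blocking

cr.noComputeMode = true;                     // assumed; compute() now only moves data
arrayIn.compute(cr, 1, "kernelA", 1024, 64); // async: upload/download only
cr.noComputeMode = false;                    // assumed

arrayIn.compute(cr, 1, "kernelA", 1024, 64); // async: upload + compute + download

cr.enqueueModeAsyncEnable = true;            // assumed; later calls go to another queue
arrayIn.compute(cr, 2, "kernelB", 1024, 64); // async: runs in parallel on the extra queue
cr.enqueueModeAsyncEnable = false;           // assumed

DoSomeHostSideWork();                        // your own C# work, inherently async to the GPU

cr.enqueueMode = false;                      // blocks until everything enqueued has finished
```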
I have a lot of non-separable kernels (with atomic functions, ...), each with different buffer parameters (or at least working on different ranges of a buffer), and I also have multiple GPUs. The kernels may or may not share names, and none of them needs to be synchronized before any other kernel. How can I schedule kernels to the right devices to minimize the total compute time? Long story short, I need something like "tiled rendering", but just for computing with multiple GPUs, to get irresponsible amounts of performance.
- Create `ClTask` instances with the `.task()` method instead of using `.compute()` directly.
- Add all created tasks to a `ClTaskPool`.
- Create a `ClDevicePool`.
- Add all devices to the `ClDevicePool`.
- Feed all `ClTaskPool` instances to the `ClDevicePool`.
- Do some asynchronous C# work.
- Call the synchronization method of `ClDevicePool`. A sketch of these steps follows below.
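A sketch of those steps; `.task()` is named above, while the pool method names are assumptions:

```csharp
// create tasks instead of computing immediately (.task() mirrors .compute() per the text)
ClTask task1 = arrayA.task(1, "kernelA", 1024, 64);
ClTask task2 = arrayB.task(2, "kernelB", 2048, 64);

ClTaskPool taskPool = new ClTaskPool();
taskPool.feed(task1);                        // method name assumed
taskPool.feed(task2);

ClDevicePool devicePool = new ClDevicePool();
foreach (var gpu in Hardware.ClPlatforms.all().gpus())
    devicePool.addDevice(gpu);               // method name assumed

devicePool.feed(taskPool);                   // tasks get scheduled to whichever device is free

DoSomeAsynchronousHostWork();                // your own C# work

devicePool.sync();                           // synchronization method, name assumed
```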
There is some work to do, but the number of workers needed is unknown without parsing the data or computing first. I need the work scheduling handled by the GPU itself, and I have an OpenCL 2.0 capable graphics card.
- Make sure to get Cekirdekler version 1.4.1+.
- Write OpenCL kernels that use dynamic parallelism commands such as `enqueue_kernel()` and `get_default_queue()`, as in the example below.
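For example, a parent kernel can let the device spawn its own child workload once the needed amount of work is known. The kernel source below is standard OpenCL 2.0 C held in a C# string, as the API expects; the kernel names and buffer layout are illustrative:

```csharp
// OpenCL 2.0 device-side enqueue: the parent inspects the data,
// then enqueues the child kernel on the device's default queue
string kernels = @"
    __kernel void childKernel(__global float * data)
    {
        data[get_global_id(0)] *= 2.0f;
    }

    __kernel void parentKernel(__global float * data, __global int * workAmount)
    {
        if (get_global_id(0) == 0)
        {
            enqueue_kernel(get_default_queue(),
                           CLK_ENQUEUE_FLAGS_NO_WAIT,
                           ndrange_1D(workAmount[0]),
                           ^{ childKernel(data); });
        }
    }";
```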
I need to upload data to the GPU, repeat a kernel N times (with the same global/local size), then download the results to C#.
- Set "Repeat count" property of ClNumberCruncher to a desired value.
- compute() // blocks
- Reset repeat count if not needed to repeat anymore.
- also usable with enqueue mode
- also usable on a series of different kernels: "kernel1 kernel2 kernel3" gets repeated with same order N times
- if you need a counting/resetting kernel added to end of each iteration of repeats, set "repeat kernel name" field to a kernel name. Then repeating 3 consecutive kernels becomes "kernel1 kernel2 kernel3 repeatKernel kernel1 kernel2 kernel3 repeatKernel kernel1 kernel2 ...."
- copying same kernel name in string parameter(separated with spaces, commas, ..) is also equivalent to repeating it but somewhat less efficient for some machines with less string performance.
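In code, with property names assumed from the quoted field names above (check the actual ClNumberCruncher members):

```csharp
cr.repeatCount = 100;                 // "repeat count" property, name assumed
cr.repeatKernelName = "resetKernel";  // optional "repeat kernel name" field, name assumed

// blocks; runs "kernel1 kernel2 kernel3 resetKernel" in order, 100 times,
// with upload before the first iteration and download after the last
array.compute(cr, 1, "kernel1 kernel2 kernel3", 1024, 64);

cr.repeatCount = 1;                   // reset when repeating is no longer needed
```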
I have C# arrays with primitive elements such as floats and bytes but I need to use them directly.
- To start compute(), the first array parameter has to be a `ClArray<Type>`, and it can be initialized as:

```csharp
byte[] myArray = new byte[1024];
ClArray<byte> myWrapper = myArray; // binds itself to the C# array, pins it for compute
```
- Then you can add additional parameters of pure C# type as you wish before compute():

```csharp
byte[] myArray = new byte[1024];
ClArray<byte> myWrapper = myArray;
float[] myArray2 = new float[1024];
myWrapper.nextParam(myArray2).compute();
```

Here myArray2 could be of type `float[]`, `byte[]`, ..., or `ClArray<float>`, and multiple of them can be added at once.
- Using C# arrays directly has some performance pitfalls, such as extra host-to-host buffer copies made before the data goes to the device. Creating a ClArray instance with `new` uses C++ arrays internally; not only does it make fewer copies when possible, it is also aligned to 4096-byte boundaries for even faster access. The difference is sketched after this list.
- The `single device pipeline` feature can take pure C# arrays as buffers for each stage, without using any ClArray.
I don't need C# arrays; getting just a few result values after a compute is enough for me. Maximum buffer access performance between host and device is what I need.
- There is a `streaming data` parameter per `explicit device selection` operation. The same `streaming data` parameter also exists per ClNumberCruncher when no explicit device selection is used. These enable device-side streaming. To enable it for buffers too, a ClArray instance with C++ array allocation (the default with `new`, or ClFloatArray, ClByteArray, ...) is needed, and its `zero copy` field must be set before starting compute(). Zero copy means a GPU workitem directly accesses RAM without copying anything. This is the fastest access mode, and device-side streaming is enabled automatically for CPUs and integrated GPUs (if they share the same RAM with the CPU).

```csharp
// when zero copy is enabled from the buffer and streaming is enabled from the device:

// zero copy capable (equivalent to CL_MEM_USE_HOST_PTR)
ClArray<byte> myWrapper = new ClArray<byte>(1024);

// zero copy capable (equivalent to CL_MEM_USE_HOST_PTR)
ClArray<byte> myWrapper2 = new ClByteArray(1024);

// slower (equivalent to CL_MEM_ALLOC_HOST_PTR), but still faster than a
// non-streamed array if data is accessed only once or twice per kernel
ClArray<byte> myWrapper3 = new byte[1024];

// when zero copy is disabled from the buffer or streaming is disabled from the device:

// copies, fast (equivalent to CL_MEM_READ_WRITE), faster when data is accessed multiple times
ClArray<byte> myWrapper4 = new ClArray<byte>(1024);

// copies, fast (equivalent to CL_MEM_READ_WRITE), faster when data is accessed multiple times
ClArray<byte> myWrapper5 = new ClByteArray(1024);

// copies, slowest (equivalent to CL_MEM_READ_WRITE), faster when data is accessed multiple times
ClArray<byte> myWrapper6 = new byte[1024];
```
- Enable zero copy only when data is accessed once or twice per kernel. Disable it (which makes buffers use dedicated device memory) when data is accessed multiple times, so those accesses are faster.
- One example for streaming (with zero copy) is a c=a+b kernel: each element is accessed only once, and the work is embarrassingly parallel. One example for non-streaming (with extra copies) is an "array sort" kernel sequence, which reads an array many times (log(N) times).
- If an array is meant to be read-only, enable the readOnly property of ClArray before any compute() to create its device buffer as `CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY`, which implies the host will only write to it and kernels will only read from it. If an array is meant to be write-only, enable the writeOnly property of ClArray to gain the `CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY` flags on the device-side buffer. An example follows below.
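For example, using the readOnly/writeOnly properties named above:

```csharp
// host writes these once; kernels only read them
ClArray<float> coefficients = new ClArray<float>(1024);
coefficients.readOnly = true;  // device buffer: CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY

// kernels only write these; the host only reads them back
ClArray<float> results = new ClArray<float>(1024);
results.writeOnly = true;      // device buffer: CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY
```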
(Images in this page were made with "SchemeIt" at www.digikey.com.)