
Beginning


There are a whole lot of features in the Cekirdekler API, so one may not need 90% of them at first. To help GPGPU beginners pick the best feature to begin with, here is a list of reasons tied to a list of features:

I have a kernel to try, just for learning OpenCL and comparing its performance against naive single-threaded solutions

  • Just start with a "hello world" and adjust it until it suits your compute requirements (a sketch follows this list).
  • Be sure to have an OpenCL-capable device such as a GPU or CPU, and select it carefully with AcceleratorType.CPU (or GPU) or its newer equivalent Hardware.ClPlatforms.all().gpus()[0] as the device parameter.
  • If you don't know anything about OpenCL, just start with AcceleratorType.CPU, or its newer selection style Hardware.ClPlatforms.all().cpus(), first.
  • If you have more than one device, pick only one using explicit device selection, then pass it as a parameter to the "hello world" ClNumberCruncher constructor.
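A minimal sketch of such a "hello world", assembled from the pieces named above (the namespaces follow the project README; the kernel body, ranges, and compute id are illustrative, and cleanup is omitted):

    using Cekirdekler;
    using Cekirdekler.ClArrays;

    // pick a single device explicitly, as suggested above
    ClNumberCruncher cr = new ClNumberCruncher(
        Hardware.ClPlatforms.all().gpus()[0], @"
        __kernel void hello(__global char * arr)
        {
            printf(""hello world"");
        }");

    ClArray<byte> array = new ClArray<byte>(1024);

    // 1024 workitems, 64 per workgroup, compute id = 1
    array.compute(cr, 1, "hello", 1024, 64);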

(diagram: hello world)


I have a kernel that is embarrassingly parallel, in both data and workitems, by nature

  • "Hello world" again. But this time you can enable features like event driven pipelining with a single true/false parameter in same compute() method, to make it faster by partitioning kernel into grains and separating read,write,compute parts of them if the device benefits from overlapped data movement+compute.

(diagram: event-driven pipeline)

  • Driver-controlled pipelining can also be activated with another true/false parameter, next to the "enable pipeline" parameter of the compute() method. This leaves all overlapping behavior of the grains to the device's drivers, as sketched below.
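If your version exposes the driver/event choice as a pipeline-type constant rather than a second boolean (the assumption used in the previous sketch), the call would look like:

    // same assumed layout as above, selecting the driver-controlled variant
    data.compute(cr, 1, "vectorOp", 1024 * 1024, 64, 0,
                 true, Cekirdekler.Cores.PIPELINE_DRIVER, 4);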

(diagram: driver-controlled pipeline)

  • This time multiple GPUs are also better, since a load balancer works automatically: all devices share the work according to their performance, and the most performant device gets the largest share. This minimizes compute() latency for time-critical applications.

(diagram: load balancing)

Multiple GPUs can work with event/driver-based pipelining too, as long as the grain size and total size fit the requirements; a multi-device selection sketch follows.
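Selecting several devices at once is a small change to the hello-world setup; passing the whole device list to the constructor (assumed to be accepted, just like the single-device selection above) lets the load balancer distribute the work:

    // all GPUs in the system, instead of gpus()[0]
    var gpus = Hardware.ClPlatforms.all().gpus();

    // assumption: the constructor accepts a device list as well as a single device
    ClNumberCruncher cr = new ClNumberCruncher(gpus, kernelsString);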


I need to build a custom pipeline with overlapped stages, so I can squeeze the last bits of performance out of my GPU. Also, my kernels cannot be split across multiple GPUs because they use atomic functions and some rely on workgroup-id values.

  • There is a SingleGPUPipeline namespace inside the Pipeline namespace, which is in the Cekirdekler namespace.
  • Create buffers for all stages, create all stages, then build the pipeline with this namespace (a rough sketch follows this list).
  • Works with only a single device.
  • Increases throughput but also increases "keyboard to pixel" latency, so it is better suited to batch computing.
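A rough sketch of that flow, under the assumption that a stage couples one kernel with its input/output buffers and stages are then chained into a pipeline object. Every type and member name below (Stage, DevicePipeline, run) is a hypothetical placeholder for whatever the SingleGPUPipeline namespace actually provides:

    // all names here are hypothetical placeholders, not the real API
    var input  = new ClArray<float>(1024);
    var middle = new ClArray<float>(1024);
    var output = new ClArray<float>(1024);

    var stage1 = new Stage(kernelsString, "firstKernel",  input,  middle); // hypothetical
    var stage2 = new Stage(kernelsString, "secondKernel", middle, output); // hypothetical

    var pipeline = new DevicePipeline(gpu, stage1, stage2);                // hypothetical
    pipeline.run(); // overlaps stage1 of batch N+1 with stage2 of batch N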

(diagram: single device pipeline)


I have several kernels that can be neither partitioned into grains nor split across multiple GPUs, yet I need all my GPUs working at the same time. How do I keep the GPUs busy so the total work is done quicker?

  • Either use the single GPU pipeline again, once per GPU, and add a data moving mechanism yourself to connect the multiple pipelines, or read the next option below.
  • If you have multiple devices, use the Pipeline namespace's device-to-device pipeline feature to have data flow from the host to one GPU, then to another GPU, and finally back to the host. This at least lets one use all GPUs when there are multiple kernels that feed data to the next kernel and are not separable in terms of workitems (see the sketch below).
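Conceptually this is the same stage idea as before, but with each stage bound to a different GPU so data flows host → GPU 0 → GPU 1 → host. As in the previous sketch, the names are hypothetical placeholders:

    var gpus = Hardware.ClPlatforms.all().gpus();

    var stageA = new Stage(kernelsString, "producerKernel", input, temp);  // on gpus[0], hypothetical
    var stageB = new Stage(kernelsString, "consumerKernel", temp, output); // on gpus[1], hypothetical

    var pipeline = new DevicePipeline(stageA, stageB); // hypothetical
    pipeline.run(); // both GPUs stay busy on consecutive batches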

(diagram: device to device pipeline)


I don't want the API to synchronize data between host and device for every single kernel. I want all compute() calls handled at once, and I need to deliver my own queue scheme for a specific task. I have only a single device and I don't need driver/event pipelining.

  • Enable enqueue mode on the ClNumberCruncher instance (a sketch follows this list).
  • (optional) Enable no-compute mode on the ClNumberCruncher instance ---> this is async
  • (optional) Do any compute() (for only data upload and data download) ---> this is async
  • (optional) Disable no-compute mode on the ClNumberCruncher instance ---> this is async
  • Do any number of compute() calls (for data upload, compute, data download) ---> this is async
  • (optional) Enable async enqueue mode on the ClNumberCruncher instance ---> this is async
  • (optional) Do any compute() (computes on another queue in parallel) ---> this is async
  • (optional) Disable async enqueue mode on the ClNumberCruncher instance ---> this is async
  • Do some host-side C# work ---> inherently async to the GPU
  • Disable enqueue mode on the ClNumberCruncher instance (this blocks until everything has finished)
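As a sketch of the non-optional steps, assuming enqueue mode is a boolean property named enqueueMode (the name is an assumption based on the description above):

    cr.enqueueMode = true;                  // assumed property name; start batching

    arr.compute(cr, 1, "step1", 1024, 64);  // async: only enqueued, returns immediately
    arr.compute(cr, 2, "step2", 1024, 64);  // async: only enqueued, returns immediately

    // ... host-side C# work here runs overlapped with the device queue ...

    cr.enqueueMode = false;                 // blocks until all enqueued work has finished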

(diagram: enqueue mode)


I have a lot of non-separable kernels (with atomic functions, ...), each with different buffer parameters (or at least working on different ranges of a buffer), and I also have multiple GPUs. The kernels may or may not have the same name, and they don't need to be synchronized before any other kernel. How can I schedule kernels to the right devices to minimize the time taken to compute all of them? Long story short, I need something like "tiled rendering", but just for computing with multiple GPUs, to get irresponsible amounts of performance.

  • Create ClTask instances with the .task() method instead of calling .compute() directly.
  • Add all created tasks to a ClTaskPool.
  • Create a ClDevicePool.
  • Add all devices to the ClDevicePool.
  • Feed all ClTaskPool instances to the ClDevicePool.
  • Do some asynchronous C# work.
  • Call the synchronization method of the ClDevicePool (sketched after this list).
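Put together, the flow might look like this. The ClTask, ClTaskPool, and ClDevicePool types come from the steps above, but the member names (add, feed, sync) and the task() signature are assumptions:

    // assumption: task() mirrors compute()'s parameters but returns a ClTask
    ClTask t1 = arrA.task(1, "kernelA", 1024, 64);
    ClTask t2 = arrB.task(2, "kernelB", 2048, 64);

    ClTaskPool tasks = new ClTaskPool();
    tasks.add(t1); tasks.add(t2);                   // hypothetical method name

    ClDevicePool devices = new ClDevicePool();
    devices.add(Hardware.ClPlatforms.all().gpus()); // hypothetical method name
    devices.feed(tasks);                            // hypothetical method name

    // ... asynchronous C# work ...

    devices.sync(); // hypothetical: blocks until all tasks have finished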

(diagram: enqueue mode)


There is some work to do, but the number of workers needed is unknown without parsing the data or computing first. I need the work scheduling handled by the GPU itself, and I have an OpenCL 2.0 capable graphics card.

  • Make sure to use Cekirdekler version 1.4.1+.
  • Write OpenCL kernels that use dynamic parallelism commands such as enqueue_kernel() and get_default_queue(), as in the sketch below.
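For example, a parent kernel can size and launch a child kernel entirely on the device (standard OpenCL 2.0 C; the kernel names and the doubling operation are illustrative):

    __kernel void child(__global float * data)
    {
        data[get_global_id(0)] *= 2.0f;
    }

    __kernel void parent(__global float * data, __global int * count)
    {
        if (get_global_id(0) == 0)
        {
            // the device decides how many workitems the child needs,
            // without a round-trip to the host
            enqueue_kernel(get_default_queue(),
                           CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D(count[0]),
                           ^{ child(data); });
        }
    }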

(diagram: dynamic parallelism)


I need to upload data to the GPU, repeat a kernel N times (with the same global/local size), then download the results to C#.

  • Set "Repeat count" property of ClNumberCruncher to a desired value.
  • compute() // blocks
  • Reset repeat count if not needed to repeat anymore.
  • also usable with enqueue mode
  • also usable on a series of different kernels: "kernel1 kernel2 kernel3" gets repeated with same order N times
  • if you need a counting/resetting kernel added to end of each iteration of repeats, set "repeat kernel name" field to a kernel name. Then repeating 3 consecutive kernels becomes "kernel1 kernel2 kernel3 repeatKernel kernel1 kernel2 kernel3 repeatKernel kernel1 kernel2 ...."
  • copying same kernel name in string parameter(separated with spaces, commas, ..) is also equivalent to repeating it but somewhat less efficient for some machines with less string performance.
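As a sketch, assuming the two settings are members of ClNumberCruncher named roughly after the descriptions above (the exact casing is an assumption):

    cr.repeatCount = 100;          // assumed name of the "repeat count" property
    cr.repeatKernelName = "reset"; // assumed name of the "repeat kernel name" field

    // runs: iter reset iter reset ... (100 times), then blocks until done
    arr.compute(cr, 1, "iter", 1024, 64);

    cr.repeatCount = 1;            // assumed default: back to single execution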

I have C# arrays with primitive elements, such as floats and bytes, and I need to use them directly.

  • To start compute(), the first array parameter has to be a ClArray<Type>, and it can be initialized as:

    byte [] myArray = new byte[1024];
    ClArray<byte> myWrapper = myArray; // binds itself to C# array, pins it for compute
    
  • Then you can add additional parameters of pure C# types as you wish before compute():

    byte[] myArray = new byte[1024];
    ClArray<byte> myWrapper = myArray;
    float[] myArray2 = new float[1024];
    myWrapper.nextParam(myArray2).compute();
    

    Here myArray2 could be a float[], byte[], ..., or a ClArray<float>, and multiple parameters can be added at once.

  • Using C# arrays directly has some performance pitfalls, such as extra buffer copies on the host side (from the C# array to an internal array) before data goes to the device. Creating a ClArray instance with "new" uses C++ arrays internally, which not only makes fewer copies when possible but is also aligned to 4096-byte boundaries for even faster access.

  • The single device pipeline feature can take pure C# arrays as buffers for each stage, without using any ClArray.


I don't need C# arrays; getting just a few result values after a compute is enough for me. Maximum buffer access performance between host and device is what I need.

  • There is a streaming-data parameter in the explicit device selection operation, and the same parameter exists on ClNumberCruncher when no explicit device selection is used. These enable device-side streaming. To enable streaming for buffers too, a ClArray instance with C++ array allocation (the default with "new", or ClFloatArray, ClByteArray, ...) is needed, and its zero copy field must be set before starting compute(). Zero copy means a GPU workitem directly accesses RAM without copying anything. This is the fastest access mode, and device-side streaming is enabled automatically for CPUs and integrated GPUs (if they share the same RAM with the CPU).

    // when zero copy is enabled from the buffer and streaming is enabled from the device

    // zero copy capable   (equivalent to CL_MEM_USE_HOST_PTR)
    ClArray<byte> myWrapper = new ClArray<byte>(1024);

    // zero copy capable   (equivalent to CL_MEM_USE_HOST_PTR)
    ClArray<byte> myWrapper2 = new ClByteArray(1024);

    // slower              (equivalent to CL_MEM_ALLOC_HOST_PTR),
    // but still faster than a non-streamed array
    // if data is accessed only once or twice per kernel
    ClArray<byte> myWrapper3 = new byte[1024];


    // when zero copy is disabled from the buffer or streaming is disabled from the device

    // copies, fast    (equivalent to CL_MEM_READ_WRITE), faster when data is accessed multiple times
    ClArray<byte> myWrapper4 = new ClArray<byte>(1024);

    // copies, fast    (equivalent to CL_MEM_READ_WRITE), faster when data is accessed multiple times
    ClArray<byte> myWrapper5 = new ClByteArray(1024);

    // copies, slowest (equivalent to CL_MEM_READ_WRITE), faster when data is accessed multiple times
    ClArray<byte> myWrapper6 = new byte[1024];
    
  • Enable zero copy only when data is accessed once or twice per kernel. Disable it (which makes buffers use dedicated device memory) when data is accessed multiple times, for faster access.

  • One example for streaming (with zero copy) is a c=a+b kernel: each element is accessed only once and the work is embarrassingly parallel. One example for non-streaming (with extra copies) is an "array sort" kernel sequence, which accesses the array many times (log(N) times).

  • If an array is meant to be read-only, enable the readOnly property of ClArray before any compute() to create its device buffer as CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY, which implies the host will only write to it and kernels will only read from it. If an array is meant to be write-only, enable the writeOnly property of ClArray to gain CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY flags on the device-side buffer (see the sketch below).
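Putting the hints of this section together (zeroCopy, readOnly, and writeOnly follow the property descriptions above; the exact member casing is an assumption):

    ClArray<float> a = new ClArray<float>(1024);
    a.zeroCopy = true;   // assumed name of the "zero copy" field described above
    a.readOnly = true;   // host writes only, kernels read only

    ClArray<float> c = new ClArray<float>(1024);
    c.zeroCopy = true;
    c.writeOnly = true;  // kernels write only, host reads only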

(diagram: streaming data with zero copy using ClArray)

(diagram: streaming data with zero copy using C# array)

(diagram: C# array)

(diagram: ClArray)

(images in this page were made with the "Scheme-it" tool of www.digikey.com)