Pipelining


Kernel Level Pipelining (partitioning a kernel)

This page covers only event-based and driver-based pipelining.

Pipelining is used when a system can work on different things concurrently, hiding the latencies of some stages. On a GPU, a buffer read, a buffer write and a kernel execution can happen at the same time. On more advanced (newer) GPUs, even multiple operations of the same type can execute concurrently. All of this is harnessed by using multiple command queues per OpenCL device. Even a cheap AMD R7-240 GPU can work with 16 queues concurrently in the same context, which makes it easier to approach its advertised performance limits.

Cekirdekler API supports four types of pipelining: two for separable kernels and another two for non-separable kernels. The "separable" versions are "event based" and "driver controlled".

To enable event/driver pipelining, some optional parameters of .compute() need to be adjusted and the partialRead flag must be set. If a workitem in a kernel accesses arbitrary elements of an array, that array is not applicable for partial reading, since there is no synchronization between workgroups in a kernel and pipelining needs divisible work, such as adding 3 to all elements (embarrassingly parallel). But if a workitem accesses only its own workgroup's range of the array, the array is applicable for pipelining no matter what access pattern is used within that range.

Arrays with partialRead=false are not taken into the pipeline but are read at once before pipelining begins, so the developer can still have global array access for some arrays while other arrays of the same pipelined kernel are streamed.
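
For example, here is a minimal sketch of mixing both kinds of arrays in one call. The kernel name processWithLookup is hypothetical, cr is assumed to be an already-created number cruncher as in the activation example below, and the nextParam array-chaining call is assumed to accept the same pipelining parameters as compute():

     ClArray<float> data = new ClArray<float>(1024 * 1024);
     data.partialRead = true;    // streamed blob-by-blob through the pipeline

     ClArray<float> lookup = new ClArray<float>(4096);
     lookup.partialRead = false; // read whole, once, before pipelining begins

     // hypothetical kernel: each workitem touches only its own range of "data"
     // but may read "lookup" at any index
     data.nextParam(lookup).compute(cr, 1, "processWithLookup", 1024 * 1024, 64,
                                    0, true, Cores.PIPELINE_EVENT, 4);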

Here is how pipelining is activated:

     ClArray<float> a = new ClArray<float>(1024 * 1024);
     a.partialRead = true;
     a.compute(cr, 1, "loadBalanceTest", 1024 * 1024, 64, 0,
               true,                 // enables pipelining
               Cores.PIPELINE_EVENT, // type of pipeline
               4);                   // number of blobs per queue type

The pipelining option divides each device's own global range into P parts (4 in this example); a device that received all 1024*1024 elements would thus work on four blobs of 262144 elements each. Those parts are synchronized at each sync point with events, as in this scheme:

  
     R: read, W:write,   C: compute
     |: synchronization, x: idle
        
     time..: 1   2   3   4   5   6
     queue1: R | R | R | R | x | x     [queue for only reads]
     queue2: x | C | C | C | C | x     [queue for only compute]
     queue3: x | x | W | W | W | W     [queue for only write]

If Cores.PIPELINE_DRIVER is used to select the other pipeline type, the workflow becomes driver-controlled and each blob consists of 3 back-to-back operations: read+compute+write:

 time..:   1      2
 queue1: R-C-W  R-C-W
 queue2: R-C-W  R-C-W

There is no synchronization between blobs in this type of pipelining. Consecutive operations start immediately without any event overhead and the driver handles the necessary resource allocation between queues.
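
Switching to this mode is just a matter of changing the pipeline-type flag in .compute(). A minimal sketch, reusing the wiki's loadBalanceTest example; the choice of 16 blobs here is an assumption intended to feed the 16 driver-controlled queues described below:

     ClArray<float> a = new ClArray<float>(1024 * 1024);
     a.partialRead = true;
     // same call as the event-based example, but the driver overlaps whole
     // read+compute+write blobs across queues instead of syncing stages
     a.compute(cr, 1, "loadBalanceTest", 1024 * 1024, 64, 0,
               true, Cores.PIPELINE_DRIVER, 16);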


Cekirdekler API uses 6 command queues for Cores.PIPELINE_EVENT; each queue type is duplicated, so two reads can happen at the same time, as can two computes and two writes.

Cores.PIPELINE_DRIVER runs on 16 command queues, each executing all read-compute-write parts of its blobs. Blobs are distributed evenly across the queues, so the queues start getting saturated after 16 blobs. If the data is 16MB, a 16-blob pipeline distributes 1MB of data to each command queue and all operations overlap efficiently.

Each blob also increases the load-balance grain size. If only 1 workgroup is sent to a device, it cannot form a pipeline; the developer is responsible for avoiding this error. That is why load balancing becomes harder when the number of blobs is increased to a very high value like 128: each device must then receive at least 128 workgroups and must be able to trade that amount for load balancing.
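
As a rough sanity check, here is a sketch of the arithmetic; the device count is an assumption, since the real per-device share depends on load balancing:

     int globalRange = 1024 * 1024;
     int localSize   = 64;
     int blobs       = 128;
     int deviceCount = 2;  // assumed number of devices in the context

     int totalWorkgroups     = globalRange / localSize;       // 16384
     int workgroupsPerDevice = totalWorkgroups / deviceCount; // 8192 if split evenly

     if (workgroupsPerDevice < blobs)
         Console.WriteLine("too few workgroups per device to form " + blobs + " blobs");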

Pipelining makes library-related code somewhat more complex and harder to read (and to develop further), but all in all compute() finishes quicker with it.