Skip to content

Pipelining: Single Device

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Jun 13, 2017 · 12 revisions

Device Level Pipelining (many-kernel pipelining)

After v1.2.11, there is single device pipelining sub feature in pipeline namespace. It is used to build a pipeline with all stages running concurrent. This saves %30 to %50 compute time compared to a serially built version. Host-to-device transmissions, device-to-host transmissions and all kernel executions are enqueued to different command queues automatically and double buffering is allocated and handled behind the scene to make concurrency happen.

There is an example video of (image processing) pipeline showing %30 speedup compared to serial version:

IMAGE ALT TEXT HERE (there is also a matrix multiplication example at the lower part of this page)

To build a pipeline, first DevicePipeline class is instantiated with a single device and a string holding all stages kernel codes as parameter:

DevicePipeline gpuPipeline = new DevicePipeline(deviceForCompute,@""+File.ReadAllText("..//..//..//test.cl") );
gpuPipeline.enableSerialMode(); // for testing serial mode which sends all commands "in order" to gpu.

then stages are instantiated as below:

DevicePipelineStage stage1 = new DevicePipelineStage("resize", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage2 = new DevicePipelineStage("parameterSet", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage3 = new DevicePipelineStage("gaussianBlur", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage4 = new DevicePipelineStage("rotateImgRad", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage5 = new DevicePipelineStage("blendImg", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage6 = new DevicePipelineStage("postProcess", maxImgSizeResult * maxImgSizeResult, 256);

multiple kernels can be run per stage but they are needed to be separated same way with ClArray.compute(). If kernels "resize" and "parameterSet" are needed to be executed in same stage(but one after another, using same buffers+global range+local range), "resize parameterSet" achieves the desired behavior.

then necessary buffers are created and bound to stages:

// C# or C++ buffers
ClArray<byte> stage1Input = imageBytes; // a byte[] object taken from an image by lockbits method
ClArray<byte> stage5Input = imageBlendBytes;
ClArray<int> parameters = new int[1024];
ClArray<int> acculumulator = new int[1024];
ClArray<int> parametersPipe = new int[1024];
ClArray<int> parametersPipe2 = new int[1024];
ClArray<int> parametersPipe3 = new int[1024];
ClArray<int> parametersPipe4 = new int[1024];
ClArray<int> parametersPipe5 = new int[1024];
ClArray<int> parametersPipe6 = new int[1024];
ClArray<byte> resultImage = resultImageBytes;
ClArray<byte> pipeBuffer = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer2 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer3 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer4 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer5 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);

most of these buffers are used as a data path between stages and some are inputs while some are outputs of pipeline. To define each buffer's role, DevicePipelineArray class is used. It can be input,output,transition(between stages) and internal(private to a stage, only used for sequential logic). Also C# primitive arrays can be directly used instead of ClArray_Type_ or Cl_Type_Array.

TRANSITION means a stage writes to it, next stage reads it(if its bound there too).

INPUT means host writes to it, GPU reads from it

OUTPUT means opposite of INPUT

INTERNAL means its buffer is not sent to anywhere. It should also work as read-only for all stages but not tested.

DevicePipelineArray bufInput  = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage1Input);
DevicePipelineArray bufBlendInput  = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage5Input);
var bufAccumulator = new DevicePipelineArray(DevicePipelineArrayType.INTERNAL , acculumulator);
var bufPipe1 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer);
var bufPipe2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer2);
var bufPipe3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer3);
var bufPipe4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer4);
var bufPipe5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer5);
var bufPipeParameter = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe);
var bufPipeParameter2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe2);
var bufPipeParameter3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe3);
var bufPipeParameter4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe4);
var bufPipeParameter5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe5);
var bufPipeParameter6 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe6);
var bufResult = new DevicePipelineArray(DevicePipelineArrayType.OUTPUT , resultImage);

then these instances are bound to each stage accordingly with kernel parameter order and stages' relation:

stage1.bindArray(bufInput); // 1st parameter of kernel in kernel function definition in opencl-c string (test.cl)
stage1.bindArray(new DevicePipelineArray(DevicePipelineArrayType.INPUT, parameters)); // 2nd parameter
stage1.bindArray(bufPipeParameter); // 3rd parameter of kernel
stage1.bindArray(bufPipe1); // 4th parameter in stage1's kernel definiton

stage2.bindArray(bufPipe1); // this is same buffer that stage1's 4th parameter so it carries data to this stage
stage2.bindArray(bufPipeParameter); // this also a data path from stage1 to carry image processing related settings
stage2.bindArray(bufPipeParameter2);// this is data path beginning to next stage
stage2.bindArray(bufPipe2);         // datapath beginning for (resized and parameter-set)image data to next stage
stage2.bindArray(bufAccumulator);   // sequential logic, counts steps and adjusts settings accordingly

stage3.bindArray(bufPipe2);         // each same buffer carries data to neighbor stage if same buffer is used
stage3.bindArray(bufPipeParameter2);
stage3.bindArray(bufPipeParameter3);
stage3.bindArray(bufPipe3);         // for example, this bufPipe3 is written in this stage but is read in next stage

stage4.bindArray(bufPipe3);         // in here
stage4.bindArray(bufPipeParameter3);
stage4.bindArray(bufPipeParameter4);
stage4.bindArray(bufPipe4);

stage5.bindArray(bufPipe4);
stage5.bindArray(bufPipeParameter4);
stage5.bindArray(bufPipeParameter5);
stage5.bindArray(bufPipe5);
stage5.bindArray(bufBlendInput);   // an input in this stage, reads a second image's bytes for a blend operation

stage6.bindArray(bufPipe5);
stage6.bindArray(bufPipeParameter5);
stage6.bindArray(bufPipeParameter6);
stage6.bindArray(bufResult);         // output image is resized + rotated + blurred + blended version of input image

then prepared stages are added to pipeline with increasing order of their position in the pipeline:

gpuPipeline.addStage(stage1); 
gpuPipeline.addStage(stage2);
gpuPipeline.addStage(stage3);
gpuPipeline.addStage(stage4);
gpuPipeline.addStage(stage5);
gpuPipeline.addStage(stage6);

at last the only thing left is to start feeding the pipeline:

parameters[0] = wImg;
parameters[1] = hImg;
parameters[2] = maxImgSizeResult;
parameters[3] = i + 1;
parameters[4] = 225;
parameters[5] = 50; 
parameters[6] = 50;
parameters[7] = imgBlend.Width;
parameters[8] = imgBlend.Height;

                    
// pipeline starts working here
gpuPipeline.feedAsyncBegin();
// Oland(Advanced Micro Devices, Inc.) is a 5 compute unit processor that completes all stages in 17 ms
// serial mode is 39 ms which is more than %50 slower because serial code is less efficient
// kernels, device to host transmissions, host to device transmissions and this C# code area runs concurrently
// here runs asynchronously to gpgpu codes

// and ends here
gpuPipeline.feedAsyncEnd();

there are also other ways of feeding:

gpuPipeline.feedAsync((Action)delegate { /* async host codes here */ }); // blocks until finished while running some host code

or

gpuPipeline.feed(); /* simply blocks until finished */

gpuPipeline.enableSerialMode(); makes stages run one by one and give results in single feed while enableParallelMode() makes it parallel again so result-feed latency increases to M feed steps with each step having much less time latency than serial mode so throughput increases.


Example kernel codes for the example pipeline defined in the beginning of page:

test.cl

                __kernel void resize(__global uchar * input, __global int * parameters, __global int * parametersOut, __global uchar * output){
                    int resultImgSize=parameters[2];
                    int pixelX=get_global_id(0)%resultImgSize;
                    int pixelY=get_global_id(0)/resultImgSize;
                    int imgW=parameters[0];
                    int imgH=parameters[1];
                    if(get_global_id(0)==0)
                    {
                        parametersOut[0]=imgW;
                        parametersOut[1]=imgH;
                        parametersOut[2]=resultImgSize;
                        parametersOut[3]=parameters[3];
                        parametersOut[4]=parameters[4];
                        parametersOut[5]=parameters[5];
                        parametersOut[6]=parameters[6];
                        parametersOut[7]=parameters[7];
                        parametersOut[8]=parameters[8];
                        parametersOut[9]=parameters[9];                   
                    }



                    int imgX= floor((((float)imgW)/(float)resultImgSize) * pixelX) ;
                    int imgY= floor((((float)imgH)/(float)resultImgSize) * pixelY) ;

                    int pixelIndex=(imgX+imgY*imgW)*4;
                    uchar r=input[pixelIndex];
                    uchar g=input[pixelIndex+1];
                    uchar b=input[pixelIndex+2];
                    uchar a=input[pixelIndex+3];

                    output[(pixelX+pixelY*resultImgSize)*4]=r;
                    output[(pixelX+pixelY*resultImgSize)*4+1]=g;
                    output[(pixelX+pixelY*resultImgSize)*4+2]=b;
                    output[(pixelX+pixelY*resultImgSize)*4+3]=a;
                }

                __kernel void gaussianBlur(__global uchar * input, __global int * parameters, __global int * parametersOut, __global uchar * output)
                {
                    int resultImgSize=parameters[2];
                    int pixelX=get_global_id(0)%resultImgSize;
                    int pixelY=get_global_id(0)/resultImgSize;

                    int imgW=resultImgSize;
                    int imgH=resultImgSize;
                    if(get_global_id(0)==0)
                    {
                        parametersOut[0]=imgW;
                        parametersOut[1]=imgH;
                        parametersOut[2]=resultImgSize;
                        parametersOut[3]=parameters[3];
                        parametersOut[4]=parameters[4];
                        parametersOut[5]=parameters[5];
                        parametersOut[6]=parameters[6];
                        parametersOut[7]=parameters[7];
                        parametersOut[8]=parameters[8];                   
                        parametersOut[9]=parameters[9];                   
                    }

                    int totalPixelsProcessed=0;
                    int r=0;
                    int g=0;
                    int b=0;
                    int a=0;
                    int centerWeight=parametersOut[5];
                    int outerWeight=parametersOut[6];
                    for(int i=0-3;i<3;i++)
                        for(int j=-3;j<3;j++)
                        {
                             int pixelPickX=pixelX+j;
                             int pixelPickY=pixelY+i;
                             if((pixelPickX>=0) && (pixelPickX<imgW))
                             {
                                if((pixelPickY>=0) && (pixelPickY<imgH))
                                {
                                    int pixelIndex=(pixelPickX+pixelPickY*imgW)*4;
                                    int weight=0;
                                    if(i==0 && j==0)
                                    {
                                        weight=centerWeight;

                                    }
                                    else
                                    {
                                        weight=outerWeight;
                                    }
                                    r+=input[pixelIndex]*weight;
                                    g+=input[pixelIndex+1]*weight;
                                    b+=input[pixelIndex+2]*weight;
                                    a+=input[pixelIndex+3]*weight;


                                    totalPixelsProcessed+=weight;
                                }
                             }
                        }
                    output[(pixelX+pixelY*resultImgSize)*4]=r/totalPixelsProcessed;
                    output[(pixelX+pixelY*resultImgSize)*4+1]=g/totalPixelsProcessed;
                    output[(pixelX+pixelY*resultImgSize)*4+2]=b/totalPixelsProcessed;
                    output[(pixelX+pixelY*resultImgSize)*4+3]=255;
                }

                __kernel void rotateImgRad(__global uchar * input, __global int * parameters, __global int * parametersOut, __global uchar * output)
                {
                    int resultImgSize=parameters[2];
                    int pixelX=get_global_id(0)%resultImgSize;
                    int pixelY=get_global_id(0)/resultImgSize;
                    float radians=((float)parameters[3])/(float)parameters[4];
                    int imgW=resultImgSize;
                    int imgH=resultImgSize;

                    if(get_global_id(0)==0)
                    {
                        parametersOut[0]=imgW;
                        parametersOut[1]=imgH;
                        parametersOut[2]=resultImgSize;
                        parametersOut[3]=parameters[3];
                        parametersOut[4]=parameters[4];
                        parametersOut[5]=parameters[5];
                        parametersOut[6]=parameters[6];
                        parametersOut[7]=parameters[7];
                        parametersOut[8]=parameters[8];
                        parametersOut[9]=parameters[9];
                    }

                    int xc = imgW/2;
                    int yc = imgH/2;
                    int newx = ((float)pixelX-xc)*cos(radians) - ((float)pixelY-yc)*sin(radians) + xc;
                    int newy = ((float)pixelX-xc)*sin(radians) + ((float)pixelY-yc)*cos(radians) + yc;
                    if (newx >= 0 && newx < imgW && newy >= 0 && newy < imgH)
                    {
                        output[(pixelX+pixelY*resultImgSize)*4]  = input[(newx+newy*resultImgSize)*4];
                        output[(pixelX+pixelY*resultImgSize)*4+1]= input[(newx+newy*resultImgSize)*4+1];
                        output[(pixelX+pixelY*resultImgSize)*4+2]= input[(newx+newy*resultImgSize)*4+2];
                        output[(pixelX+pixelY*resultImgSize)*4+3]= input[(newx+newy*resultImgSize)*4+3];
                    }
                    else
                    {
                        output[(pixelX+pixelY*resultImgSize)*4]  = 0;
                        output[(pixelX+pixelY*resultImgSize)*4+1]= 0;
                        output[(pixelX+pixelY*resultImgSize)*4+2]= 0;
                        output[(pixelX+pixelY*resultImgSize)*4+3]= 255;
                    }
                }

                __kernel void parameterSet(__global uchar * input, __global int * parameters, __global int * parametersOut, __global uchar * output, __global int * accumulator)
                {
                    int resultImgSize=parameters[2];
                    int pixelX=get_global_id(0)%resultImgSize;
                    int pixelY=get_global_id(0)/resultImgSize;
                    float radians=((float)parameters[3])/(float)parameters[4];
                    int imgW=resultImgSize;
                    int imgH=resultImgSize;

                    if(get_global_id(0)==0)
                    {
                        parametersOut[0]=imgW;
                        parametersOut[1]=imgH;
                        parametersOut[2]=resultImgSize;
                        parametersOut[3]=parameters[3]+accumulator[0];accumulator[0]++;
                        parametersOut[4]=parameters[4];

                        int maxWeight=30;
                        int minWeight=0;
                        float currentWeight=sin(((float)accumulator[0]/5.0f));
                        parametersOut[5]=70-currentWeight*maxWeight;
                        parametersOut[6]=30+currentWeight*maxWeight;
                        parametersOut[7]=parameters[7];
                        parametersOut[8]=parameters[8];
                        parametersOut[9]=fabs(sin(((float)accumulator[0]/50.0f)))*100;
                    }

                    output[(pixelX+pixelY*resultImgSize)*4]  = input[(pixelX+pixelY*resultImgSize)*4]  ;
                    output[(pixelX+pixelY*resultImgSize)*4+1]= input[(pixelX+pixelY*resultImgSize)*4+1];
                    output[(pixelX+pixelY*resultImgSize)*4+2]= input[(pixelX+pixelY*resultImgSize)*4+2];
                    output[(pixelX+pixelY*resultImgSize)*4+3]= input[(pixelX+pixelY*resultImgSize)*4+3];
                }

                __kernel void blendImg(__global uchar * input, __global int * parameters, __global int * parametersOut, __global uchar * output,__global uchar * inputBlend)
                {
                    int resultImgSize=parameters[2];
                    int pixelX=get_global_id(0)%resultImgSize;
                    int pixelY=get_global_id(0)/resultImgSize;
                    float radians=((float)parameters[3])/(float)parameters[4];
                    int imgW=resultImgSize;
                    int imgH=resultImgSize;

                    if(get_global_id(0)==0)
                    {
                        parametersOut[0]=imgW;
                        parametersOut[1]=imgH;
                        parametersOut[2]=resultImgSize;
                        parametersOut[3]=parameters[3];
                        parametersOut[4]=parameters[4];
                        parametersOut[5]=parameters[5];
                        parametersOut[6]=parameters[6];
                        parametersOut[7]=parameters[7];
                        parametersOut[8]=parameters[8];
                        parametersOut[9]=parameters[9];
                    }
                    int w=parametersOut[7];
                    int h=parametersOut[8];
					uchar sampleR=0;
					uchar sampleG=0;
					uchar sampleB=0;
					if(pixelX<w && pixelY<h)
					{ 
						sampleR=inputBlend[(pixelX+pixelY*w)*4]  	 ;
						sampleG=inputBlend[(pixelX+pixelY*w)*4+1]	 ;
						sampleB=inputBlend[(pixelX+pixelY*w)*4+2]	 ;
					}
					
                    output[(pixelX+pixelY*resultImgSize)*4]  =mix(input[(pixelX+pixelY*resultImgSize)*4]    ,sampleR,  0.4f-0.4f*parameters[9]/101.0f);
                    output[(pixelX+pixelY*resultImgSize)*4+1]=mix(input[(pixelX+pixelY*resultImgSize)*4+1]  ,sampleG,0.4f-0.4f*parameters[9]/101.0f);
                    output[(pixelX+pixelY*resultImgSize)*4+2]=mix(input[(pixelX+pixelY*resultImgSize)*4+2]  ,sampleB,0.4f-0.4f*parameters[9]/101.0f);
                    output[(pixelX+pixelY*resultImgSize)*4+3]=255;

                }

                __kernel void postProcess(__global uchar * input, __global int * parameters, __global int * parametersOut, __global uchar * output)
                {
                    int resultImgSize=parameters[2];
                    int pixelX=get_global_id(0)%resultImgSize;
                    int pixelY=get_global_id(0)/resultImgSize;
                    float radians=((float)parameters[3])/(float)parameters[4];
                    int imgW=resultImgSize;
                    int imgH=resultImgSize;

                    if(get_global_id(0)==0)
                    {
                        parametersOut[0]=imgW;
                        parametersOut[1]=imgH;
                        parametersOut[2]=resultImgSize;
                        parametersOut[3]=parameters[3];
                        parametersOut[4]=parameters[4];
                        parametersOut[5]=parameters[5];
                        parametersOut[6]=parameters[6];
                        parametersOut[7]=parameters[7];
                        parametersOut[8]=parameters[8];
                        parametersOut[9]=parameters[9];
                    }
                    int w=parametersOut[7];
                    int h=parametersOut[8];
                    output[(pixelX+pixelY*resultImgSize)*4]  =input[(pixelX+pixelY*resultImgSize)*4]  ;
                    output[(pixelX+pixelY*resultImgSize)*4+1]=input[(pixelX+pixelY*resultImgSize)*4+1];
                    output[(pixelX+pixelY*resultImgSize)*4+2]=input[(pixelX+pixelY*resultImgSize)*4+2];
                    output[(pixelX+pixelY*resultImgSize)*4+3]=255;

                }

SGEMM(matrix multiplication pipeline example)

D=A*B+C

  • transpose B
  • multiply with A
  • add to C

first stage reads only B and transposes and passes it to an array

second stage gets transposed B and also reads A and does the multiplication and writes result to an array

last stage reads result, reads C and adds then writes result to an array

here are host codes:

           const int matrixSize = 2048;
           ...
           // 2048x2048 matrix multiplication and addition pipeline
            // D=AxB+C ----> step1=transpose B, step2=multiply A and B, step3=add result and C
            float[] matrixA = new float[matrixSize* matrixSize]; // host side variables
            float[] matrixB = new float[matrixSize * matrixSize];
            float[] matrixC = new float[matrixSize * matrixSize];
            float[] matrixD = new float[matrixSize * matrixSize];

            // pipeline builder, uses single gpu (R7-240 here, 320 cores @900 MHz)
            DevicePipeline sgemmPipeline = new DevicePipeline(ClPlatforms.all().gpus()[0], kernels, 16);
            //sgemmPipeline.enableSerialMode(); // 218 ms
            sgemmPipeline.enableParallelMode(); // overlapping stages now 186 ms nearly %15 faster because of less idle times

            // first it transposes B
            DevicePipelineStage transposeB = new DevicePipelineStage("transposeB", (matrixSize * matrixSize)/64, 64);
            DevicePipelineArray bufB = new DevicePipelineArray(DevicePipelineArrayType.INPUT, matrixB);
            ClArray<float> dataFlowForTransposedB = new ClArray<float>(matrixSize * matrixSize);
            DevicePipelineArray bufDataFlowB = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, dataFlowForTransposedB);

            transposeB.bindArray(bufB);
            transposeB.bindArray(bufDataFlowB);

            // next stage multiplies A and B
            DevicePipelineStage multiplyAB = new DevicePipelineStage("multiplyAB", (matrixSize * matrixSize)/4, 256);
            DevicePipelineArray bufA = new DevicePipelineArray(DevicePipelineArrayType.INPUT, matrixA);
            ClArray<float> dataFlowForMultResult = new ClArray<float>(matrixSize * matrixSize);
            DevicePipelineArray bufDataFlowMultResult = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, dataFlowForMultResult);

            multiplyAB.bindArray(bufA);
            multiplyAB.bindArray(bufDataFlowB);
            multiplyAB.bindArray(bufDataFlowMultResult);

            // result is added C
            DevicePipelineStage addC = new DevicePipelineStage("addC", matrixSize * matrixSize, 256);
            DevicePipelineArray bufC = new DevicePipelineArray(DevicePipelineArrayType.INPUT, matrixC);
            DevicePipelineArray bufD = new DevicePipelineArray(DevicePipelineArrayType.OUTPUT, matrixD);

            addC.bindArray(bufC);
            addC.bindArray(bufDataFlowMultResult);
            addC.bindArray(bufD);

            sgemmPipeline.addStage(transposeB);
            sgemmPipeline.addStage(multiplyAB);
            sgemmPipeline.addStage(addC);

            for (int i = 0; i < 35; i++)
            {
                matrixA[0] = 1000*i;
                matrixB[0] = 1000;
                matrixC[0] = 1000*i;
                benchStart();
                sgemmPipeline.feed(); // running
                benchStop("sgemm");
                Console.WriteLine(matrixD[0]);
            }

here are the kernel codes:

            __constant int size=" + matrixSize + @";
            __kernel void transposeB(__global float * b, __global float *t)
            {
                int id=get_global_id(0);

                int ls=64;
         
                int gid = id / ls;
                int gx=gid % (size/64);
                int gy=gid / (size/64);

                int blockX = gx*64;
                int blockY = gy*64;

                int tBlockX= blockY;
                int tBlockY= blockX;
                int lid=get_local_id(0);
                __local float block[64][64];
                __local float block2[64][64];

                // load
                for(int k=0;k<64;k++)
                {
                    block[lid][k]=b[(blockX+lid)+((blockY+k) * size)];
                }
                barrier(CLK_LOCAL_MEM_FENCE);

                // transpose tiled
                for(int k=0;k<64;k++)
                {
                       block2[lid][k] = block[k][lid];   
                }
                barrier(CLK_LOCAL_MEM_FENCE);


                // store
                for(int k=0;k<64;k++)
                {
                    t[(tBlockX+lid)+((tBlockY+k) * size)]=block2[lid][k];
                }
            }

            __kernel void multiplyAB(__global float *a,__global float *b,__global float * result)
             {
                int id=get_global_id(0);
                int lId=get_local_id(0);
                __local float tileA[32][32];
                __local float tileB[32][32];
        
                int lx=lId%16;
                int ly=lId/16;
                
                int gid=id/256;
                int gx=gid%(size/32);
                int gy=gid/(size/32);

                int x=gx*32;
                int y=gy*32;
                int n=size/32; // number of tiles in a row



                float accumulator=0.0f;
                float accumulator2=0.0f;
                float accumulator3=0.0f;
                float accumulator4=0.0f;
                int ly16=ly+16;
                int lx16=lx+16;
                for(int i=0;i<n;i++)
                {
                    // select tile from row in A
                    tileA[lx][ly]=a[(i*32+lx)+(y+ly)*size];
                    tileA[lx+16][ly]=a[(i*32+lx+16)+(y+ly)*size];
                    tileA[lx][ly+16]=a[(i*32+lx)+(y+ly+16)*size];
                    tileA[lx+16][ly+16]=a[(i*32+lx+16)+(y+ly+16)*size];

                    // select tile from row in transposed B 
                    tileB[ly][lx]=b[(i*32+lx)+(x+ly)*size];
                    tileB[ly16][lx]=b[(i*32+lx)+(x+ly16)*size];
                    tileB[ly][lx16]=b[(i*32+lx16)+(x+ly)*size];
                    tileB[ly16][lx16]=b[(i*32+lx16)+(x+ly16)*size];

                    barrier(CLK_LOCAL_MEM_FENCE);   
                    

                    // multiply tiles
                    for(int e=0;e<32;e++)
                    {
                        float tmp1=tileA[e][ly];
                        float tmp2=tileB[e][lx];
                        accumulator +=tmp1*tmp2; 
                        float tmp3=tileA[e][ly16];
                        accumulator2+=tmp3*tmp2; 
                        float tmp4=tileB[e][lx16];
                        accumulator3+=tmp1*tmp4; 
                        accumulator4+=tmp3*tmp4; 
                    }


     
                }
                // write tileC to result
                result[(x+lx)+(y+ly)*size]=accumulator;
                barrier(CLK_GLOBAL_MEM_FENCE);   
                result[(x+lx+16)+(y+ly)*size]=accumulator2;
                barrier(CLK_GLOBAL_MEM_FENCE);   
                result[(x+lx)+(y+ly+16)*size]=accumulator3;
                barrier(CLK_GLOBAL_MEM_FENCE);   
                result[(x+lx+16)+(y+ly+16)*size]=accumulator4;

                                    // 32 by 32 tiles multiplication but still inefficient
            }  

         
            __kernel void addC(__global float *c,__global float * multResult,__global float * result)
             {
                int id=get_global_id(0);
                int i=id%size;
                int j=id/size;
                int index=j*size+i;

                result[index]=multResult[index]+c[index];
            }