Benchmarks: Post your 1D NBODY scores

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Jun 15, 2017 · 16 revisions

If you have many GPUs, disable the streaming and zero-copy options. The n-body algorithm accesses each particle N times, so with 8 GPUs an element in RAM is accessed 8*N times, and every one of those accesses goes through the PCI-e bridge. When streaming is disabled, each GPU receives its own copy of the data and then works on its dedicated memory.
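A rough back-of-the-envelope comparison of bus traffic (a sketch: the 8-GPU, 32k-particle, one-float4-per-particle numbers are illustrative, and real drivers cache reads, so the streamed figure is an upper bound):

```python
# Rough estimate of PCI-e traffic: streaming/zero-copy vs. dedicated copies.
# Illustrative assumptions: 8 GPUs, 32k particles, one float4 (16 bytes) each.
n_particles = 1024 * 32
n_gpus = 8
bytes_per_particle = 16  # one float4

# Streaming/zero-copy: every GPU re-reads each particle from host RAM once
# per interaction pass, i.e. each element is accessed n_gpus * N times total.
streamed_bytes = n_gpus * n_particles * n_particles * bytes_per_particle

# Dedicated copies: each GPU receives the data once, then reads local VRAM.
copied_bytes = n_gpus * n_particles * bytes_per_particle

print(streamed_bytes // copied_bytes)  # ratio equals n_particles (32768)
```

The ratio grows linearly with the particle count, which is why the copy-once strategy wins for many-GPU setups despite the extra upfront transfer.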

32k-particle test (uses streaming and zero copy for performance):

int numberOfParticles = 1024 * 32 ;
ClPlatforms platforms = ClPlatforms.all();

var selectedDevices = platforms.devicesWithMostComputeUnits().gpus(true);

selectedDevices.logInfo();

ClNumberCruncher gpu = new ClNumberCruncher(selectedDevices, @"
     __kernel void algorithmTest(__global float4 * a,__global float4 * b)
     { 
          int i=get_global_id(0);
          float4 accumulator1=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 accumulator2=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 accumulator3=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 accumulator4=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 a0_1=a[i];
          float4 a0_2=a0_1.s1230;
          float4 a0_3=a0_1.s2301;
          float4 a0_4=a0_1.s3012;
          for(int j=0;j<" + (numberOfParticles/4) + @";j++)
          {
              float4 a1=a[j];
              float4 difference1=a0_1-a1;
              accumulator1+=sqrt(difference1*difference1);
              float4 difference2=a0_2-a1;
              accumulator2+=sqrt(difference2*difference2);
              float4 difference3=a0_3-a1;
              accumulator3+=sqrt(difference3*difference3);
              float4 difference4=a0_4-a1;
              accumulator4+=sqrt(difference4*difference4);
          }
          b[i]+=(accumulator1+accumulator2+accumulator3+accumulator4);
     }
");
ClArray<float> f = new ClArray<float>(numberOfParticles);
f.numberOfElementsPerWorkItem = 4;
f.readOnly = true;
f.zeroCopy = true;
f.write = false; // not needed after readOnly = true
ClArray<float> g = new ClArray<float>(numberOfParticles);
g.numberOfElementsPerWorkItem = 4;
g.zeroCopy=true;
g.read = false;
gpu.performanceFeed = true;
for (int i = 0; i < 100; i++)
     f.nextParam(g).compute(gpu, 1, "algorithmTest", numberOfParticles/4, 64);
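The kernel packs four particles into each float4 and covers four rotations of its own float4 through the `.s1230`/`.s2301`/`.s3012` swizzles, which is why the global range in the compute call is `numberOfParticles/4`. A quick sanity check of the launch geometry (a sketch; the local size of 64 comes from the compute call above):

```python
# Launch geometry for the 32k-particle benchmark.
number_of_particles = 1024 * 32
global_size = number_of_particles // 4   # one work item per float4
local_size = 64                          # from the compute(...) call

# The global range must divide evenly into work groups of local_size.
assert global_size % local_size == 0
print(global_size)  # 8192 work items, matching the logs below
```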

1M-particle test (streaming and zero copy disabled for many-GPU performance):

int numberOfParticles = 1024 * 1024;

ClPlatforms platforms1 = ClPlatforms.all();

var selectedDevices1 = platforms1.devicesWithMostComputeUnits().gpus();

selectedDevices1.logInfo();

ClNumberCruncher gpu00 = new ClNumberCruncher(selectedDevices1, @"
     __kernel void algorithmTest(__global float4 * a,__global float4 * b)
     { 
          int i=get_global_id(0);
          float4 accumulator1=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 accumulator2=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 accumulator3=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 accumulator4=(float4)(0.0f,0.0f,0.0f,0.0f);
          float4 a0_1=a[i];
          float4 a0_2=a0_1.s1230;
          float4 a0_3=a0_1.s2301;
          float4 a0_4=a0_1.s3012;
          for(int j=0;j<" + (numberOfParticles / 4) + @";j++)
          {
              float4 a1=a[j];
              float4 difference1=a0_1-a1;
              accumulator1+=sqrt(difference1*difference1);
              float4 difference2=a0_2-a1;
              accumulator2+=sqrt(difference2*difference2);
              float4 difference3=a0_3-a1;
              accumulator3+=sqrt(difference3*difference3);
              float4 difference4=a0_4-a1;
              accumulator4+=sqrt(difference4*difference4);
          }
          b[i]+=(accumulator1+accumulator2+accumulator3+accumulator4);
    }
");
ClArray<float> f = new ClArray<float>(numberOfParticles);
f.readOnly = true;
f.numberOfElementsPerWorkItem = 4;
f.write = false;// not needed after readOnly=true 
ClArray<float> g = new ClArray<float>(numberOfParticles);
g.numberOfElementsPerWorkItem = 4;
g.read = false; // not using writeOnly because kernel code reads and increments its elements.
gpu00.performanceFeed = true;
for (int i = 0; i < 100; i++)
     f.nextParam(g).compute(gpu00, 1, "algorithmTest", numberOfParticles / 4, 64);


32k particles Results

Development system (stream/zero copy enabled):

Compute-ID: 1  ----- Load Distributions:  [.8%] - [99.2%] -------------------------------------------------------
Device 0(stream): AMD FX(tm)-8150 Eight-Core Proce ||| time: 342.29ms, workitems: 64
Device 1(gddr): Oland                              ||| time: 34.08ms, workitems: 8,128
-----------------------------------------------------------------------------------------------------------------

It seems the brute-force n-body algorithm overwhelms the CPU. Maybe the CPU doesn't have enough registers for all threads and inevitably spills to RAM, even for __private OpenCL variables in the kernel, or maybe it's just the drivers.

So, taking only the R7-240, it computes about 32 giga square roots per second.
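That figure can be reproduced from the launch geometry (a sketch; the 34.08 ms timing is the R7-240's result from the log above):

```python
# Square-root throughput of the R7-240 in the 32k-particle run.
number_of_particles = 1024 * 32
work_items = number_of_particles // 4        # 8192 work items
iterations = number_of_particles // 4        # inner-loop trip count, 8192
sqrts_per_iteration = 4 * 4                  # four sqrt() calls on float4

total_sqrts = work_items * iterations * sqrts_per_iteration  # 2**30
seconds = 34.08e-3                           # R7-240 time from the log
print(total_sqrts / seconds / 1e9)           # ~31.5 giga-sqrt/s
```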


Amd RX550 (stream/zero copy enabled):

Compute-ID: 1  ----- Load Distributions:  [100.0%] --------------------------------------------------------------
Device 0(gddr): gfx804                             ||| time: 11.86ms, workitems: 8,192
-----------------------------------------------------------------------------------------------------------------

Amd RX-550 (beta drivers) and R7-240 (stream/zero copy enabled):

Compute-ID: 1  ----- Load Distributions:  [71.9%] - [28.1%] -----------------------------------------------------
Device 0(gddr): gfx804                             ||| time: 9.15ms, workitems: 5,888
Device 1(gddr): Oland                              ||| time: 9.25ms, workitems: 2,304
-----------------------------------------------------------------------------------------------------------------

The RX 550 has 512 cores at 1300 MHz while the R7-240 has 320 cores at 900 MHz, for an estimated performance ratio of about 2.3:1. The benchmark's 72/28 load split is nearly 2.6:1, so the RX 550 must have some architectural optimizations, or the difference comes from its 112 GB/s memory bandwidth versus the R7-240's 25 GB/s.
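The estimated and measured ratios work out as follows (a sketch using the core counts, clocks, and load split quoted above):

```python
# Estimated raw-throughput ratio from core counts and clocks.
rx550 = 512 * 1300   # cores * MHz
r7240 = 320 * 900
print(rx550 / r7240)   # ~2.31x estimated

# Measured ratio from the load distribution in the log above.
print(71.9 / 28.1)     # ~2.56x measured
```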


Nvidia GTX-660ti (stream/zero copy enabled):

Compute-ID: 1  ----- Load Distributions:  [100.0%] --------------------------------------------------------------
Device 0(gddr): GeForce GTX 660 Ti                 ||| time: 12.5ms, workitems: 8,192
-----------------------------------------------------------------------------------------------------------------

Contributor: cmisztur

ASUS Z270A, 8 GPU build, production (stream/zero copy enabled)

1 cores are chosen for compute(equals to device partition cores).
1 cores are chosen for compute(equals to device partition cores).
---------
Selected devices:
#0: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#1: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#2: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#3: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#4: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#5: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#6: Ellesmere(Advanced Micro Devices, Inc.)                             number of compute units:  36    type:GPU      memory: GB
#7: GeForce GTX 1070(NVIDIA Corporation)                                number of compute units:  15    type:GPU      memory: GB
#8: Intel(R) HD Graphics 510(Intel(R) Corporation)                      number of compute units:  12    type:GPU      memory: 3.14GB
#9: Intel(R) Pentium(R) CPU G4400 @ 3.30GHz(Intel(R) Corporation)       number of compute units:   1    type:CPU      memory: 3.87GB
#10: Intel(R) Pentium(R) CPU G4400 @ 3.30GHz(GenuineIntel)              number of compute units:   1    type:CPU      memory: 3.87GB
---------

Compute-ID: 1  ----- Load Distributions:  [25.8%] - [10.9%] - [5.5%] - [5.5%] - [4.7%] - [25.8%] - [6.3%] - [2.3%] - [3.1%] - [4.7%] - [5.5%] -------------------------------------------------
Device 0(gddr): Ellesmere                          ||| time: 57.67ms, workitems: 2,112
Device 1(gddr): Ellesmere                          ||| time: 39.31ms, workitems: 896
Device 2(gddr): Ellesmere                          ||| time: 58.25ms, workitems: 448
Device 3(gddr): Ellesmere                          ||| time: 62.35ms, workitems: 448
Device 4(gddr): Ellesmere                          ||| time: 59.54ms, workitems: 384
Device 5(gddr): Ellesmere                          ||| time: 40ms, workitems: 2,112
Device 6(gddr): Ellesmere                          ||| time: 58.78ms, workitems: 512
Device 7(stream): Intel(R) Pentium(R) CPU G4400 @  ||| time: 59.2ms, workitems: 192
Device 8(gddr): GeForce GTX 1070                   ||| time: 57.88ms, workitems: 256
Device 9(stream): Intel(R) HD Graphics 510         ||| time: 62.13ms, workitems: 384
Device 10(stream): Intel(R) Pentium(R) CPU G4400 @ ||| time: 56.39ms, workitems: 448
-----------------------------------------------------------------------------------------------------------------

Contributor: cmisztur

VMware Virtual Machine, development, CPU only.

1 cores are chosen for compute(equals to device partition cores).
---------
Selected devices:
#0: Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz(Intel(R) Corporati  number of compute units:   1    type:CPU      memory: 4GB
---------

Compute-ID: 1  ----- Load Distributions:  [100.0%] --------------------------------------------------------------
Device 0(stream): Intel(R) Xeon(R) CPU           E ||| time: 1,901.54ms, workitems: 8,192
-----------------------------------------------------------------------------------------------------------------



1M particles Results

Compute-ID: 1  ----- Load Distributions:  [70.2%] - [29.8%] -----------------------------------------------------
Device 0(gddr): gfx804                             ||| time: 8,184.47ms, workitems: 184,000
Device 1(gddr): Oland                              ||| time: 8,254.51ms, workitems: 78,144
-----------------------------------------------------------------------------------------------------------------
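For scale, the same throughput arithmetic applied to the 1M-particle run (a sketch; both timings come from the log above, and since the two GPUs run concurrently the slower device bounds the wall time):

```python
# Combined square-root throughput for the 1M-particle two-GPU run.
number_of_particles = 1024 * 1024
work_items = number_of_particles // 4
iterations = number_of_particles // 4
sqrts_per_iteration = 4 * 4              # four sqrt() calls on float4

total_sqrts = work_items * iterations * sqrts_per_iteration  # 2**40
wall_seconds = 8.25451                   # slower device (Oland) from the log
print(total_sqrts / wall_seconds / 1e9)  # ~133 giga-sqrt/s across both GPUs
```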