Benchmarks: Post your 1D NBODY scores
If you have many GPUs, disable the streaming and zero-copy options. The N-body algorithm accesses each particle N times, so with 8 GPUs an element in RAM is accessed N*8 times, and every access goes through the PCIe bridge. When streaming is disabled, each GPU receives its own copy of the data and then works out of its dedicated memory.
32k particles test (uses streaming and zero copy for performance):
int numberOfParticles = 1024 * 32;
ClPlatforms platforms = ClPlatforms.all();
var selectedDevices = platforms.devicesWithMostComputeUnits().gpus(true);
selectedDevices.logInfo();
ClNumberCruncher gpu = new ClNumberCruncher(selectedDevices, @"
    __kernel void algorithmTest(__global float4 * a, __global float4 * b)
    {
        int i = get_global_id(0);
        float4 accumulator1 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 accumulator2 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 accumulator3 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 accumulator4 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        // a[i] and its three lane rotations, so each lane of a[j]
        // is compared with all four components of a[i]
        float4 a0_1 = a[i];
        float4 a0_2 = a0_1.s1230;
        float4 a0_3 = a0_1.s2301;
        float4 a0_4 = a0_1.s3012;
        for (int j = 0; j < " + (numberOfParticles / 4) + @"; j++)
        {
            float4 a1 = a[j];
            float4 difference1 = a0_1 - a1;
            accumulator1 += sqrt(difference1 * difference1);
            float4 difference2 = a0_2 - a1;
            accumulator2 += sqrt(difference2 * difference2);
            float4 difference3 = a0_3 - a1;
            accumulator3 += sqrt(difference3 * difference3);
            float4 difference4 = a0_4 - a1;
            accumulator4 += sqrt(difference4 * difference4);
        }
        b[i] += (accumulator1 + accumulator2 + accumulator3 + accumulator4);
    }
");
// f: kernel input
ClArray<float> f = new ClArray<float>(numberOfParticles);
f.numberOfElementsPerWorkItem = 4;
f.readOnly = true;
f.zeroCopy = true;
f.write = false;
// g: kernel output
ClArray<float> g = new ClArray<float>(numberOfParticles);
g.numberOfElementsPerWorkItem = 4;
g.zeroCopy = true;
g.read = false;
gpu.performanceFeed = true;
for (int i = 0; i < 100; i++)
    f.nextParam(g).compute(gpu, 1, "algorithmTest", numberOfParticles / 4, 64);
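To check the kernel's output on a small input, a scalar reference can help. The sketch below is plain C rather than OpenCL C, and `algorithm_test_reference` and `abs_diff` are hypothetical names, not part of the library. It assumes the arrays are laid out as consecutive groups of four floats, matching the float4 view in the kernel. It uses an absolute value in place of sqrt(d*d), which is mathematically the same, and it accumulates in a different order than the kernel, so the last bits may differ on large inputs.

```c
#include <stddef.h>

/* |x - y|: the kernel computes sqrt(d*d), which is mathematically the same value. */
static float abs_diff(float x, float y)
{
    float d = x - y;
    return d < 0.0f ? -d : d;
}

/*
 * Scalar reference for work item i of algorithmTest (hypothetical helper).
 * a and b each hold n floats, n divisible by 4, viewed as n/4 float4 groups.
 */
void algorithm_test_reference(const float *a, float *b, size_t n, size_t i)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (size_t j = 0; j < n / 4; j++) {
        /* r selects the rotation of a[i]: s0123, s1230, s2301, s3012 */
        for (size_t r = 0; r < 4; r++) {
            /* k is the float4 lane */
            for (size_t k = 0; k < 4; k++)
                acc[k] += abs_diff(a[4 * i + (k + r) % 4], a[4 * j + k]);
        }
    }
    /* same as b[i] += accumulator1 + ... + accumulator4 in the kernel */
    for (size_t k = 0; k < 4; k++)
        b[4 * i + k] += acc[k];
}
```

With a = {0, 1, 2, 3} and b zeroed, calling it for work item 0 leaves b = {6, 4, 4, 6}.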
1M particles test (streaming and zero copy disabled for many-GPU performance):
int numberOfParticles = 1024 * 1024;
ClPlatforms platforms1 = ClPlatforms.all();
var selectedDevices1 = platforms1.devicesWithMostComputeUnits().gpus();
selectedDevices1.logInfo();
ClNumberCruncher gpu00 = new ClNumberCruncher(selectedDevices1, @"
    __kernel void algorithmTest(__global float4 * a, __global float4 * b)
    {
        int i = get_global_id(0);
        float4 accumulator1 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 accumulator2 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 accumulator3 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 accumulator4 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        float4 a0_1 = a[i];
        float4 a0_2 = a0_1.s1230;
        float4 a0_3 = a0_1.s2301;
        float4 a0_4 = a0_1.s3012;
        for (int j = 0; j < " + (numberOfParticles / 4) + @"; j++)
        {
            float4 a1 = a[j];
            float4 difference1 = a0_1 - a1;
            accumulator1 += sqrt(difference1 * difference1);
            float4 difference2 = a0_2 - a1;
            accumulator2 += sqrt(difference2 * difference2);
            float4 difference3 = a0_3 - a1;
            accumulator3 += sqrt(difference3 * difference3);
            float4 difference4 = a0_4 - a1;
            accumulator4 += sqrt(difference4 * difference4);
        }
        b[i] += (accumulator1 + accumulator2 + accumulator3 + accumulator4);
    }
");
ClArray<float> f = new ClArray<float>(numberOfParticles);
f.readOnly = true;
f.numberOfElementsPerWorkItem = 4;
f.write = false; // not needed after readOnly = true
ClArray<float> g = new ClArray<float>(numberOfParticles);
g.numberOfElementsPerWorkItem = 4;
g.read = false; // not using writeOnly because the kernel code reads and increments its elements
gpu00.performanceFeed = true;
for (int i = 0; i < 100; i++)
    f.nextParam(g).compute(gpu00, 1, "algorithmTest", numberOfParticles / 4, 64);
32k particles Results
Development system (stream/zero copy is enabled):
Compute-ID: 1 ----- Load Distributions: [.8%] - [99.2%] -------------------------------------------------------
Device 0(stream): AMD FX(tm)-8150 Eight-Core Proce ||| time: 342.29ms, workitems: 64
Device 1(gddr): Oland ||| time: 34.08ms, workitems: 8,128
-----------------------------------------------------------------------------------------------------------------
It seems the brute-force N-body algorithm overwhelms the CPU. Maybe the CPU does not have enough registers for all threads and inevitably spills to RAM, even for __private OpenCL variables in the kernel; maybe it is just the drivers.
So, taking only the R7-240, it computes roughly 31 giga-square-roots per second (8,128 work items × 8,192 loop iterations × 16 square roots in 34.08 ms).
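The throughput figure follows from the kernel's loop structure and the logged time. Here is a minimal sketch of that arithmetic, assuming the count is 16 scalar square roots per loop iteration (four float4 `sqrt` calls) and that the logged time covers exactly the listed work items; `sqrt_per_second` is a hypothetical helper name.

```c
/*
 * Scalar square roots per second for one device in this benchmark.
 * Each work item loops over particles/4 float4 elements and performs
 * four float4 sqrt calls per iteration, i.e. 16 scalar square roots.
 */
double sqrt_per_second(long particles, long work_items, double time_ms)
{
    double per_work_item = (double)(particles / 4) * 16.0;
    return per_work_item * (double)work_items / (time_ms / 1000.0);
}
```

For the R7-240 line above, `sqrt_per_second(32768, 8128, 34.08)` gives about 3.1e10, which is in line with the quoted figure.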
AMD RX550 (stream/zero copy is enabled):
Compute-ID: 1 ----- Load Distributions: [100.0%] --------------------------------------------------------------
Device 0(gddr): gfx804 ||| time: 11.86ms, workitems: 8,192
-----------------------------------------------------------------------------------------------------------------
AMD RX550 (beta drivers) and R7-240 (stream/zero copy is enabled):
Compute-ID: 1 ----- Load Distributions: [71.9%] - [28.1%] -----------------------------------------------------
Device 0(gddr): gfx804 ||| time: 9.15ms, workitems: 5,888
Device 1(gddr): Oland ||| time: 9.25ms, workitems: 2,304
-----------------------------------------------------------------------------------------------------------------
The RX550 has 512 cores at 1300 MHz while the R7-240 has 320 cores at 900 MHz, so the estimated performance ratio is 2.3x to 1x. The benchmark result is 72/28, nearly 2.6x. The RX550 must have some architectural optimizations, or the difference comes from its 112 GB/s memory bandwidth versus 25 GB/s on the R7-240.
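The two ratios compared above can be reproduced with a few lines. This sketch assumes peak throughput scales with cores times clock, which is the same rough model the estimate uses; both function names are hypothetical.

```c
/* Naive throughput ratio of device A to device B: cores * clock. */
double estimated_ratio(double cores_a, double mhz_a, double cores_b, double mhz_b)
{
    return (cores_a * mhz_a) / (cores_b * mhz_b);
}

/* Ratio of the load shares the balancer settled on (percent values). */
double measured_ratio(double share_a, double share_b)
{
    return share_a / share_b;
}
```

`estimated_ratio(512, 1300, 320, 900)` is about 2.31, while `measured_ratio(71.9, 28.1)` is about 2.56, so the RX550 does roughly 10% better than the cores-times-clock model predicts.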
Nvidia GTX-660ti (stream/zero copy is enabled):
Compute-ID: 1 ----- Load Distributions: [100.0%] --------------------------------------------------------------
Device 0(gddr): GeForce GTX 660 Ti ||| time: 12.5ms, workitems: 8,192
-----------------------------------------------------------------------------------------------------------------
Contributor: cmisztur
ASUS Z270A, 8 GPU build, production (stream/zero copy is enabled)
1 cores are chosen for compute(equals to device partition cores).
1 cores are chosen for compute(equals to device partition cores).
---------
Selected devices:
#0: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#1: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#2: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#3: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#4: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#5: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#6: Ellesmere(Advanced Micro Devices, Inc.) number of compute units: 36 type:GPU memory: GB
#7: GeForce GTX 1070(NVIDIA Corporation) number of compute units: 15 type:GPU memory: GB
#8: Intel(R) HD Graphics 510(Intel(R) Corporation) number of compute units: 12 type:GPU memory: 3.14GB
#9: Intel(R) Pentium(R) CPU G4400 @ 3.30GHz(Intel(R) Corporation) number of compute units: 1 type:CPU memory: 3.87GB
#10: Intel(R) Pentium(R) CPU G4400 @ 3.30GHz(GenuineIntel) number of compute units: 1 type:CPU memory: 3.87GB
---------
Compute-ID: 1 ----- Load Distributions: [25.8%] - [10.9%] - [5.5%] - [5.5%] - [4.7%] - [25.8%] - [6.3%] - [2.3%] - [3.1%] - [4.7%] - [5.5%] -------------------------------------------------
Device 0(gddr): Ellesmere ||| time: 57.67ms, workitems: 2,112
Device 1(gddr): Ellesmere ||| time: 39.31ms, workitems: 896
Device 2(gddr): Ellesmere ||| time: 58.25ms, workitems: 448
Device 3(gddr): Ellesmere ||| time: 62.35ms, workitems: 448
Device 4(gddr): Ellesmere ||| time: 59.54ms, workitems: 384
Device 5(gddr): Ellesmere ||| time: 40ms, workitems: 2,112
Device 6(gddr): Ellesmere ||| time: 58.78ms, workitems: 512
Device 7(stream): Intel(R) Pentium(R) CPU G4400 @ ||| time: 59.2ms, workitems: 192
Device 8(gddr): GeForce GTX 1070 ||| time: 57.88ms, workitems: 256
Device 9(stream): Intel(R) HD Graphics 510 ||| time: 62.13ms, workitems: 384
Device 10(stream): Intel(R) Pentium(R) CPU G4400 @ ||| time: 56.39ms, workitems: 448
-----------------------------------------------------------------------------------------------------------------
Contributor: cmisztur
VMware Virtual Machine, development, CPU only.
1 cores are chosen for compute(equals to device partition cores).
---------
Selected devices:
#0: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz(Intel(R) Corporati number of compute units: 1 type:CPU memory: 4GB
---------
Compute-ID: 1 ----- Load Distributions: [100.0%] --------------------------------------------------------------
Device 0(stream): Intel(R) Xeon(R) CPU E ||| time: 1,901.54ms, workitems: 8,192
-----------------------------------------------------------------------------------------------------------------
1M particles Results
Compute-ID: 1 ----- Load Distributions: [70.2%] - [29.8%] -----------------------------------------------------
Device 0(gddr): gfx804 ||| time: 8,184.47ms, workitems: 184,000
Device 1(gddr): Oland ||| time: 8,254.51ms, workitems: 78,144
-----------------------------------------------------------------------------------------------------------------
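As a rough sanity check, the 1M time can be compared with a quadratic extrapolation of the 32k run. The sketch assumes this result comes from the same RX550 + R7-240 pair (the device names match) and ignores that the 32k run had streaming and zero copy enabled while this one did not; `scaled_time_ms` is a hypothetical helper.

```c
/* Expected time if the cost grows with the square of the particle count. */
double scaled_time_ms(double base_ms, double base_particles, double particles)
{
    double factor = particles / base_particles;
    return base_ms * factor * factor;
}
```

Starting from the 9.25 ms two-GPU result at 32k particles, `scaled_time_ms(9.25, 32768, 1048576)` gives 9472 ms. The measured 8,254.51 ms is somewhat below that.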