[benchmarks] don't fail on suite setup issues #2654

pbalcer · 2025-02-03T11:59:40Z

No description provided.

github-actions · 2025-02-03T12:00:33Z

Compute Benchmarks level_zero run (with params: --sycl-target intel_gpu_pvc):
https://github.com/oneapi-src/unified-runtime/actions/runs/13112970891

github-actions · 2025-02-03T12:25:13Z

Compute Benchmarks level_zero run (--sycl-target intel_gpu_pvc):
https://github.com/oneapi-src/unified-runtime/actions/runs/13112970891
Job status: success. Test status: success.

Summary

Total 79 benchmarks in mean.
Geomean 100.132%.
Improved 13 Regressed 9 (threshold 2.00%)

(result is better)
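The summary counts can be reproduced from the per-benchmark rows. The following is a minimal sketch, not the benchmark script's actual code: the function names are hypothetical, and it assumes a benchmark counts as improved or regressed only when its change strictly exceeds the 2.00% threshold. Relative perf is taken as baseline divided by "This PR" for time-like units (lower is better) and the inverse for throughput units, which agrees with the percentages shown in the tables of this comment.

```python
def relative_perf(pr: float, baseline: float, lower_is_better: bool = True) -> float:
    """Ratio above 1.0 means this PR is better than the baseline."""
    return baseline / pr if lower_is_better else pr / baseline


def classify(rel: float, threshold: float = 0.02) -> str:
    """Label a ratio using the 2.00% threshold (strict comparison is an assumption)."""
    change = rel - 1.0
    if change > threshold:
        return "improved"
    if change < -threshold:
        return "regressed"
    return "unchanged"


# (This PR, baseline) pairs copied from rows of the first run, all time-like units.
samples = {
    "graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10": (5585.614, 5721.966),
    "alloc/size:10000/0/4096/iterations:200000/threads:4 glibc": (2699.630, 2620.060),
    "api_overhead_benchmark_l0 SubmitKernel out of order": (11.708, 11.868),
}

labels = {name: classify(relative_perf(pr, base)) for name, (pr, base) in samples.items()}
for name, label in labels.items():
    print(f"{label:10s} {name}")
```

Applied to every row of the first run, this rule gives the 13 improved and 9 regressed benchmarks stated in the summary.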

Performance change in benchmark groups

Relative perf in group api (12): 99.570%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_l0 SubmitKernel out of order	11.708000 μs	11.868 μs	101.37%	1.37%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.098000 μs	2.113 μs	100.71%	0.71%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	104663.000000 instr	104663.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110006.000000 instr	110006.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	123063.000 instr	122876.000000 instr	99.85%	-0.15%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.072 μs	21.005000 μs	99.68%	-0.32%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.689 μs	1.679000 μs	99.41%	-0.59%	.
api_overhead_benchmark_ur SubmitKernel in order	16.358 μs	16.241000 μs	99.28%	-0.72%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.312 μs	24.133000 μs	99.26%	-0.74%	.
api_overhead_benchmark_ur SubmitKernel out of order	15.915 μs	15.750000 μs	98.96%	-1.04%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.382 μs	22.969000 μs	98.23%	-1.77%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.636 μs	11.418000 μs	98.13%	-1.87%	.

Relative perf in group memory (4): 100.222%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.183000 GB/s	3.158 GB/s	100.79%	0.79%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	250.480000 μs	251.872 μs	100.56%	0.56%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.550000 μs	5.573 μs	100.41%	0.41%	.
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	133.629 μs	132.472000 μs	99.13%	-0.87%	.
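The per-group figure appears to be the geometric mean of the per-benchmark ratios. As a rough check under that assumption, the sketch below recomputes the memory group from the four rows above. The helper names are hypothetical, and the choice of which units count as "higher is better" is inferred from the reported percentages rather than taken from the benchmark scripts.

```python
import math

# (This PR, baseline, lower_is_better) for the four rows of the memory group.
memory_rows = [
    (3.183, 3.158, False),     # StreamMemory Triad, GB/s (throughput)
    (250.480, 251.872, True),  # QueueInOrderMemcpy Device to Device, μs
    (5.550, 5.573, True),      # QueueMemcpy Device to Device, μs
    (133.629, 132.472, True),  # QueueInOrderMemcpy Host to Device, μs
]


def relative_perf(pr: float, baseline: float, lower_is_better: bool) -> float:
    """Ratio above 1.0 means this PR is better than the baseline."""
    return baseline / pr if lower_is_better else pr / baseline


def geomean(values: list[float]) -> float:
    """Geometric mean computed via logarithms to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = [relative_perf(*row) for row in memory_rows]
group_perf = geomean(ratios) * 100.0
print(f"Relative perf in group memory ({len(ratios)}): {group_perf:.3f}%")
# Should land close to the 100.222% reported for this group.
```

A geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out, which an arithmetic mean would not do.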

Relative perf in group miscellaneous (1): 100.034%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	860.370000 bw GB/s	860.664 bw GB/s	100.03%	0.03%	.

Relative perf in group multithread (10): 99.357%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7425.063000 μs	7472.404 μs	100.64%	0.64%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6930.378000 μs	6939.950 μs	100.14%	0.14%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1200.630000 μs	1201.865 μs	100.10%	0.10%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2095.744 μs	2093.086000 μs	99.87%	-0.13%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	113100.883 μs	112790.682000 μs	99.73%	-0.27%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8725.475 μs	8689.121000 μs	99.58%	-0.42%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	41248.839 μs	40846.653000 μs	99.02%	-0.98%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	26016.861 μs	25587.435000 μs	98.35%	-1.65%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17482.001 μs	17154.077000 μs	98.12%	-1.88%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	47867.991 μs	46935.372000 μs	98.05%	-1.95%	.

Relative perf in group graph (10): 100.689%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	5585.614000 μs	5721.966 μs	102.44%	2.44%	++
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	56449.397000 μs	57817.523 μs	102.42%	2.42%	++
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	5591.936000 μs	5688.177 μs	101.72%	1.72%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	62.106000 μs	62.367 μs	100.42%	0.42%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72509.690000 μs	72642.878 μs	100.18%	0.18%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	54.513000 μs	54.566 μs	100.10%	0.10%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	353437.568000 μs	353502.721 μs	100.02%	0.02%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71760.323 μs	71747.470000 μs	99.98%	-0.02%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353432.608 μs	353339.946000 μs	99.97%	-0.03%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	676.466 μs	674.284000 μs	99.68%	-0.32%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (5): 101.878%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	2862.910000 ns	3174.620 ns	110.89%	10.89%	++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2085.710000 ns	2192.650 ns	105.13%	5.13%	++++
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	310.765 ns	306.767000 ns	98.71%	-1.29%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	2783.700 ns	2735.530000 ns	98.27%	-1.73%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2699.630 ns	2620.060000 ns	97.05%	-2.95%	--

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (5): 100.230%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	192.698000 ns	195.988 ns	101.71%	1.71%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	267.576000 ns	271.315 ns	101.40%	1.40%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	213.528000 ns	213.992 ns	100.22%	0.22%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	718.204 ns	711.693000 ns	99.09%	-0.91%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	719.650 ns	710.790000 ns	98.77%	-1.23%	.

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (5): 98.903%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1715.220000 ns	1936.480 ns	112.90%	12.90%	+++++++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3193.060000 ns	3386.980 ns	106.07%	6.07%	++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	249.441000 ns	253.226 ns	101.52%	1.52%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1408.400 ns	1267.280000 ns	89.98%	-10.02%	-------
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	1421.850 ns	1230.060000 ns	86.51%	-13.49%	----------

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (5): 99.348%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	289.022000 ns	299.838 ns	103.74%	3.74%	+++
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	191.496000 ns	192.935 ns	100.75%	0.75%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	208.625 ns	206.336000 ns	98.90%	-1.10%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	755.180 ns	730.895000 ns	96.78%	-3.22%	--
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	752.595 ns	727.999000 ns	96.73%	-3.27%	--

Relative perf in group alloc/min (6): 102.427%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	993.201000 ns	1128.250 ns	113.60%	13.60%	++++++++++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	176.046000 ns	182.287 ns	103.55%	3.55%	+++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	962.199000 ns	968.189 ns	100.62%	0.62%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	834.393000 ns	834.560 ns	100.02%	0.02%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	814.536 ns	809.442000 ns	99.37%	-0.63%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	180.546 ns	177.227000 ns	98.16%	-1.84%	.

Relative perf in group multiple (16): 99.885%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	26694.800000 ns	27865.300 ns	104.38%	4.38%	+++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	139774.000000 ns	144859.000 ns	103.64%	3.64%	+++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	4130.730000 ns	4241.250 ns	102.68%	2.68%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1151730.000000 ns	1181150.000 ns	102.55%	2.55%	++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	74249.600000 ns	75687.100 ns	101.94%	1.94%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	158545.000000 ns	160647.000 ns	101.33%	1.33%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	138895.000 ns	138580.000000 ns	99.77%	-0.23%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	31092.200 ns	31018.400000 ns	99.76%	-0.24%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	139588.000 ns	139089.000000 ns	99.64%	-0.36%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15375.100 ns	15279.900000 ns	99.38%	-0.62%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4256.920 ns	4200.920000 ns	98.68%	-1.32%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	42358.400 ns	41527.800000 ns	98.04%	-1.96%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25788.500 ns	25041.800000 ns	97.10%	-2.90%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1198540.000 ns	1162710.000000 ns	97.01%	-2.99%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32125.800 ns	31133.200000 ns	96.91%	-3.09%	--
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31533.000 ns	30222.700000 ns	95.84%	-4.16%	---

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.375158 M keys/sec
Velocity-Bench Bitcracker	-	35.965200 s
Velocity-Bench CudaSift	-	201.701000 ms
Velocity-Bench Easywave	-	226.000000 ms
Velocity-Bench QuickSilver	-	117.580000 MMS/CTT
Velocity-Bench Sobel Filter	-	611.944000 ms
Velocity-Bench dl-cifar	-	23.442800 s
Velocity-Bench dl-mnist	-	2.720000 s
Velocity-Bench svm	-	0.134300 s
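Rows whose "This PR" column is "-" carry no measurement for this run, so no ratio can be formed and the group is reported as "cannot calculate". A minimal sketch of that handling follows. The function names are hypothetical, the rows are assumed to be tab-separated as in this comment, and the ratio assumes a lower-is-better unit.

```python
from typing import Optional


def parse_measurement(cell: str) -> Optional[float]:
    """Return the numeric part of a cell such as '35.965200 s', or None for '-'."""
    cell = cell.strip()
    if cell == "-":
        return None
    return float(cell.split()[0])


def row_relative_perf(row: str) -> Optional[float]:
    """Compute baseline / This PR for one table row, or None if a value is missing."""
    _name, this_pr, baseline = row.split("\t")[:3]
    pr_value = parse_measurement(this_pr)
    base_value = parse_measurement(baseline)
    if pr_value is None or base_value is None:
        return None  # nothing to compare, mirror "cannot calculate"
    return base_value / pr_value  # assumption: lower is better


missing = row_relative_perf("Velocity-Bench Bitcracker\t-\t35.965200 s")
present = row_relative_perf(
    "api_overhead_benchmark_l0 SubmitKernel out of order\t11.708000 μs\t11.868 μs"
)
print(missing, present)
```

Returning `None` instead of raising lets a report be generated even when a whole suite produced no results.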

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	268.614000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	277.626000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	277.078000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	277.264000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1688.724000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1764.745000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1737.282000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1705.559000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.241000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.991000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.763000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.863000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.230000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.282000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.928000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.197000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.079000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.207000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.816000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.727000 ms
MicroBench_LocalMem_int32_4096	-	29.924000 ms
MicroBench_LocalMem_fp32_4096	-	29.864000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.761000 ms
Pattern_Reduction_Hierarchical_int32	-	16.736000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.166000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.589000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.771000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.590000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.744000 ms
ScalarProduct_NDRange_int64	-	5.440000 ms
ScalarProduct_NDRange_fp32	-	3.760000 ms
ScalarProduct_Hierarchical_int32	-	10.507000 ms
ScalarProduct_Hierarchical_int64	-	11.485000 ms
ScalarProduct_Hierarchical_fp32	-	10.152000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.066000 ms
USM_Allocation_latency_fp32_host	-	37.402000 ms
USM_Allocation_latency_fp32_shared	-	0.065000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.681000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.056000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.838000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.205000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.492000 ms
VectorAddition_int64	-	3.061000 ms
VectorAddition_fp32	-	1.434000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.482000 ms
Polybench_Atax	-	6.416000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.144000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	899.874000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.029000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	824.202968 token/s
llama.cpp Text Generation Batched 128	-	62.990615 token/s
llama.cpp Prompt Processing Batched 256	-	870.375426 token/s
llama.cpp Text Generation Batched 256	-	62.990517 token/s
llama.cpp Prompt Processing Batched 512	-	429.991968 token/s
llama.cpp Text Generation Batched 512	-	62.959741 token/s

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-02-03T15:01:14Z

Compute Benchmarks level_zero run (with params: --sycl-target intel_gpu_pvc):
https://github.com/oneapi-src/unified-runtime/actions/runs/13116361310

github-actions · 2025-02-03T15:24:50Z

Compute Benchmarks level_zero run (--sycl-target intel_gpu_pvc):
https://github.com/oneapi-src/unified-runtime/actions/runs/13116361310
Job status: success. Test status: success.

Summary

Total 79 benchmarks in mean.
Geomean 99.804%.
Improved 10 Regressed 15 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group api (12): 99.330%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_l0 SubmitKernel out of order	11.688000 μs	11.868 μs	101.54%	1.54%	.
api_overhead_benchmark_ur SubmitKernel out of order	15.553000 μs	15.750 μs	101.27%	1.27%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	104663.000000 instr	104663.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110006.000000 instr	110006.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	122876.000000 instr	122876.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.091 μs	21.005000 μs	99.59%	-0.41%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.686 μs	1.679000 μs	99.58%	-0.42%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.322 μs	22.969000 μs	98.49%	-1.51%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.639 μs	11.418000 μs	98.10%	-1.90%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.627 μs	24.133000 μs	97.99%	-2.01%	-
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.157 μs	2.113000 μs	97.96%	-2.04%	-
api_overhead_benchmark_ur SubmitKernel in order	16.653 μs	16.241000 μs	97.53%	-2.47%	-

Relative perf in group memory (4): 99.370%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.162000 GB/s	3.158 GB/s	100.13%	0.13%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.594 μs	5.573000 μs	99.62%	-0.38%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	253.003 μs	251.872000 μs	99.55%	-0.45%	.
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	134.918 μs	132.472000 μs	98.19%	-1.81%	.

Relative perf in group miscellaneous (1): 105.917%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	812.587000 bw GB/s	860.664 bw GB/s	105.92%	5.92%	+++

Relative perf in group multithread (10): 99.730%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2068.430000 μs	2093.086 μs	101.19%	1.19%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7433.408000 μs	7472.404 μs	100.52%	0.52%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	25552.489000 μs	25587.435 μs	100.14%	0.14%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	40877.062 μs	40846.653000 μs	99.93%	-0.07%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17175.787 μs	17154.077000 μs	99.87%	-0.13%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	113014.974 μs	112790.682000 μs	99.80%	-0.20%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1204.627 μs	1201.865000 μs	99.77%	-0.23%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6958.898 μs	6939.950000 μs	99.73%	-0.27%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8723.371 μs	8689.121000 μs	99.61%	-0.39%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	48485.934 μs	46935.372000 μs	96.80%	-3.20%	--

Relative perf in group graph (10): 100.664%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	56469.890000 μs	57817.523 μs	102.39%	2.39%	+
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	5595.119000 μs	5721.966 μs	102.27%	2.27%	+
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	5615.528000 μs	5688.177 μs	101.29%	1.29%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	62.003000 μs	62.367 μs	100.59%	0.59%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72560.815000 μs	72642.878 μs	100.11%	0.11%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	673.827000 μs	674.284 μs	100.07%	0.07%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71727.394000 μs	71747.470 μs	100.03%	0.03%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	353408.656000 μs	353502.721 μs	100.03%	0.03%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353371.260 μs	353339.946000 μs	99.99%	-0.01%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	54.612 μs	54.566000 μs	99.92%	-0.08%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (5): 101.332%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	285.187000 ns	306.767 ns	107.57%	7.57%	++++
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3130.450000 ns	3174.620 ns	101.41%	1.41%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	2698.480000 ns	2735.530 ns	101.37%	1.37%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2174.040000 ns	2192.650 ns	100.86%	0.86%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2735.030 ns	2620.060000 ns	95.80%	-4.20%	--

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (5): 99.678%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	192.683000 ns	195.988 ns	101.72%	1.72%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	273.010 ns	271.315000 ns	99.38%	-0.62%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	215.383 ns	213.992000 ns	99.35%	-0.65%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	717.186 ns	710.790000 ns	99.11%	-0.89%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	719.890 ns	711.693000 ns	98.86%	-1.14%	.

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (5): 100.739%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1223.930000 ns	1267.280 ns	103.54%	3.54%	++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1899.840000 ns	1936.480 ns	101.93%	1.93%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	1217.490000 ns	1230.060 ns	101.03%	1.03%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3378.360000 ns	3386.980 ns	100.26%	0.26%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	260.917 ns	253.226000 ns	97.05%	-2.95%	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (5): 95.253%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	292.458000 ns	299.838 ns	102.52%	2.52%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	191.592000 ns	192.935 ns	100.70%	0.70%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	742.173 ns	727.999000 ns	98.09%	-1.91%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	746.594 ns	730.895000 ns	97.90%	-2.10%	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	260.879 ns	206.336000 ns	79.09%	-20.91%	----------

Relative perf in group alloc/min (6): 101.945%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	996.326000 ns	1128.250 ns	113.24%	13.24%	++++++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	176.906000 ns	182.287 ns	103.04%	3.04%	+
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	174.627000 ns	177.227 ns	101.49%	1.49%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	967.803000 ns	968.189 ns	100.04%	0.04%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	857.112 ns	834.560000 ns	97.37%	-2.63%	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	831.810 ns	809.442000 ns	97.31%	-2.69%	-

Relative perf in group multiple (16): 99.347%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	140271.000000 ns	144859.000 ns	103.27%	3.27%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	4152.220000 ns	4241.250 ns	102.14%	2.14%	+
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	158512.000000 ns	160647.000 ns	101.35%	1.35%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1165480.000000 ns	1181150.000 ns	101.34%	1.34%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	74983.800000 ns	75687.100 ns	100.94%	0.94%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	30770.300000 ns	31018.400 ns	100.81%	0.81%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15186.100000 ns	15279.900 ns	100.62%	0.62%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	41813.000 ns	41527.800000 ns	99.32%	-0.68%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	140913.000 ns	139089.000000 ns	98.71%	-1.29%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	140543.000 ns	138580.000000 ns	98.60%	-1.40%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	31651.200 ns	31133.200000 ns	98.36%	-1.64%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	28511.500 ns	27865.300000 ns	97.73%	-2.27%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1195510.000 ns	1162710.000000 ns	97.26%	-2.74%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4330.310 ns	4200.920000 ns	97.01%	-2.99%	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	26003.600 ns	25041.800000 ns	96.30%	-3.70%	--
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31432.800 ns	30222.700000 ns	96.15%	-3.85%	--

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.375158 M keys/sec
Velocity-Bench Bitcracker	-	35.965200 s
Velocity-Bench CudaSift	-	201.701000 ms
Velocity-Bench Easywave	-	226.000000 ms
Velocity-Bench QuickSilver	-	117.580000 MMS/CTT
Velocity-Bench Sobel Filter	-	611.944000 ms
Velocity-Bench dl-cifar	-	23.442800 s
Velocity-Bench dl-mnist	-	2.720000 s
Velocity-Bench svm	-	0.134300 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	268.614000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	277.626000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	277.078000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	277.264000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1688.724000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1764.745000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1737.282000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1705.559000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.241000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.991000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.763000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.863000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.230000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.282000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.928000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.197000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.079000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.207000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.816000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.727000 ms
MicroBench_LocalMem_int32_4096	-	29.924000 ms
MicroBench_LocalMem_fp32_4096	-	29.864000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.761000 ms
Pattern_Reduction_Hierarchical_int32	-	16.736000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.166000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.589000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.771000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.590000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.744000 ms
ScalarProduct_NDRange_int64	-	5.440000 ms
ScalarProduct_NDRange_fp32	-	3.760000 ms
ScalarProduct_Hierarchical_int32	-	10.507000 ms
ScalarProduct_Hierarchical_int64	-	11.485000 ms
ScalarProduct_Hierarchical_fp32	-	10.152000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.066000 ms
USM_Allocation_latency_fp32_host	-	37.402000 ms
USM_Allocation_latency_fp32_shared	-	0.065000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.681000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.056000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.838000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.205000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.492000 ms
VectorAddition_int64	-	3.061000 ms
VectorAddition_fp32	-	1.434000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.482000 ms
Polybench_Atax	-	6.416000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.144000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	899.874000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.029000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	824.202968 token/s
llama.cpp Text Generation Batched 128	-	62.990615 token/s
llama.cpp Prompt Processing Batched 256	-	870.375426 token/s
llama.cpp Text Generation Batched 256	-	62.990517 token/s
llama.cpp Prompt Processing Batched 512	-	429.991968 token/s
llama.cpp Text Generation Batched 512	-	62.959741 token/s

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-02-03T15:54:20Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13117455679

github-actions · 2025-02-03T16:36:59Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13117455679
Job status: success. Test status: success.

Summary

Total 90 benchmarks in mean.
Geomean 99.390%.
Improved 14 Regressed 14 (threshold 2.00%)

(the better result in each row is the one printed with six decimal places)

Performance change in benchmark groups

Relative perf in group api (12): 99.148%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_ur SubmitKernel out of order	15.585000 μs	15.750 μs	101.06%	1.06%	.
api_overhead_benchmark_l0 SubmitKernel out of order	11.813000 μs	11.868 μs	100.47%	0.47%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	104663.000000 instr	104663.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110006.000000 instr	110006.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	122876.000000 instr	122876.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.080 μs	21.005000 μs	99.64%	-0.36%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.137 μs	2.113000 μs	98.88%	-1.12%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.425 μs	24.133000 μs	98.80%	-1.20%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.713 μs	1.679000 μs	98.02%	-1.98%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.472 μs	22.969000 μs	97.86%	-2.14%	-
api_overhead_benchmark_ur SubmitKernel in order	16.642 μs	16.241000 μs	97.59%	-2.41%	-
api_overhead_benchmark_l0 SubmitKernel in order	11.706 μs	11.418000 μs	97.54%	-2.46%	-
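
The group figure for the api table above can be checked by hand. The sketch below assumes that relative performance is baseline divided by this PR for these time and instruction-count metrics (lower is better), that the group figure is the geometric mean of the per-row ratios, and that a row counts as improved or regressed once it moves past the 2.00% threshold from the summary. The helper names are illustrative and are not taken from the benchmark scripts.

```python
import math

# (this PR, baseline) pairs copied from the api table above; lower is better.
API_ROWS = [
    (15.585, 15.750),
    (11.813, 11.868),
    (104663.0, 104663.0),
    (110006.0, 110006.0),
    (122876.0, 122876.0),
    (21.080, 21.005),
    (2.137, 2.113),
    (24.425, 24.133),
    (1.713, 1.679),
    (23.472, 22.969),
    (16.642, 16.241),
    (11.706, 11.418),
]

THRESHOLD = 0.02  # the 2.00% threshold from the summary


def relative_perf(this_pr, baseline):
    """Ratio above 1.0 means this PR is faster (assumes lower is better)."""
    return baseline / this_pr


def geomean(ratios):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


ratios = [relative_perf(pr, base) for pr, base in API_ROWS]
group_perf = geomean(ratios)
improved = sum(r > 1 + THRESHOLD for r in ratios)
regressed = sum(r < 1 - THRESHOLD for r in ratios)

print(f"group: {group_perf:.3%}, improved {improved}, regressed {regressed}")
```

Under these assumptions the group value should land close to the reported 99.148%, with the three rows marked `-` counted as regressed.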

Relative perf in group memory (4): 99.520%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.198000 GB/s	3.158 GB/s	101.27%	1.27%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.619 μs	5.573000 μs	99.18%	-0.82%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	254.840 μs	251.872000 μs	98.84%	-1.16%	.
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	134.056 μs	132.472000 μs	98.82%	-1.18%	.

Relative perf in group miscellaneous (1): 105.882%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	812.850000 bw GB/s	860.664 bw GB/s	105.88%	5.88%	+++

Relative perf in group multithread (10): 100.062%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2056.593000 μs	2093.086 μs	101.77%	1.77%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	111136.742000 μs	112790.682 μs	101.49%	1.49%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8626.433000 μs	8689.121 μs	100.73%	0.73%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7442.492000 μs	7472.404 μs	100.40%	0.40%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	25506.903000 μs	25587.435 μs	100.32%	0.32%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6940.709 μs	6939.950000 μs	99.99%	-0.01%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1205.284 μs	1201.865000 μs	99.72%	-0.28%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	41092.553 μs	40846.653000 μs	99.40%	-0.60%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17272.728 μs	17154.077000 μs	99.31%	-0.69%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	48112.388 μs	46935.372000 μs	97.55%	-2.45%	-

Relative perf in group graph (10): 100.579%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	5580.336000 μs	5721.966 μs	102.54%	2.54%	+
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	56471.937000 μs	57817.523 μs	102.38%	2.38%	+
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	5596.135000 μs	5688.177 μs	101.64%	1.64%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	353205.252000 μs	353502.721 μs	100.08%	0.08%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72619.075000 μs	72642.878 μs	100.03%	0.03%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71742.419000 μs	71747.470 μs	100.01%	0.01%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	54.581 μs	54.566000 μs	99.97%	-0.03%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353516.899 μs	353339.946000 μs	99.95%	-0.05%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	62.520 μs	62.367000 μs	99.76%	-0.24%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	677.834 μs	674.284000 μs	99.48%	-0.52%	.

Relative perf in group Velocity-Bench (9): 100.042%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench Bitcracker	35.521300 s	35.965 s	101.25%	1.25%	.
Velocity-Bench QuickSilver	118.350000 MMS/CTT	117.580 MMS/CTT	100.65%	0.65%	.
Velocity-Bench dl-mnist	2.730 s	2.720000 s	99.63%	-0.37%	.
Velocity-Bench Sobel Filter	615.011 ms	611.944000 ms	99.50%	-0.50%	.
Velocity-Bench Hashtable	355.453 M keys/sec	358.375158 M keys/sec	99.18%	-0.82%	.
Velocity-Bench CudaSift	-	201.701000 ms
Velocity-Bench Easywave	-	226.000000 ms
Velocity-Bench dl-cifar	-	23.442800 s
Velocity-Bench svm	-	0.134300 s
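
Four of the nine Velocity-Bench rows have no result for this PR, yet the group still gets a figure. A minimal sketch of one way to arrive at it, assuming rows without a result are skipped and the geometric mean is taken over the remaining five. The lower/higher-is-better flags are inferred from the units and the reported percentages, and all names are illustrative rather than the benchmark scripts' own.

```python
import math

# (name, this PR, baseline, lower_is_better); None marks a missing result.
ROWS = [
    ("Bitcracker", 35.5213, 35.965, True),
    ("QuickSilver", 118.350, 117.580, False),
    ("dl-mnist", 2.730, 2.720, True),
    ("Sobel Filter", 615.011, 611.944, True),
    ("Hashtable", 355.453, 358.375158, False),
    ("CudaSift", None, 201.701, True),
    ("Easywave", None, 226.000, True),
    ("dl-cifar", None, 23.4428, True),
    ("svm", None, 0.1343, True),
]


def relative_perf(this_pr, baseline, lower_is_better):
    """Return the ratio (above 1.0 is better), or None if a result is missing."""
    if this_pr is None or baseline is None:
        return None
    return baseline / this_pr if lower_is_better else this_pr / baseline


def group_perf(rows):
    """Geometric mean over rows that have both results; None if there are none."""
    ratios = [
        ratio
        for ratio in (relative_perf(pr, base, lib) for _, pr, base, lib in rows)
        if ratio is not None
    ]
    if not ratios:
        return None  # corresponds to "cannot calculate"
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


velocity = group_perf(ROWS)
missing_only = group_perf([row for row in ROWS if row[1] is None])
print(f"{velocity:.3%}", missing_only)
```

With these inputs the group value should come out close to the reported 100.042%, and a group made up only of missing results yields `None`, which matches the "cannot calculate" groups further down.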

Relative perf in group llama.cpp (6): 99.226%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Prompt Processing Batched 256	867.196 token/s	870.375426 token/s	99.63%	-0.37%	.
llama.cpp Prompt Processing Batched 128	820.005 token/s	824.202968 token/s	99.49%	-0.51%	.
llama.cpp Text Generation Batched 128	62.483 token/s	62.990615 token/s	99.19%	-0.81%	.
llama.cpp Text Generation Batched 256	62.482 token/s	62.990517 token/s	99.19%	-0.81%	.
llama.cpp Text Generation Batched 512	62.450 token/s	62.959741 token/s	99.19%	-0.81%	.
llama.cpp Prompt Processing Batched 512	424.217 token/s	429.991968 token/s	98.66%	-1.34%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (5): 102.024%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2093.460000 ns	2192.650 ns	104.74%	4.74%	++
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	2663.560000 ns	2735.530 ns	102.70%	2.70%	+
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3109.090000 ns	3174.620 ns	102.11%	2.11%	+
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2604.620000 ns	2620.060 ns	100.59%	0.59%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	306.617000 ns	306.767 ns	100.05%	0.05%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (5): 99.117%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	271.318 ns	271.315000 ns	100.00%	-0.00%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	196.030 ns	195.988000 ns	99.98%	-0.02%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	718.486 ns	711.693000 ns	99.05%	-0.95%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	217.713 ns	213.992000 ns	98.29%	-1.71%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	723.248 ns	710.790000 ns	98.28%	-1.72%	.

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (5): 96.005%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3317.410000 ns	3386.980 ns	102.10%	2.10%	+
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1919.300000 ns	1936.480 ns	100.90%	0.90%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	259.377 ns	253.226000 ns	97.63%	-2.37%	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1375.520 ns	1267.280000 ns	92.13%	-7.87%	----
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	1397.420 ns	1230.060000 ns	88.02%	-11.98%	------

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (5): 91.420%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	291.382000 ns	299.838 ns	102.90%	2.90%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	193.823 ns	192.935000 ns	99.54%	-0.46%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	816.172 ns	730.895000 ns	89.55%	-10.45%	-----
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	818.649 ns	727.999000 ns	88.93%	-11.07%	-----
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	263.580 ns	206.336000 ns	78.28%	-21.72%	----------

Relative perf in group alloc/min (6): 101.387%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	979.688000 ns	1128.250 ns	115.16%	15.16%	+++++++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	182.265000 ns	182.287 ns	100.01%	0.01%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	838.330 ns	834.560000 ns	99.55%	-0.45%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	981.010 ns	968.189000 ns	98.69%	-1.31%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	180.268 ns	177.227000 ns	98.31%	-1.69%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	829.074 ns	809.442000 ns	97.63%	-2.37%	-

Relative perf in group multiple (16): 100.056%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	137763.000000 ns	144859.000 ns	105.15%	5.15%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1132110.000000 ns	1181150.000 ns	104.33%	4.33%	++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	29159.200000 ns	30222.700 ns	103.65%	3.65%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	157077.000000 ns	160647.000 ns	102.27%	2.27%	+
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	30333.700000 ns	31018.400 ns	102.26%	2.26%	+
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	4161.020000 ns	4241.250 ns	101.93%	1.93%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15132.200000 ns	15279.900 ns	100.98%	0.98%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1161470.000000 ns	1162710.000 ns	100.11%	0.11%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75688.200 ns	75687.100000 ns	100.00%	-0.00%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	138767.000 ns	138580.000000 ns	99.87%	-0.13%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4224.700 ns	4200.920000 ns	99.44%	-0.56%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	140112.000 ns	139089.000000 ns	99.27%	-0.73%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	28135.400 ns	27865.300000 ns	99.04%	-0.96%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	42394.300 ns	41527.800000 ns	97.96%	-2.04%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32503.900 ns	31133.200000 ns	95.78%	-4.22%	--
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	27852.800 ns	25041.800000 ns	89.91%	-10.09%	-----

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	268.614000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	277.626000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	277.078000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	277.264000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1688.724000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1764.745000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1737.282000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1705.559000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.241000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.991000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.763000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.863000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.230000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.282000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.928000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.197000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.079000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.207000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.816000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.727000 ms
MicroBench_LocalMem_int32_4096	-	29.924000 ms
MicroBench_LocalMem_fp32_4096	-	29.864000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.761000 ms
Pattern_Reduction_Hierarchical_int32	-	16.736000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.166000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.589000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.771000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.590000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.744000 ms
ScalarProduct_NDRange_int64	-	5.440000 ms
ScalarProduct_NDRange_fp32	-	3.760000 ms
ScalarProduct_Hierarchical_int32	-	10.507000 ms
ScalarProduct_Hierarchical_int64	-	11.485000 ms
ScalarProduct_Hierarchical_fp32	-	10.152000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.066000 ms
USM_Allocation_latency_fp32_host	-	37.402000 ms
USM_Allocation_latency_fp32_shared	-	0.065000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.681000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.056000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.838000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.205000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.492000 ms
VectorAddition_int64	-	3.061000 ms
VectorAddition_fp32	-	1.434000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.482000 ms
Polybench_Atax	-	6.416000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.144000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	899.874000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.029000 ms

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-mnist

Environment Variables:

pbalcer force-pushed the add-sycl-target-pvc branch from 8378842 to 27c2f7a February 3, 2025 15:00

pbalcer force-pushed the add-sycl-target-pvc branch from 27c2f7a to f93adc4 February 3, 2025 15:52

pbalcer changed the title ~~[benchmarks] add explicit sycl target for building benchmarks~~ [benchmarks] don't fail on suite setup issues Feb 3, 2025

[benchmarks] don't fail on suite setup issues

f93adc4
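
The diff is not shown in this conversation. Going only by the title, the behaviour could be sketched as below: a suite whose setup raises is logged and skipped so that the remaining suites still run. Every name here (`Suite`, `setup_suites`, the suite labels) is hypothetical and not taken from the benchmark scripts.

```python
class Suite:
    """Hypothetical stand-in for a benchmark suite."""

    def __init__(self, name, broken=False):
        self.name = name
        self.broken = broken

    def setup(self):
        # Simulate a setup step (clone, build, download) that may fail.
        if self.broken:
            raise RuntimeError(f"{self.name}: setup failed")


def setup_suites(suites):
    """Return (ready, failures); one failing setup does not abort the rest."""
    ready, failures = [], {}
    for suite in suites:
        try:
            suite.setup()
        except Exception as exc:  # broad on purpose: any setup error skips the suite
            failures[suite.name] = str(exc)
            print(f"skipping {suite.name}: {exc}")
        else:
            ready.append(suite)
    return ready, failures


ready, failures = setup_suites(
    [Suite("suite-a"), Suite("suite-b", broken=True), Suite("suite-c")]
)
```

Returning the failures alongside the usable suites keeps the skipped ones visible in the run log instead of silently dropping them.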

igchor approved these changes Feb 3, 2025
