Add kernel time tracing support for gpu #381

Open
zhczhong wants to merge 1 commit into main from zhicong/gpu_profile

zhczhong
Member

Use pti-gpu to trace OpenCL kernel execution and print the trace results after the program finishes. The feature is OFF by default and can be enabled by setting -DGC_ENABLE_GPU_PROFILE=ON.
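
For context, a switch like this is usually wired up along the following lines. This is a minimal sketch, not the PR's actual code: it assumes the flag is declared with a standard `option()` call and that `cmake/` is on `CMAKE_MODULE_PATH`, and the `target_compile_definitions` line is a hypothetical way of exposing the setting to the C++ sources.

```cmake
# Assumption: the flag is a standard CMake option that defaults to OFF.
option(GC_ENABLE_GPU_PROFILE "Enable GPU kernel time tracing via pti-gpu" OFF)

if(GC_ENABLE_GPU_PROFILE)
  # cmake/ptigpu.cmake is the module touched by this PR; include() finds it
  # only if the cmake/ directory is listed in CMAKE_MODULE_PATH.
  include(ptigpu)
  # Hypothetical: let the runtime sources guard tracing code with #ifdef.
  target_compile_definitions(GcGpuOclRuntime PRIVATE GC_ENABLE_GPU_PROFILE)
endif()
```

With a layout like this, passing -DGC_ENABLE_GPU_PROFILE=ON at configure time turns the tracing path on, and omitting it leaves the build unchanged.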

Example:

./bin/gc-gpu-runner --shared-libs=../externals/llvm-project/build/lib/libmlir_runner_utils.so ../test/mlir/test/gc/gpu-runner/XeGPU/f16_matmul_128x64_transpose.mlir

Unranked Memref base@ = 0x556fbba18bc0 rank = 1 offset = 0 sizes = [32] strides = [64] data = 
[5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719,  5.11719]

=== API Timing Results: ===

             Total Execution Time (ns):             63061619
Total API Time for CL CPU backend (ns):                 3418
Total API Time for CL GPU backend (ns):              1952063

== CL CPU Backend: ==

      Function,       Calls,           Time (ns),  Time (%),        Average (ns),            Min (ns),            Max (ns)
clGetDeviceIDs,           1,                3418,    100.00,                3418,                3418,                3418

== CL GPU Backend: ==

                                Function,       Calls,           Time (ns),  Time (%),        Average (ns),            Min (ns),            Max (ns)
                    clEnqueueMemcpyINTEL,           4,             1111814,     56.96,              277953,               12439,              877623
                  clEnqueueNDRangeKernel,           1,              181885,      9.32,              181885,              181885,              181885
                          clBuildProgram,           1,              166584,      8.53,              166584,              166584,              166584
                         clWaitForEvents,           2,              142960,      7.32,               71480,               18542,              124418
      clCreateCommandQueueWithProperties,           1,              113511,      5.81,              113511,              113511,              113511
                   clSharedMemAllocINTEL,           3,              106677,      5.46,               35559,               14840,               70172
                          clMemFreeINTEL,           3,               71023,      3.64,               23674,               15496,               39005
                         clCreateContext,           1,               15468,      0.79,               15468,               15468,               15468
                        clReleaseProgram,           1,                7996,      0.41,                7996,                7996,                7996
           clSetKernelArgMemPointerINTEL,           3,                6909,      0.35,                2303,                1240,                4271
                          clCreateKernel,           1,                6202,      0.32,                6202,                6202,                6202
                   clCreateProgramWithIL,           1,                5263,      0.27,                5263,                5263,                5263
                          clReleaseEvent,           5,                2464,      0.13,                 492,                 214,                1222
clGetExtensionFunctionAddressForPlatform,           6,                2361,      0.12,                 393,                 233,                 660
                         clReleaseKernel,           2,                2077,      0.11,                1038,                 552,                1525
                          clGetDeviceIDs,           2,                1922,      0.10,                 961,                 206,                1716
                          clSetKernelArg,           7,                1670,      0.09,                 238,                  53,                1038
                           clCloneKernel,           1,                1485,      0.08,                1485,                1485,                1485
                   clReleaseCommandQueue,           1,                1332,      0.07,                1332,                1332,                1332
                         clGetDeviceInfo,           4,                 979,      0.05,                 244,                  64,                 527
                   clGetCommandQueueInfo,           4,                 937,      0.05,                 234,                  49,                 550
                     clSetKernelExecInfo,           3,                 544,      0.03,                 181,                  69,                 405


=== Device Timing Results: ===

                Total Execution Time (ns):             63061619
Total Device Time for CL GPU backend (ns):                95680

== CL GPU Backend: ==

              Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
linalg_matmul_kernel,           1,               95680,      100.00,               95680,               95680,               95680

Contributor

@kurapov-peter kurapov-peter left a comment

PTI's SDK does not support OpenCL yet. The current estimate is that it will take about a month to complete. The API will follow the existing one for L0 (Level Zero), and there will be a proper library to link against as well.

README.md
cmake/ptigpu.cmake (outdated)
lib/gc/ExecutionEngine/GPURuntime/ocl/GpuOclRuntime.cpp (outdated)
lib/gc/ExecutionEngine/GPURuntime/ocl/CMakeLists.txt (outdated)
@zhczhong
Member Author

zhczhong commented Oct 16, 2024

PTI's SDK does not support OpenCL yet. The current estimate is that it will take about a month to complete. The API will follow the existing one for L0 (Level Zero), and there will be a proper library to link against as well.

Thanks for the information! We can switch to the SDK API once OpenCL support is ready.

if(GC_ENABLE_GPU_PROFILE)
include(ptigpu)
get_property(GC_PTIGPU_BINARY_DIR GLOBAL PROPERTY GC_PTIGPU_BINARY_DIR)
target_link_libraries(GcGpuOclRuntime PRIVATE ${GC_PTIGPU_BINARY_DIR}/lib/libonetrace_tool.so)
Contributor

Suggested change
- target_link_libraries(GcGpuOclRuntime PRIVATE ${GC_PTIGPU_BINARY_DIR}/lib/libonetrace_tool.so)
+ target_link_libraries(GcGpuOclRuntime PRIVATE onetrace_tool)

Should this work?
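
One way the target-name form could work is if `onetrace_tool` is visible to CMake as a target, for example an imported library. The sketch below assumes the library is produced under `${GC_PTIGPU_BINARY_DIR}/lib`, as in the original line; the imported-target declaration is illustrative and not taken from this PR.

```cmake
# Assumption: libonetrace_tool.so is built elsewhere (e.g. by the ptigpu module)
# and the GC_PTIGPU_BINARY_DIR global property points at its build tree.
get_property(GC_PTIGPU_BINARY_DIR GLOBAL PROPERTY GC_PTIGPU_BINARY_DIR)

# Declare an imported target so the name `onetrace_tool` resolves to the .so file.
add_library(onetrace_tool SHARED IMPORTED GLOBAL)
set_target_properties(onetrace_tool PROPERTIES
  IMPORTED_LOCATION "${GC_PTIGPU_BINARY_DIR}/lib/libonetrace_tool.so")

# Link by target name instead of by file path.
target_link_libraries(GcGpuOclRuntime PRIVATE onetrace_tool)
```

Without a target of that name, CMake treats a bare `onetrace_tool` as a plain library name and passes it to the linker as `-lonetrace_tool`, which only succeeds if the library's directory is on the link search path.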

Contributor

@kurapov-peter kurapov-peter left a comment

@zhczhong, would it be possible to put the changes inside PTI instead of a fork? A branch maybe?

@zhczhong
Member Author

zhczhong commented Oct 31, 2024

@zhczhong, would it be possible to put the changes inside PTI instead of a fork? A branch maybe?

I tried, but I don't have write permission for the PTI repo and cannot create a branch there. Do you have any suggestions?

@kurapov-peter
Contributor

Here's a public branch for PTI's interface: https://github.com/intel/pti-gpu/tree/exp_opencl_0.11.0. @zhczhong, could you please revive this?

@zhczhong
Member Author

Here's a public branch for PTI's interface: https://github.com/intel/pti-gpu/tree/exp_opencl_0.11.0. @zhczhong, could you please revive this?

OK! I will switch the profiling support to the SDK-based one.

zhczhong force-pushed the zhicong/gpu_profile branch from 3536d6f to 19dbdda on November 29, 2024, 02:45
@zhczhong
Member Author

This PR depends on intel/pti-gpu#80 to compile correctly.

zhczhong force-pushed the zhicong/gpu_profile branch 4 times, most recently from 4b9bb13 to 188fef9, on November 29, 2024, 03:25
zhczhong force-pushed the zhicong/gpu_profile branch from 188fef9 to 78a1f3c on December 17, 2024, 01:07