Fix matching begin with end callbacks on sampler #206

vlkale · 2023-09-07T19:11:15Z

Fixes the sampler so that each end callback is matched with its begin callback. This is essential for child callbacks and for the sampler utility to work with connectors such as space-time-stack.
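For readers who want the idea in code, here is a minimal sketch of begin/end matching. It is not the code in this PR: `MatchingSampler`, `BeginFn`, and `EndFn` are hypothetical names, the "skip N, then sample the next invocation" rule is an assumption, and the only parts taken from the commit messages are the map from the sampler's kID to the child's kID and the `.end()` check on that `unordered_map`.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins for the child connector's callbacks. In a real tool
// these would be function pointers resolved from the child library.
using BeginFn = void (*)(const char* name, std::uint32_t devID, std::uint64_t* kID);
using EndFn   = void (*)(std::uint64_t kID);

struct MatchingSampler {
  std::uint64_t skip;     // invocations skipped between samples (assumed semantics)
  BeginFn childBegin;     // may be null if the child does not provide it
  EndFn childEnd;         // may be null if the child does not provide it
  std::uint64_t invocation = 0;  // running count of begin callbacks seen
  std::uint64_t nextID     = 0;  // IDs handed back to the caller
  std::unordered_map<std::uint64_t, std::uint64_t> sampled;  // sampler kID -> child kID

  void begin(const char* name, std::uint32_t devID, std::uint64_t* kID) {
    *kID = nextID++;
    ++invocation;
    if (invocation % (skip + 1) != 0) return;  // this invocation is not sampled
    std::uint64_t childID = 0;
    if (childBegin) childBegin(name, devID, &childID);
    sampled.emplace(*kID, childID);  // remember that this begin was forwarded
  }

  void end(std::uint64_t kID) {
    auto it = sampled.find(kID);
    if (it == sampled.end()) return;     // begin was not forwarded, so skip the end too
    if (childEnd) childEnd(it->second);  // forward with the child's own ID
    sampled.erase(it);
  }
};
```

The point of the map is that the end callback never reaches the child unless the corresponding begin did, which is what a stack-based child tool needs in order to stay balanced.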

vlkale · 2023-09-07T19:47:07Z

I ran this fix on Perlmutter with the stream benchmark built with g++, using 20 invocations and a sampling skip of 101 with the NVTX connector. The output follows; it seems to suggest this is working as expected, since no actual calls are being made. Specifically, see the bold text.

vkale3@perlmutter:login39:~/kks/benchmarks/stream> nsys nvprof stream.cuda 
WARNING: stream.cuda and any of its children processes will be profiled.

-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/homes/v/vkale3/kto-inst-dir2/lib64/libkp_nvprof_connector.so
KokkosP: Loading child library ..
-----------------------------------------------------------
KokkosP: NVTX Analyzer Connector (sequence is 1, version: 20211015)
-----------------------------------------------------------
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       yes
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: (null)
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 101 calling child-begin function...
KokkosP: sample 101 calling child-end function...
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set               917720.61 MB/s
Copy             1244815.54 MB/s
Scale            1246769.11 MB/s
Add              1387866.46 MB/s
Triad            1389048.85 MB/s
-------------------------------------------------------------
-----------------------------------------------------------
KokkosP: Finalization of NVTX Connector. Complete.
-----------------------------------------------------------
Generating '/tmp/nsys-report-8811.qdstrm'
[1/7] [========================100%] report44.nsys-rep
[2/7] [========================100%] report44.sqlite
[3/7] Executing 'nvtxsum' stats report

NVTX Range Statistics:

 **Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)   Style   Range
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -------  -----
    100.0           55,073          1  55,073.0  55,073.0    55,073    55,073          0.0  PushPop  add**  

[4/7] Executing 'cudaapisum' stats report

CUDA API Statistics:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ----------------------
     51.2      480,295,783        118   4,070,303.2   1,300,278.5     1,222  337,006,312  30,916,572.7  cudaDeviceSynchronize 
     47.6      446,576,634          8  55,822,079.3  60,733,341.5    29,586  111,013,217  40,816,341.5  cudaMemcpy            
      0.3        3,126,454         10     312,645.4     158,728.0     1,002      770,875     304,597.9  cudaStreamSynchronize 
      0.3        2,815,030          8     351,878.8     129,959.0     5,560      887,825     412,313.0  cudaMalloc            
      0.2        1,964,504          8     245,563.0     109,495.0     2,966      766,978     292,534.6  cudaFree              
      0.1        1,027,617          2     513,808.5     513,808.5     8,506    1,019,111     714,605.6  cudaHostAlloc         
      0.1          686,971        102       6,735.0       5,775.5     5,200       38,682       4,621.7  cudaLaunchKernel      
      0.1          517,430          4     129,357.5      91,536.5    12,253      322,104     144,097.5  cudaMemcpyToSymbol    
      0.0          389,289          3     129,763.0       6,752.0     6,382      376,155     213,381.8  cudaFreeHost          
      0.0          117,983          7      16,854.7      14,677.0     7,645       41,488      11,344.2  cudaMemcpyAsync       
      0.0           47,810          3      15,936.7      14,888.0    13,525       19,397       3,073.3  cudaMemsetAsync       
      0.0           39,134          2      19,567.0      19,567.0    14,277       24,857       7,481.2  cudaStreamCreate      
      0.0           36,639          2      18,319.5      18,319.5    14,287       22,352       5,702.8  cudaMemset            
      0.0           10,550          1      10,550.0      10,550.0    10,550       10,550           0.0  cudaEventCreate       
      0.0            9,638          1       9,638.0       9,638.0     9,638        9,638           0.0  cudaMallocHost        
      0.0            9,308          2       4,654.0       4,654.0     4,118        5,190         758.0  cudaStreamDestroy     
      0.0            3,116          1       3,116.0       3,116.0     3,116        3,116           0.0  cudaEventDestroy      
      0.0            2,004          1       2,004.0       2,004.0     2,004        2,004           0.0  cuModuleGetLoadingMode

[5/7] Executing 'gpukernsum' stats report

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     26.3       28,723,687         20  1,436,184.4  1,436,250.0  1,433,450  1,439,018      1,559.1  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_add(Kokkos::…
     26.2       28,678,949         20  1,433,947.5  1,433,978.0  1,431,177  1,437,194      1,581.1  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_triad(Kokkos…
     18.1       19,797,097         20    989,854.9    990,006.5    987,591    992,263      1,284.8  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_copy(Kokkos:…
     18.1       19,786,538         20    989,326.9    989,159.0    987,239    992,039      1,254.2  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_scale(Kokkos…
     10.7       11,641,941         20    582,097.1    582,084.0    581,956    582,341        103.8  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_set(Kokkos::…
      0.6          631,845          1    631,845.0    631,845.0    631,845    631,845          0.0  desul::<unnamed>::init_lock_arrays_cuda_kernel()                                                    
      0.0            3,168          1      3,168.0      3,168.0      3,168      3,168          0.0  Kokkos::Impl::<unnamed>::query_cuda_kernel_arch(int *)                                              

[6/7] Executing 'gpumemtimesum' stats report

CUDA Memory Operation Statistics (by time):

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)       Operation     
 --------  ---------------  -----  ------------  ------------  --------  -----------  ------------  ------------------
     62.0      277,183,501      6  46,197,250.2  31,652,733.0     2,304  110,928,229  53,112,441.5  [CUDA memcpy DtoH]
     37.8      168,865,979     13  12,989,690.7       1,888.0     1,472   61,447,213  24,877,177.9  [CUDA memcpy HtoD]
      0.3        1,269,162      5     253,832.4     419,939.0     3,520      421,987     228,474.6  [CUDA memset]     

[7/7] Executing 'gpumemsizesum' stats report

CUDA Memory Operation Statistics (by size):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
  2,400.893      5   480.179   800.000     0.008   800.000      437.934  [CUDA memset]     
  2,400.001     13   184.615     0.000     0.000   800.000      350.823  [CUDA memcpy HtoD]
  2,400.000      6   400.000   400.000     0.000   800.000      438.178  [CUDA memcpy DtoH]

Generated:
    /global/u1/v/vkale3/kks/benchmarks/stream/report44.nsys-rep
    /global/u1/v/vkale3/kks/benchmarks/stream/report44.sqlite
vkale3@perlmutter:login39:~/kks/benchmarks/stream>

vlkale · 2023-09-07T20:05:19Z

Here is a screenshot from Perlmutter for the same default stream benchmark in the case where the sampler skip rate is set to 3. This looks correct, given that every 3 samples are skipped and the 4th sample is taken. Given the 4 different kernels of stream across the 20 outer iterations of the application, the number of samples is 20, as shown in the output. When the sampler skip rate is set to 7, the number of samples drops to half of that for a skip rate of 3, as shown in the second diagram. This shows that the sampling is being done correctly, with the appropriate matching.

Use invocation number counter to sample per kernel rather than across all possible kernel invocations
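A rough sketch of what a per-kernel invocation counter could look like is below. This is an illustration only: `PerKernelSampler` and `should_sample` are hypothetical names, keying the counter on the kernel name is an assumption, and so is the "skip N, then sample the next invocation" rule.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-kernel sampler: each kernel name gets its own invocation
// counter, so a frequently launched kernel cannot crowd out a rare one.
struct PerKernelSampler {
  std::uint64_t skip;  // invocations skipped between samples (assumed semantics)
  std::unordered_map<std::string, std::uint64_t> invocations;  // kernel name -> count

  // Returns true when this invocation of `kernelName` should be sampled.
  bool should_sample(const std::string& kernelName) {
    const std::uint64_t n = ++invocations[kernelName];  // starts at 0 on first use
    return n % (skip + 1) == 0;
  }
};
```

With a single global counter, the decision for one kernel depends on how often the other kernels run; with one counter per kernel, each kernel is sampled at the same rate regardless of interleaving.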

Fixing scan and reduce with matching on kokkosp_end using .end() unordered_map condition

vlkale · 2023-09-07T22:48:35Z

I have added two updated sample outputs of the stream benchmark, based on the latest state of this PR.

Here is a screenshot from Perlmutter for the same default stream benchmark in the case where the sampler skip rate is set to 3. This looks correct, given that every 3 samples are skipped and the 4th sample is taken. Given the 4 different kernels of stream across the 20 outer iterations of the application, the number of samples is 20, as shown in the output.

When the sampler skip rate is set to 7, the number of samples drops to half of that for a skip rate of 3, as shown in the second diagram. This shows that the sampling is being done correctly, with the appropriate matching.
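The counts quoted above can be checked with a small sketch. It assumes a single counter across all kernel invocations, 80 invocations in total (4 kernels times 20 iterations, as stated above), and the "skip N, take the next" reading of the skip rate; `count_samples` is a hypothetical helper, not part of the sampler.

```cpp
#include <cstdint>

// Counts how many of `invocations` kernel launches are sampled when `skip`
// launches are skipped between samples, i.e. every (skip + 1)-th is taken.
// This rule is the reading given in the comment above, not a quote of the
// sampler source.
std::uint64_t count_samples(std::uint64_t invocations, std::uint64_t skip) {
  std::uint64_t samples = 0;
  for (std::uint64_t i = 1; i <= invocations; ++i) {
    if (i % (skip + 1) == 0) ++samples;
  }
  return samples;
}
```

Under these assumptions, `count_samples(80, 3)` gives 20 and `count_samples(80, 7)` gives 10, which matches the halving described above.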

The following is a completed run of the sampler on stream using the develop branch's space-time-stack Kokkos Tools connector. The output shows that the run completed successfully. Note that this was run on a CPU on Perlmutter.

put in map for kID to nested kid and use matching

2454cdd

vlkale added the bug label Sep 7, 2023

vlkale requested a review from crtrott September 7, 2023 19:11

vlkale self-assigned this Sep 7, 2023

vlkale marked this pull request as ready for review September 7, 2023 19:39

vlkale marked this pull request as draft September 7, 2023 20:05

vlkale and others added 4 commits September 7, 2023 15:05

Put in invocation number counter

761fcc9

Use invocation number counter to sample per kernel rather than across all possible kernel invocations

fixed matching of kp_sampler with kokkosp_end check

4b3bda1

Fixing scan and reduce with matching on kokkosp_end

249b072

Fixing scan and reduce with matching on kokkosp_end using .end() unordered_map condition

applied clang format

8582e73

vlkale marked this pull request as ready for review September 7, 2023 22:55

vlkale marked this pull request as draft September 7, 2023 22:55

vlkale marked this pull request as ready for review September 13, 2023 16:31

crtrott approved these changes Sep 14, 2023

crtrott merged commit 200a1c0 into kokkos:develop Sep 14, 2023
5 checks passed

vlkale deleted the fixMatchingInSampler branch October 27, 2023 18:48
