Fix matching begin with end callbacks on sampler #206

Merged on Sep 14, 2023 (5 commits)


@vlkale vlkale commented Sep 7, 2023

Fixes the sampler so that each sampled begin callback is matched with its corresponding end callback. This matching is essential for child callbacks and for using the sampler utility with connectors such as space-time-stack.
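To illustrate what matching means here, the following is a minimal sketch and not the code in this PR. `ChildTool`, `Sampler`, `begin_kernel`, and `end_kernel` are hypothetical names, and the rule of forwarding every (skip + 1)-th invocation is an assumption based on the description in this conversation. The idea is that the sampler records which of its own kernel IDs were forwarded to the child, and forwards an end callback only when the matching begin was forwarded.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a child connector (e.g. NVTX or space-time-stack).
// It records every begin/end it receives so the pairing can be inspected.
struct ChildTool {
  std::uint64_t next_id = 100;  // child IDs are independent of sampler IDs
  std::vector<std::uint64_t> begun, ended;
  std::uint64_t begin() { begun.push_back(next_id); return next_id++; }
  void end(std::uint64_t child_id) { ended.push_back(child_id); }
};

// Hypothetical sampler: forwards every (skip + 1)-th begin to the child and
// remembers which of its own kernel IDs were forwarded.
class Sampler {
 public:
  Sampler(ChildTool& child, std::uint64_t skip)
      : child_(child), period_(skip + 1) {
    assert(period_ != 0);  // guard against skip + 1 wrapping around to zero
  }

  // Returns the sampler-side kernel ID handed back to the caller.
  std::uint64_t begin_kernel() {
    const std::uint64_t kid = ++invocation_;
    if (invocation_ % period_ == 0) {
      sampled_[kid] = child_.begin();  // remember sampler ID -> child ID
    }
    return kid;
  }

  // Forwards the end callback only if the matching begin was forwarded.
  void end_kernel(std::uint64_t kid) {
    auto it = sampled_.find(kid);
    if (it == sampled_.end()) return;  // begin was skipped, so skip end too
    child_.end(it->second);
    sampled_.erase(it);
  }

 private:
  ChildTool& child_;
  std::uint64_t period_;
  std::uint64_t invocation_ = 0;
  std::unordered_map<std::uint64_t, std::uint64_t> sampled_;
};
```

With this structure the child never sees an end without a preceding begin, which is the property a stack-based connector relies on.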

@vlkale vlkale added the bug label Sep 7, 2023
@vlkale vlkale requested a review from crtrott September 7, 2023 19:11
@vlkale vlkale self-assigned this Sep 7, 2023
@vlkale vlkale marked this pull request as ready for review September 7, 2023 19:39

vlkale commented Sep 7, 2023

I ran this fix on Perlmutter with the stream benchmark built with g++, using 20 iterations and a sampling skip of 101 with the NVTX connector. The output below suggests this is working as expected, since almost no child calls are being made: a single matched begin/end pair (sample 101) reaches the connector. Specifically, see the bold text.

vkale3@perlmutter:login39:~/kks/benchmarks/stream> nsys nvprof stream.cuda 
WARNING: stream.cuda and any of its children processes will be profiled.

-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/homes/v/vkale3/kto-inst-dir2/lib64/libkp_nvprof_connector.so
KokkosP: Loading child library ..
-----------------------------------------------------------
KokkosP: NVTX Analyzer Connector (sequence is 1, version: 20211015)
-----------------------------------------------------------
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       yes
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: (null)
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 101 calling child-begin function...
KokkosP: sample 101 calling child-end function...
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set               917720.61 MB/s
Copy             1244815.54 MB/s
Scale            1246769.11 MB/s
Add              1387866.46 MB/s
Triad            1389048.85 MB/s
-------------------------------------------------------------
-----------------------------------------------------------
KokkosP: Finalization of NVTX Connector. Complete.
-----------------------------------------------------------
Generating '/tmp/nsys-report-8811.qdstrm'
[1/7] [========================100%] report44.nsys-rep
[2/7] [========================100%] report44.sqlite
[3/7] Executing 'nvtxsum' stats report

NVTX Range Statistics:

 **Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)   Style   Range
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -------  -----
    100.0           55,073          1  55,073.0  55,073.0    55,073    55,073          0.0  PushPop  add**  

[4/7] Executing 'cudaapisum' stats report

CUDA API Statistics:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ----------------------
     51.2      480,295,783        118   4,070,303.2   1,300,278.5     1,222  337,006,312  30,916,572.7  cudaDeviceSynchronize 
     47.6      446,576,634          8  55,822,079.3  60,733,341.5    29,586  111,013,217  40,816,341.5  cudaMemcpy            
      0.3        3,126,454         10     312,645.4     158,728.0     1,002      770,875     304,597.9  cudaStreamSynchronize 
      0.3        2,815,030          8     351,878.8     129,959.0     5,560      887,825     412,313.0  cudaMalloc            
      0.2        1,964,504          8     245,563.0     109,495.0     2,966      766,978     292,534.6  cudaFree              
      0.1        1,027,617          2     513,808.5     513,808.5     8,506    1,019,111     714,605.6  cudaHostAlloc         
      0.1          686,971        102       6,735.0       5,775.5     5,200       38,682       4,621.7  cudaLaunchKernel      
      0.1          517,430          4     129,357.5      91,536.5    12,253      322,104     144,097.5  cudaMemcpyToSymbol    
      0.0          389,289          3     129,763.0       6,752.0     6,382      376,155     213,381.8  cudaFreeHost          
      0.0          117,983          7      16,854.7      14,677.0     7,645       41,488      11,344.2  cudaMemcpyAsync       
      0.0           47,810          3      15,936.7      14,888.0    13,525       19,397       3,073.3  cudaMemsetAsync       
      0.0           39,134          2      19,567.0      19,567.0    14,277       24,857       7,481.2  cudaStreamCreate      
      0.0           36,639          2      18,319.5      18,319.5    14,287       22,352       5,702.8  cudaMemset            
      0.0           10,550          1      10,550.0      10,550.0    10,550       10,550           0.0  cudaEventCreate       
      0.0            9,638          1       9,638.0       9,638.0     9,638        9,638           0.0  cudaMallocHost        
      0.0            9,308          2       4,654.0       4,654.0     4,118        5,190         758.0  cudaStreamDestroy     
      0.0            3,116          1       3,116.0       3,116.0     3,116        3,116           0.0  cudaEventDestroy      
      0.0            2,004          1       2,004.0       2,004.0     2,004        2,004           0.0  cuModuleGetLoadingMode

[5/7] Executing 'gpukernsum' stats report

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     26.3       28,723,687         20  1,436,184.4  1,436,250.0  1,433,450  1,439,018      1,559.1  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_add(Kokkos::…
     26.2       28,678,949         20  1,433,947.5  1,433,978.0  1,431,177  1,437,194      1,581.1  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_triad(Kokkos…
     18.1       19,797,097         20    989,854.9    990,006.5    987,591    992,263      1,284.8  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_copy(Kokkos:…
     18.1       19,786,538         20    989,326.9    989,159.0    987,239    992,039      1,254.2  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_scale(Kokkos…
     10.7       11,641,941         20    582,097.1    582,084.0    581,956    582,341        103.8  void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<perform_set(Kokkos::…
      0.6          631,845          1    631,845.0    631,845.0    631,845    631,845          0.0  desul::<unnamed>::init_lock_arrays_cuda_kernel()                                                    
      0.0            3,168          1      3,168.0      3,168.0      3,168      3,168          0.0  Kokkos::Impl::<unnamed>::query_cuda_kernel_arch(int *)                                              

[6/7] Executing 'gpumemtimesum' stats report

CUDA Memory Operation Statistics (by time):

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)       Operation     
 --------  ---------------  -----  ------------  ------------  --------  -----------  ------------  ------------------
     62.0      277,183,501      6  46,197,250.2  31,652,733.0     2,304  110,928,229  53,112,441.5  [CUDA memcpy DtoH]
     37.8      168,865,979     13  12,989,690.7       1,888.0     1,472   61,447,213  24,877,177.9  [CUDA memcpy HtoD]
      0.3        1,269,162      5     253,832.4     419,939.0     3,520      421,987     228,474.6  [CUDA memset]     

[7/7] Executing 'gpumemsizesum' stats report

CUDA Memory Operation Statistics (by size):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
  2,400.893      5   480.179   800.000     0.008   800.000      437.934  [CUDA memset]     
  2,400.001     13   184.615     0.000     0.000   800.000      350.823  [CUDA memcpy HtoD]
  2,400.000      6   400.000   400.000     0.000   800.000      438.178  [CUDA memcpy DtoH]

Generated:
    /global/u1/v/vkale3/kks/benchmarks/stream/report44.nsys-rep
    /global/u1/v/vkale3/kks/benchmarks/stream/report44.sqlite
vkale3@perlmutter:login39:~/kks/benchmarks/stream> 


vlkale commented Sep 7, 2023

Here is a screenshot from Perlmutter for the same default stream run in the case when the sampler skip rate is set to 3. This looks correct, given that 3 invocations are skipped and every 4th is sampled. With 4 different stream kernels across the 20 outer iterations of the application (80 invocations in total), the number of samples is 20, and this is shown in the output. When the sampler skip rate is set to 7, the number of samples drops to half of that for a skip rate of 3, as shown in the second screenshot. This shows the sampling is being done correctly with the appropriate matching.

Screenshot 2023-09-07 at 12 54 12 PM Screenshot 2023-09-07 at 12 49 40 PM
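The sample counts quoted above can be checked with a little arithmetic. This is only a sketch under the stated description (skip `skip` invocations, then take the next one, counting over all 4 x 20 = 80 invocations). `expected_samples` is a hypothetical helper, not a function from Kokkos Tools.

```cpp
#include <cstdint>

// Number of invocations forwarded when `skip` invocations are skipped
// between consecutive samples, i.e. every (skip + 1)-th one is taken.
constexpr std::uint64_t expected_samples(std::uint64_t invocations,
                                         std::uint64_t skip) {
  return invocations / (skip + 1);
}

// 4 kernels x 20 outer iterations, the figures quoted in the comment above.
constexpr std::uint64_t kInvocations = 4 * 20;

static_assert(expected_samples(kInvocations, 3) == 20, "every 4th of 80");
static_assert(expected_samples(kInvocations, 7) == 10, "every 8th of 80");
```

Both checks are evaluated at compile time, so the block compiling is the confirmation that a skip rate of 7 yields half as many samples as a skip rate of 3.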

@vlkale vlkale marked this pull request as draft September 7, 2023 20:05
vlkale and others added 4 commits September 7, 2023 15:05
Use invocation number counter to sample per kernel rather than across all possible kernel invocations
Fixing scan and reduce with matching on kokkosp_end using .end() unordered_map condition
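As one reading of the first commit message above, per-kernel sampling could look roughly like the sketch below. This is an interpretation and not the committed code: `PerKernelCounter` and `should_sample` are hypothetical names, and keying the counters on the kernel name is an assumption.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-kernel sampler: each kernel name has its own invocation
// counter, so a frequently launched kernel cannot crowd out the others.
class PerKernelCounter {
 public:
  explicit PerKernelCounter(std::uint64_t skip) : period_(skip + 1) {}

  // Returns true when this invocation of `name` should be sampled.
  bool should_sample(const std::string& name) {
    const std::uint64_t n = ++counts_[name];  // operator[] starts new keys at 0
    return period_ != 0 && n % period_ == 0;
  }

 private:
  std::uint64_t period_;
  std::unordered_map<std::string, std::uint64_t> counts_;
};
```

With a skip of 3 and 20 invocations of each kernel, every kernel would contribute 5 samples, regardless of the order in which the kernels are launched.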

vlkale commented Sep 7, 2023

I have added two updated sample outputs of stream based on the latest state of this PR.

Here is a screenshot from Perlmutter for the same default stream run in the case when the sampler skip rate is set to 3. This looks correct, given that 3 invocations are skipped and every 4th is sampled. With 4 different stream kernels across the 20 outer iterations of the application, the number of samples is 20, and this is shown in the output.

Uploading Screenshot 2023-09-07 at 3.53.31 PM.png…

When the sampler skip rate is set to 7, the number of samples drops to half of that for a skip rate of 3, as shown in the second screenshot. This shows the sampling is being done correctly with the appropriate matching.

Screenshot 2023-09-07 at 3 35 08 PM

The following is a completed run of the sampler on stream using the develop branch's space-time-stack Kokkos Tools connector. The output shows that the run completed successfully. Note that this was run on a CPU on Perlmutter.

Screenshot 2023-09-07 at 3 44 12 PM

@vlkale vlkale marked this pull request as ready for review September 7, 2023 22:55
@vlkale vlkale marked this pull request as draft September 7, 2023 22:55
@vlkale vlkale marked this pull request as ready for review September 13, 2023 16:31
@crtrott crtrott merged commit 200a1c0 into kokkos:develop Sep 14, 2023
5 checks passed
@vlkale vlkale deleted the fixMatchingInSampler branch October 27, 2023 18:48