Fence on sample only #209

Merged
merged 13 commits into from
Oct 12, 2023


vlkale
Contributor

@vlkale vlkale commented Sep 14, 2023

This PR fences only when a sample is taken, i.e., at the beginning of the sample in kokkosp_begin_xyz(mykID) and at the end of the corresponding sample in kokkosp_end_xyz(mykID). This makes Kokkos Tools more efficient when sampling is enabled, because events that are not sampled no longer incur a fence.
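
The behavior described above can be illustrated with a minimal sketch. It is not the code in this PR: `SamplerSketch`, its members, and the placeholder `fence()` are hypothetical names, and the rule "sample every (skip + 1)-th begin callback" is an assumption inferred from the logs in this thread (skip 7 samples calls 8, 16, 24, ...).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of "fence only on sampled events"; not the PR's code.
struct SamplerSketch {
  std::uint64_t skip;                    // e.g. value of KOKKOS_TOOLS_SAMPLER_SKIP
  std::uint64_t calls  = 0;              // begin callbacks seen so far
  std::uint64_t fences = 0;              // tool-induced fences issued so far
  std::vector<std::uint64_t> sampled;    // call numbers that were sampled
  std::unordered_set<std::uint64_t> open;  // kIDs whose begin was sampled

  // Placeholder for the real tool-induced fence.
  void fence() { ++fences; }

  // Stand-in for kokkosp_begin_xyz: fence only if this call is a sample.
  void begin(std::uint64_t* kID) {
    *kID = ++calls;
    if (calls % (skip + 1) != 0) return;  // not sampled: no fence, no child call
    fence();
    sampled.push_back(calls);
    open.insert(*kID);
  }

  // Stand-in for kokkosp_end_xyz: fence only if the matching begin was sampled.
  void end(std::uint64_t kID) {
    auto it = open.find(kID);
    if (it == open.end()) return;  // matching begin was skipped
    assert(fences > 0);            // a sampled begin has already fenced
    fence();
    open.erase(it);
  }
};
```

With a skip of 7, running 24 begin/end pairs through this sketch samples calls 8, 16, and 24 and issues six fences in total, rather than one fence per callback.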

Note that the "provide tool programming interface" callback must be exposed in profiling/all/kp_core.hpp. This wasn't done previously, i.e., it is not in the develop branch of Kokkos Tools, and it is useful for other tools that need the tools programming interface.

Notes from PR #194 are relevant to this PR.

vlkale and others added 8 commits September 13, 2023 14:41
Putting in tool_invoked_fence code.
Fixing tool-induced fences to always fence on the device with DevID 0.

Fencing with DevID will be done in a subsequent patch (where a Pair object will be used in the hash table to capture the begin sample's information).

Note that the pair/tuple object can capture other state information to store between the beginning of a sampling event and the end of it.
@vlkale vlkale requested a review from crtrott September 14, 2023 18:32
@vlkale vlkale marked this pull request as ready for review September 14, 2023 18:32
Contributor Author

vlkale commented Sep 14, 2023

Output for the stream benchmark with the Kokkos CUDA backend on Perlmutter, with the sampler applied to the kernel logger and fences turned on. The output shows that the change gives the correct sampler behavior.

vkale3@perlmutter:login32:~/kks/benchmarks/stream> export KOKKOS_TOOLS_GLOBALFENCES=1; export KOKKOS_TOOLS_SAMPLER_VERBOSE=1; export KOKKOS_TOOLS_LIBS="/global/homes/v/vkale3/kto-dev9152023/common/kokkos-sampler/kp_sampler.so;/global/homes/v/vkale3/kto3-install/lib64/libkp_kernel_logger.so"; export KOKKOS_TOOLS_SAMPLER_SKIP=7; ./stream.cuda; 
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/homes/v/vkale3/kto3-install/lib64/libkp_kernel_logger.so
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       no
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 7
KokkosP: finished kptpi
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 8 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 0
KokkosP:     set
KokkosP: sample 8 calling child-end function...
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 16 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 1
KokkosP:     add
KokkosP: sample 16 calling child-end function...
KokkosP: Execution of kernel 1 is completed.
KokkosP: sample 24 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 2
KokkosP:     copy
KokkosP: sample 24 calling child-end function...
KokkosP: Execution of kernel 2 is completed.
KokkosP: sample 32 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 3
KokkosP:     triad
KokkosP: sample 32 calling child-end function...
KokkosP: Execution of kernel 3 is completed.
KokkosP: sample 40 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 4
KokkosP:     scale
KokkosP: sample 40 calling child-end function...
KokkosP: Execution of kernel 4 is completed.
KokkosP: sample 48 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 5
KokkosP:     set
KokkosP: sample 48 calling child-end function...
KokkosP: Execution of kernel 5 is completed.
KokkosP: sample 56 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 6
KokkosP:     add
KokkosP: sample 56 calling child-end function...
KokkosP: Execution of kernel 6 is completed.
KokkosP: sample 64 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 7
KokkosP:     copy
KokkosP: sample 64 calling child-end function...
KokkosP: Execution of kernel 7 is completed.
KokkosP: sample 72 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 8
KokkosP:     triad
KokkosP: sample 72 calling child-end function...
KokkosP: Execution of kernel 8 is completed.
KokkosP: sample 80 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 9
KokkosP:     scale
KokkosP: sample 80 calling child-end function...
KokkosP: Execution of kernel 9 is completed.
KokkosP: sample 88 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 10
KokkosP:     set
KokkosP: sample 88 calling child-end function...
KokkosP: Execution of kernel 10 is completed.
KokkosP: sample 96 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 11
KokkosP:     add
KokkosP: sample 96 calling child-end function...
KokkosP: Execution of kernel 11 is completed.
KokkosP: sample 104 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 12
KokkosP:     copy
KokkosP: sample 104 calling child-end function...
KokkosP: Execution of kernel 12 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set              1359462.20 MB/s
Copy             1375477.98 MB/s
Scale            1377221.95 MB/s
Add              1381316.43 MB/s
Triad            1392428.78 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.

Member

@crtrott crtrott left a comment

Only read the environment variable during Init

common/kokkos-sampler/kp_sampler_skip.cpp
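
One way to follow the review comment is to read the setting once at initialization and cache it, so that the begin/end callbacks never call `getenv`. The sketch below is only an illustration: the namespace, `parse_flag`, `init`, and `fences_enabled` are hypothetical names, and treating any value other than unset or "0" as "enabled" is an assumption, not the documented semantics of KOKKOS_TOOLS_GLOBALFENCES.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

namespace env_sketch {

// Cached at init; the hot-path callbacks only read this flag.
static bool g_fences_enabled = false;

// Assumed parsing rule: unset or "0" means off, anything else means on.
inline bool parse_flag(const char* value) {
  return value != nullptr && std::strcmp(value, "0") != 0;
}

// Stand-in for the tool's init callback: the only place getenv is called.
inline void init() {
  g_fences_enabled = parse_flag(std::getenv("KOKKOS_TOOLS_GLOBALFENCES"));
}

// What a begin/end callback would consult instead of the environment.
inline bool fences_enabled() { return g_fences_enabled; }

}  // namespace env_sketch
```

Caching the value keeps the per-kernel callbacks cheap and makes the setting fixed for the lifetime of the run.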
Passing devID to invoke_ktools_fence() instead of 0 is in a separate PR. Checking that the fence is done only on devID hasn't been tested in this PR and isn't directly related to this PR.
Contributor Author

vlkale commented Oct 5, 2023

Here is another test with the globFence check at runtime taken out, i.e., the latest change as advised by @crtrott. The sampler skip rate is set to 2. This shows the output of Kokkos stream with the CUDA backend, using the Kokkos sampler applied to the kernel logger, on Perlmutter. Note that the device being printed out is not the physical device ID (on, e.g., a node of a supercomputer) but a Kokkos execution space identifier.

vkale3@perlmutter:login12:~/kks/benchmarks/stream> export KOKKOS_TOOLS_SAMPLER_SKIP=2; export KOKKOS_TOOLS_SAMPLER_VERBOSE=1; export KOKKOS_TOOLS_LIBS="/global/homes/v/vkale3/kto-dev-vlk/common/kokkos-sampler/kp_sampler.so;/global/homes/v/vkale3/kinst06/lib64/libkp_kernel_logger.so;/global/homes/v/vkale3/kinst06/lib64/libkp_kernel_timer.so;/global/homes/v/vkale3/kinst06/lib64/libkp_memory_usage.so;"; ./stream.cuda 
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/homes/v/vkale3/kinst06/lib64/libkp_kernel_logger.so
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       no
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 2
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
KokkosP: sample 3 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 0
KokkosP:     Kokkos::View::initialization [c] via memset
KokkosP: sample 3 calling child-end function...
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 6 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 1 with unique execution identifier 1
KokkosP:     Kokkos::View::initialization [c_mirror] via memset
KokkosP: sample 6 calling child-end function...
KokkosP: Execution of kernel 1 is completed.
Initializing Views...
Starting benchmarking...
KokkosP: sample 9 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 2
KokkosP:     copy
KokkosP: sample 9 calling child-end function...
KokkosP: Execution of kernel 2 is completed.
KokkosP: sample 12 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 3
KokkosP:     triad
KokkosP: sample 12 calling child-end function...
KokkosP: Execution of kernel 3 is completed.
KokkosP: sample 15 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 4
KokkosP:     scale
KokkosP: sample 15 calling child-end function...
KokkosP: Execution of kernel 4 is completed.
KokkosP: sample 18 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 5
KokkosP:     set
KokkosP: sample 18 calling child-end function...
KokkosP: Execution of kernel 5 is completed.
KokkosP: sample 21 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 6
KokkosP:     add
KokkosP: sample 21 calling child-end function...
KokkosP: Execution of kernel 6 is completed.
KokkosP: sample 24 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 7
KokkosP:     copy
KokkosP: sample 24 calling child-end function...
KokkosP: Execution of kernel 7 is completed.
KokkosP: sample 27 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 8
KokkosP:     triad
KokkosP: sample 27 calling child-end function...
KokkosP: Execution of kernel 8 is completed.
KokkosP: sample 30 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 9
KokkosP:     scale
KokkosP: sample 30 calling child-end function...
KokkosP: Execution of kernel 9 is completed.
KokkosP: sample 33 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 10
KokkosP:     set
KokkosP: sample 33 calling child-end function...
KokkosP: Execution of kernel 10 is completed.
KokkosP: sample 36 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 11
KokkosP:     add
KokkosP: sample 36 calling child-end function...
KokkosP: Execution of kernel 11 is completed.
KokkosP: sample 39 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 12
KokkosP:     copy
KokkosP: sample 39 calling child-end function...
KokkosP: Execution of kernel 12 is completed.
KokkosP: sample 42 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 13
KokkosP:     triad
KokkosP: sample 42 calling child-end function...
KokkosP: Execution of kernel 13 is completed.
KokkosP: sample 45 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 14
KokkosP:     scale
KokkosP: sample 45 calling child-end function...
KokkosP: Execution of kernel 14 is completed.
KokkosP: sample 48 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 15
KokkosP:     set
KokkosP: sample 48 calling child-end function...
KokkosP: Execution of kernel 15 is completed.
KokkosP: sample 51 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 16
KokkosP:     add
KokkosP: sample 51 calling child-end function...
KokkosP: Execution of kernel 16 is completed.
KokkosP: sample 54 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 17
KokkosP:     copy
KokkosP: sample 54 calling child-end function...
KokkosP: Execution of kernel 17 is completed.
KokkosP: sample 57 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 18
KokkosP:     triad
KokkosP: sample 57 calling child-end function...
KokkosP: Execution of kernel 18 is completed.
KokkosP: sample 60 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 19
KokkosP:     scale
KokkosP: sample 60 calling child-end function...
KokkosP: Execution of kernel 19 is completed.
KokkosP: sample 63 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 20
KokkosP:     set
KokkosP: sample 63 calling child-end function...
KokkosP: Execution of kernel 20 is completed.
KokkosP: sample 66 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 21
KokkosP:     add
KokkosP: sample 66 calling child-end function...
KokkosP: Execution of kernel 21 is completed.
KokkosP: sample 69 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 22
KokkosP:     copy
KokkosP: sample 69 calling child-end function...
KokkosP: Execution of kernel 22 is completed.
KokkosP: sample 72 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 23
KokkosP:     triad
KokkosP: sample 72 calling child-end function...
KokkosP: Execution of kernel 23 is completed.
KokkosP: sample 75 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 24
KokkosP:     scale
KokkosP: sample 75 calling child-end function...
KokkosP: Execution of kernel 24 is completed.
KokkosP: sample 78 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 25
KokkosP:     set
KokkosP: sample 78 calling child-end function...
KokkosP: Execution of kernel 25 is completed.
KokkosP: sample 81 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 26
KokkosP:     add
KokkosP: sample 81 calling child-end function...
KokkosP: Execution of kernel 26 is completed.
KokkosP: sample 84 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 27
KokkosP:     copy
KokkosP: sample 84 calling child-end function...
KokkosP: Execution of kernel 27 is completed.
KokkosP: sample 87 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 28
KokkosP:     triad
KokkosP: sample 87 calling child-end function...
KokkosP: Execution of kernel 28 is completed.
KokkosP: sample 90 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 29
KokkosP:     scale
KokkosP: sample 90 calling child-end function...
KokkosP: Execution of kernel 29 is completed.
KokkosP: sample 93 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 30
KokkosP:     set
KokkosP: sample 93 calling child-end function...
KokkosP: Execution of kernel 30 is completed.
KokkosP: sample 96 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 31
KokkosP:     add
KokkosP: sample 96 calling child-end function...
KokkosP: Execution of kernel 31 is completed.
KokkosP: sample 99 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 32
KokkosP:     copy
KokkosP: sample 99 calling child-end function...
KokkosP: Execution of kernel 32 is completed.
KokkosP: sample 102 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 33
KokkosP:     triad
KokkosP: sample 102 calling child-end function...
KokkosP: Execution of kernel 33 is completed.
KokkosP: sample 105 calling child-begin function...
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 34
KokkosP:     scale
KokkosP: sample 105 calling child-end function...
KokkosP: Execution of kernel 34 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set              1001874.76 MB/s
Copy             1373499.02 MB/s
Scale            1373936.59 MB/s
Add              1395409.68 MB/s
Triad            1397289.61 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
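
To make the point about execution space identifiers concrete, here is a hedged sketch of how a value such as 33554433 could be unpacked. The bit layout used (8 type bits, 7 device bits, 17 instance bits, most to least significant) is an assumption about the Kokkos profiling interface and is not stated in this thread; `SpaceIdSketch`, `decode_space_id`, and `encode_space_id` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, not confirmed by this thread: [type:8][device:7][instance:17].
constexpr std::uint32_t kTypeBits     = 8;
constexpr std::uint32_t kDeviceBits   = 7;
constexpr std::uint32_t kInstanceBits = 17;
static_assert(kTypeBits + kDeviceBits + kInstanceBits == 32,
              "the three fields must fill 32 bits");

struct SpaceIdSketch {
  std::uint32_t type;      // index of the execution space type
  std::uint32_t device;    // physical device index
  std::uint32_t instance;  // execution space instance index
};

// Split a packed 32-bit identifier into its three assumed fields.
inline SpaceIdSketch decode_space_id(std::uint32_t id) {
  SpaceIdSketch out;
  out.instance = id & ((1u << kInstanceBits) - 1u);
  out.device   = (id >> kInstanceBits) & ((1u << kDeviceBits) - 1u);
  out.type     = id >> (kInstanceBits + kDeviceBits);
  return out;
}

// Inverse of decode_space_id, with range checks on each field.
inline std::uint32_t encode_space_id(const SpaceIdSketch& s) {
  assert(s.type < (1u << kTypeBits));
  assert(s.device < (1u << kDeviceBits));
  assert(s.instance < (1u << kInstanceBits));
  return (s.type << (kInstanceBits + kDeviceBits)) |
         (s.device << kInstanceBits) | s.instance;
}
```

Under this assumed layout, 33554433 (0x02000001) decodes to type index 2, device 0, instance 1, and the value 1 printed for the c_mirror initialization decodes to type index 0, device 0, instance 1. Which execution space each type index names is not established here.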

Contributor Author

vlkale commented Oct 5, 2023

This is the same setup as the previous post, but with KOKKOS_TOOLS_SAMPLER_VERBOSE set to 2 instead of 1 (and, per the command below, with KOKKOS_TOOLS_SAMPLER_SKIP=7 and KOKKOS_TOOLS_GLOBALFENCES=1). The print statement shows each invocation of the Kokkos Tools tool-induced fence, along with the physical device ID (converted from the execution space ID). We see from the output that the device ID is 0. This is correct, given that each begin/end tools callback invokes a tool-induced fence using the parameter 0.

vkale3@perlmutter:login12:~/kks/benchmarks/stream> export KOKKOS_TOOLS_SAMPLER_SKIP=7; export KOKKOS_TOOLS_SAMPLER_VERBOSE=2; export KOKKOS_TOOLS_GLOBALFENCES=1; export KOKKOS_TOOLS_LIBS="/global/homes/v/vkale3/kto-dev-vlk/common/kokkos-sampler/kp_sampler.so;/global/homes/v/vkale3/kinst06/lib64/libkp_kernel_logger.so;/global/homes/v/vkale3/kinst06/lib64/libkp_kernel_timer.so;/global/homes/v/vkale3/kinst06/lib64/libkp_memory_usage.so;"; ./stream.cuda 
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Next library to call: /global/homes/v/vkale3/kinst06/lib64/libkp_kernel_logger.so
KokkosP: Loading child library ..
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Function Status:
KokkosP: begin-parallel-for:      yes
KokkosP: begin-parallel-scan:     yes
KokkosP: begin-parallel-reduce:   yes
KokkosP: end-parallel-for:        yes
KokkosP: end-parallel-scan:       no
KokkosP: end-parallel-reduce:     yes
KokkosP: Sampling rate set to: 7
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 20 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: sample 8 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 0
KokkosP:     set
KokkosP: sample 8 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 0 is completed.
KokkosP: sample 16 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 1
KokkosP:     add
KokkosP: sample 16 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 1 is completed.
KokkosP: sample 24 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 2
KokkosP:     copy
KokkosP: sample 24 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 2 is completed.
KokkosP: sample 32 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 3
KokkosP:     triad
KokkosP: sample 32 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 3 is completed.
KokkosP: sample 40 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 4
KokkosP:     scale
KokkosP: sample 40 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 4 is completed.
KokkosP: sample 48 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 5
KokkosP:     set
KokkosP: sample 48 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 5 is completed.
KokkosP: sample 56 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 6
KokkosP:     add
KokkosP: sample 56 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 6 is completed.
KokkosP: sample 64 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 7
KokkosP:     copy
KokkosP: sample 64 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 7 is completed.
KokkosP: sample 72 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 8
KokkosP:     triad
KokkosP: sample 72 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 8 is completed.
KokkosP: sample 80 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 9
KokkosP:     scale
KokkosP: sample 80 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 9 is completed.
KokkosP: sample 88 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 10
KokkosP:     set
KokkosP: sample 88 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 10 is completed.
KokkosP: sample 96 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 11
KokkosP:     add
KokkosP: sample 96 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 11 is completed.
KokkosP: sample 104 calling child-begin function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Executing parallel-for kernel on device 33554433 with unique execution identifier 12
KokkosP:     copy
KokkosP: sample 104 calling child-end function...
KokkosP: Sampler utility sucessfully invoked  tool-induced fence on device 0
KokkosP: Execution of kernel 12 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set              1359342.08 MB/s
Copy             1374741.70 MB/s
Scale            1375534.74 MB/s
Add              1399208.05 MB/s
Triad            1402336.88 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.
vkale3@perlmutter:login12:~/kks/benchmarks/stream> 

@crtrott crtrott merged commit 2ddedef into kokkos:develop Oct 12, 2023
7 checks passed
@vlkale vlkale deleted the fenceOnSampleOnly branch October 27, 2023 18:46