Allow for randomized sampling #181

Closed
wants to merge 24 commits into from


vlkale (Contributor) commented Apr 21, 2023

Fix #180.

The sampler will let the user choose either periodic sampling or random sampling via an environment variable or kokkos-tools-args.

The solution should also allow a combination of both (e.g., on every 20th invocation of a Kokkos::parallel_for, gather the time spent with probability 63%).

Added tool_random_mode and tool_periodic_mode to identify whether the tool uses periodic sampling, random sampling, or a combination of both (e.g., on every 20th timestep, gather data with 50% probability).
vlkale requested a review from crtrott April 21, 2023 01:36
vlkale self-assigned this Apr 21, 2023
vlkale marked this pull request as ready for review July 20, 2023 20:31
vlkale (Contributor Author) commented Aug 5, 2023

Below is sample output from two independent runs showing sampling of the kernel logger. The sampler skip rate is set to 1 (every other Kokkos kernel invocation is eligible to be profiled/logged). The sampler probability, the new environment variable and feature in this PR, is set to 1.0%: each eligible kernel invocation has a 1% chance of being logged, independently of every other kernel invocation.

The two runs produce different numbers of samples, showing that the sampling is random and non-deterministic. The number of samples is roughly as expected: with a skip rate of 1, at most half of the 600 sets of 4 kernel invocations of stream are eligible, i.e., 600*4/2 = 1200 kernels; applying the 1% probability to these 1200 gives about 12 samples. The sampler logs on the order of 12 kernels in the output below, and other runs may log somewhat more or fewer (e.g., 14 or 15).

user@system stream % export KOKKOS_TOOLS_SAMPLER_SKIP=1; export KOKKOS_TOOLS_SAMPLER_PROBABILITY=1; export KOKKOS_TOOLS_LIBS="${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/common/kokkos-sampler/libkp_kokkos_sampler.dylib;${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/debugging/kernel-logger/libkp_kernel_logger.dylib" ; ./stream600iters.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with the specified sampling probability applied to the specified sampling skip rate set.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 600 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP:     copy
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP:     scale
KokkosP: Execution of kernel 1 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP:     scale
KokkosP: Execution of kernel 2 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 3
KokkosP:     scale
KokkosP: Execution of kernel 3 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 4
KokkosP:     scale
KokkosP: Execution of kernel 4 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 5
KokkosP:     scale
KokkosP: Execution of kernel 5 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 6
KokkosP:     add
KokkosP: Execution of kernel 6 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 7
KokkosP:     add
KokkosP: Execution of kernel 7 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 8
KokkosP:     set
KokkosP: Execution of kernel 8 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 9
KokkosP:     set
KokkosP: Execution of kernel 9 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 10
KokkosP:     set
KokkosP: Execution of kernel 10 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                62636.41 MB/s
Copy               74484.72 MB/s
Scale              74474.17 MB/s
Add                83036.60 MB/s
Triad              82634.90 MB/s
-----------------------------------------------------------

user@system stream % export KOKKOS_TOOLS_SAMPLER_SKIP=1; export KOKKOS_TOOLS_SAMPLER_PROBABILITY=1; export KOKKOS_TOOLS_LIBS="${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/common/kokkos-sampler/libkp_kokkos_sampler.dylib;${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/debugging/kernel-logger/libkp_kernel_logger.dylib" ; ./stream600iters.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with the specified sampling probability applied to the specified sampling skip rate set.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 600 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP:     set
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP:     add
KokkosP: Execution of kernel 1 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP:     triad
KokkosP: Execution of kernel 2 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 3
KokkosP:     scale
KokkosP: Execution of kernel 3 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 4
KokkosP:     scale
KokkosP: Execution of kernel 4 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 5
KokkosP:     add
KokkosP: Execution of kernel 5 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 6
KokkosP:     scale
KokkosP: Execution of kernel 6 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 7
KokkosP:     copy
KokkosP: Execution of kernel 7 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 8
KokkosP:     add
KokkosP: Execution of kernel 8 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                62635.39 MB/s
Copy               74936.19 MB/s
Scale              74629.77 MB/s
Add                83065.58 MB/s
Triad              82659.57 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.

vlkale (Contributor Author) commented Aug 5, 2023

The two outputs below come from two different runs that use no skipping of logging/profiling, unlike the previous case, where every other kernel invocation was logged/profiled (i.e., export KOKKOS_TOOLS_SAMPLER_SKIP=1; is changed to export KOKKOS_TOOLS_SAMPLER_SKIP=0;). A sampling probability of 1% is still applied to each kernel invocation by keeping export KOKKOS_TOOLS_SAMPLER_PROBABILITY=1; as in the previous case. Roughly twice as many samples are taken as in the previous case, which is expected: of the 600*4 = 2400 kernel invocations, about 24 (1%) are randomly sampled. Again, the two runs at the same sampling probability sample different kernel invocations.

Run 1

user@system stream % export KOKKOS_TOOLS_SAMPLER_SKIP=0; export KOKKOS_TOOLS_SAMPLER_PROBABILITY=1; export KOKKOS_TOOLS_LIBS="${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/common/kokkos-sampler/libkp_kokkos_sampler.dylib;${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/debugging/kernel-logger/libkp_kernel_logger.dylib" ; ./stream600iters.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Note that both probability and skip rate is above 100. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with the specified sampling probability applied to the specified sampling skip rate set.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 600 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP:     triad
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP:     scale
KokkosP: Execution of kernel 1 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP:     set
KokkosP: Execution of kernel 2 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 3
KokkosP:     set
KokkosP: Execution of kernel 3 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 4
KokkosP:     copy
KokkosP: Execution of kernel 4 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 5
KokkosP:     copy
KokkosP: Execution of kernel 5 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 6
KokkosP:     scale
KokkosP: Execution of kernel 6 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 7
KokkosP:     copy
KokkosP: Execution of kernel 7 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 8
KokkosP:     scale
KokkosP: Execution of kernel 8 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 9
KokkosP:     scale
KokkosP: Execution of kernel 9 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 10
KokkosP:     scale
KokkosP: Execution of kernel 10 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 11
KokkosP:     triad
KokkosP: Execution of kernel 11 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 12
KokkosP:     set
KokkosP: Execution of kernel 12 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 13
KokkosP:     triad
KokkosP: Execution of kernel 13 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 14
KokkosP:     add
KokkosP: Execution of kernel 14 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 15
KokkosP:     scale
KokkosP: Execution of kernel 15 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 16
KokkosP:     add
KokkosP: Execution of kernel 16 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 17
KokkosP:     triad
KokkosP: Execution of kernel 17 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 18
KokkosP:     add
KokkosP: Execution of kernel 18 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 19
KokkosP:     triad
KokkosP: Execution of kernel 19 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 20
KokkosP:     set
KokkosP: Execution of kernel 20 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 21
KokkosP:     triad
KokkosP: Execution of kernel 21 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 22
KokkosP:     scale
KokkosP: Execution of kernel 22 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 23
KokkosP:     copy
KokkosP: Execution of kernel 23 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 24
KokkosP:     triad
KokkosP: Execution of kernel 24 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 25
KokkosP:     triad
KokkosP: Execution of kernel 25 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 26
KokkosP:     copy
KokkosP: Execution of kernel 26 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                62644.78 MB/s
Copy               74481.68 MB/s
Scale              74779.31 MB/s
Add                82995.33 MB/s
Triad              82621.87 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.

Run 2


user@system stream % export KOKKOS_TOOLS_SAMPLER_SKIP=0; export KOKKOS_TOOLS_SAMPLER_PROBABILITY=1; export KOKKOS_TOOLS_LIBS="${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/common/kokkos-sampler/libkp_kokkos_sampler.dylib;${MY_HOME_DIR}/ktools/kto-dev-vk/vbuild/debugging/kernel-logger/libkp_kernel_logger.dylib" ; ./stream600iters.exe
-------------------------------------------------------------
Kokkos STREAM Benchmark
-------------------------------------------------------------
KokkosP: Kernel Logger Library Initialized (sequence is 1, version: 20211015)
KokkosP: Note that both probability and skip rate are set. The Kokkos Tools Sampler utility will invoke a Kokkos Tool child event you specified (e.g., the profiler or debugger tool connector you specified in KOKKOS_TOOLS_LIBS) with the specified sampling probability applied to the specified sampling skip rate set.
Reports fastest timing per kernel
Creating Views...
Memory Sizes:
- Array Size:    100000000
- Per Array:           800.00 MB
- Total:              2400.00 MB
Benchmark kernels will be performed for 600 iterations.
-------------------------------------------------------------
Initializing Views...
Starting benchmarking...
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 0
KokkosP:     scale
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 1
KokkosP:     triad
KokkosP: Execution of kernel 1 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 2
KokkosP:     triad
KokkosP: Execution of kernel 2 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 3
KokkosP:     set
KokkosP: Execution of kernel 3 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 4
KokkosP:     copy
KokkosP: Execution of kernel 4 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 5
KokkosP:     copy
KokkosP: Execution of kernel 5 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 6
KokkosP:     set
KokkosP: Execution of kernel 6 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 7
KokkosP:     add
KokkosP: Execution of kernel 7 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 8
KokkosP:     scale
KokkosP: Execution of kernel 8 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 9
KokkosP:     set
KokkosP: Execution of kernel 9 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 10
KokkosP:     scale
KokkosP: Execution of kernel 10 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 11
KokkosP:     copy
KokkosP: Execution of kernel 11 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 12
KokkosP:     triad
KokkosP: Execution of kernel 12 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 13
KokkosP:     scale
KokkosP: Execution of kernel 13 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 14
KokkosP:     set
KokkosP: Execution of kernel 14 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 15
KokkosP:     triad
KokkosP: Execution of kernel 15 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 16
KokkosP:     set
KokkosP: Execution of kernel 16 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 17
KokkosP:     triad
KokkosP: Execution of kernel 17 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 18
KokkosP:     set
KokkosP: Execution of kernel 18 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 19
KokkosP:     copy
KokkosP: Execution of kernel 19 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 20
KokkosP:     copy
KokkosP: Execution of kernel 20 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 21
KokkosP:     scale
KokkosP: Execution of kernel 21 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 22
KokkosP:     triad
KokkosP: Execution of kernel 22 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 23
KokkosP:     add
KokkosP: Execution of kernel 23 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 24
KokkosP:     scale
KokkosP: Execution of kernel 24 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 25
KokkosP:     scale
KokkosP: Execution of kernel 25 is completed.
KokkosP: Executing parallel-for kernel on device 100663297 with unique execution identifier 26
KokkosP:     triad
KokkosP: Execution of kernel 26 is completed.
Performing validation...
All solutions checked and verified.
-------------------------------------------------------------
Set                62695.52 MB/s
Copy               74341.52 MB/s
Scale              74316.78 MB/s
Add                83041.27 MB/s
Triad              82691.61 MB/s
-------------------------------------------------------------
KokkosP: Kokkos library finalization called.

vlkale (Contributor Author) commented Aug 5, 2023

One note on the previous runs: the device ID is wrong. This is addressed in another PR.

static uint64_t kernelSampleSkip = 101;  // Default skip rate of every 100 invocations
static float tool_prob_num = 1.0;  // Default probability of 1 percent of all invocations
Member


Set the defaults to the maximum value of uint64_t for kernelSampleSkip and to -1 for tool_prob_num.

"sampling probability to 0 percent; none of the invocations of "
"a Kokkos Kernel will be profiled.\n");
tool_prob_num = 0.0;
}
Member


If tool_prob_num < 0 and kernelSampleSkip is the maximum value of uint64_t (i.e., neither was set), set tool_prob_num to 10.0.

Add an error check verifying that they are not both set.

Contributor Author


The tool_prob_num default has been changed to the requested value. The kernelSampleSkip default is part of a new PR that focuses solely on correctly matching sampled kernels.

Defaults: maximum uint64_t for kernelSampleSkip and -1.0 for tool_prob_num.
If both are set, only the probability is used.

Note: an alternative is to exit gracefully. Feedback is welcome here.
vlkale closed this Oct 13, 2023
vlkale deleted the allow-randomized-sampling branch October 27, 2023 18:45
vlkale restored the allow-randomized-sampling branch October 27, 2023 18:45
vlkale deleted the allow-randomized-sampling branch October 27, 2023 18:48
Development

Successfully merging this pull request may close these issues.

Sampler needs to allow for randomized sampling (not just periodic)
2 participants