Merge remote-tracking branch 'upstream/main' into fea/vshmem
elstehle committed Nov 7, 2023
2 parents 07acf9d + 591dc78 commit fa193ad
Showing 73 changed files with 398 additions and 190 deletions.
24 changes: 13 additions & 11 deletions README.md
@@ -1,6 +1,6 @@
[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/NVIDIA/cccl?quickstart=1&devcontainer_path=.devcontainer%2Fdevcontainer.json)

|[Contributor Guide](https://github.com/NVIDIA/cccl/blob/main/CONTRIBUTING.md)|[Dev Containers](https://github.com/NVIDIA/cccl/blob/main/.devcontainer/README.md)|[Discord](https://discord.gg/nvidiadeveloper)|[Godbolt](https://godbolt.org/z/x4G73af9a)|[GitHub Project](https://github.com/orgs/NVIDIA/projects/6)|[libcudacxx Docs](https://nvidia.github.io/libcudacxx/)|[Thrust Docs](https://nvidia.github.io/thrust/)|[CUB Docs](https://nvlabs.github.io/cub/)|
|-|-|-|-|-|-|-|-|

# CUDA C++ Core Libraries (CCCL)
@@ -284,20 +284,22 @@ Anything not part of the public API may change at any time without warning.

#### API Versioning

The entirety of CCCL's public API across all components shares a common semantic version of `MAJOR.MINOR.PATCH`.
The public API of all of CCCL's components shares a unified semantic version of `MAJOR.MINOR.PATCH`.

Only the most recently released version is supported.
As a rule, features and bug fixes are not backported to previously released versions or branches.

For historical reasons, the library versions are encoded separately in each of Thrust/CUB/libcudacxx as follows:
The preferred method for querying the version is to use `CCCL_[MAJOR/MINOR/PATCH_]VERSION` as described below.
For backwards compatibility, the Thrust/CUB/libcudacxx version definitions are available and will always be consistent with `CCCL_VERSION`.
Note that Thrust/CUB use an `MMMmmmpp` scheme whereas CCCL and libcudacxx use `MMMmmmppp`.

| | libcudacxx | Thrust | CUB | Incremented when? |
|------------------------|-------------------------------------------|-------------------------------|----------------------------|--------------------------------------------------------------------|
| Header | `<cuda/std/version>` | `<thrust/version.h>` | `<cub/version.h>` | - |
| Major Version | `_LIBCUDACXX_CUDA_API_VERSION_MAJOR` | `THRUST_MAJOR_VERSION` | `CUB_MAJOR_VERSION` | Public API breaking changes (only at new CTK major release) |
| Minor Version | `_LIBCUDACXX_CUDA_API_VERSION_MINOR` | `THRUST_MINOR_VERSION` | `CUB_MINOR_VERSION` | Non-breaking feature additions |
| Patch/Subminor Version | `_LIBCUDACXX_CUDA_API_VERSION_PATCH` | `THRUST_SUBMINOR_VERSION` | `CUB_SUBMINOR_VERSION` | Minor changes not covered by major/minor versions |
| Concatenated Version | `_LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp)`| `THRUST_VERSION (MMMmmmpp)` | `CUB_VERSION (MMMmmmpp)` | - |
| | CCCL | libcudacxx | Thrust | CUB |
|------------------------|----------------------------------------|-------------------------------------------|------------------------------|---------------------------|
| Header | `<cuda/version>` | `<cuda/std/version>` | `<thrust/version.h>` | `<cub/version.h>` |
| Major Version | `CCCL_MAJOR_VERSION` | `_LIBCUDACXX_CUDA_API_VERSION_MAJOR` | `THRUST_MAJOR_VERSION` | `CUB_MAJOR_VERSION` |
| Minor Version | `CCCL_MINOR_VERSION` | `_LIBCUDACXX_CUDA_API_VERSION_MINOR` | `THRUST_MINOR_VERSION` | `CUB_MINOR_VERSION` |
| Patch/Subminor Version | `CCCL_PATCH_VERSION` | `_LIBCUDACXX_CUDA_API_VERSION_PATCH` | `THRUST_SUBMINOR_VERSION` | `CUB_SUBMINOR_VERSION` |
| Concatenated Version | `CCCL_VERSION (MMMmmmppp)` | `_LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp)`| `THRUST_VERSION (MMMmmmpp)` | `CUB_VERSION (MMMmmmpp)` |
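The two concatenated encodings in the last row can be sketched with small helpers (the function names here are illustrative, not part of any CCCL header):

```python
def encode_mmm_mmm_ppp(major, minor, patch):
    # CCCL and libcudacxx: MMMmmmppp, three digits each for minor and patch.
    return major * 1000000 + minor * 1000 + patch

def encode_mmm_mmm_pp(major, minor, patch):
    # Thrust and CUB: MMMmmmpp, only two digits for patch.
    return major * 100000 + minor * 100 + patch

# For a hypothetical release 2.2.1:
print(encode_mmm_mmm_ppp(2, 2, 1))  # 2002001
print(encode_mmm_mmm_pp(2, 2, 1))   # 200201
```

Note that a Thrust/CUB `MMMmmmpp` value is therefore not directly comparable to a CCCL `MMMmmmppp` value without rescaling.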

### Application Binary Interface (ABI)

@@ -376,7 +378,7 @@ For a detailed overview of the CI pipeline, see [ci-overview.md](ci-overview.md)
Projects that are related to CCCL's mission to make CUDA C++ more delightful:
- [cuCollections](https://github.com/NVIDIA/cuCollections) - GPU accelerated data structures like hash tables
- [NVBench](https://github.com/NVIDIA/nvbench) - Benchmarking library tailored for CUDA applications
- [stdexec](https://github.com/nvidia/stdexec) - Reference implementation for Senders asynchronous programming model
## Projects Using CCCL
8 changes: 8 additions & 0 deletions benchmarks/scripts/cccl/bench/bench.py
@@ -635,6 +635,14 @@ def do_run(self, ct_point, rt_values, timeout, is_search=True):
cmd.append("--min-samples")
cmd.append("70")

# Unlike noise, the minimal benchmarking time is not directly related to variance.
# The default minimal time is 0.5 seconds. For CI we reduce it to 0.1 seconds
# because we have a limited time budget. A smaller minimal time does not affect
# the stability of the sample distribution median in a deterministic way: for small
# problem sizes, 0.1s leads to smaller variation than 0.5s; for other workloads,
# 0.5s leads to smaller variance.
cmd.append("--min-time")
cmd.append("0.1")

# NVBench is currently broken for multiple GPUs, use `CUDA_VISIBLE_DEVICES`
cmd.append("-d")
cmd.append("0")
39 changes: 29 additions & 10 deletions benchmarks/scripts/cccl/bench/search.py
@@ -47,19 +47,38 @@ def parse_arguments():
type=str, help="Parameter in the format `Param=Value`.")
parser.add_argument(
'--list-benches', action=argparse.BooleanOptionalAction, help="Show available benchmarks.")
parser.add_argument('--num-shards', type=int, default=1, help='Split benchmarks into M pieces and only run one')
parser.add_argument('--run-shard', type=int, default=0, help='Run shard N / M of benchmarks')
parser.add_argument('-P0', action=argparse.BooleanOptionalAction, help="Run P0 benchmarks (overrides -R)")
return parser.parse_args()
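The `-P0` flag maps to the regex `'^(?!.*segmented).*(scan|reduce|select|sort).*'` used in `filter_benchmarks` below: a negative lookahead excludes segmented variants while the alternation selects the core algorithm families. A quick illustration with made-up benchmark names:

```python
import re

p0 = re.compile('^(?!.*segmented).*(scan|reduce|select|sort).*')

names = ['cub.bench.scan.exclusive.sum',
         'cub.bench.segmented_sort.keys',
         'cub.bench.histogram.even']

# Only non-segmented scan/reduce/select/sort benchmarks survive the filter.
selected = [n for n in names if p0.match(n)]
print(selected)  # ['cub.bench.scan.exclusive.sum']
```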


def run_benches(benchmarks, sub_space, regex, seeker):
pattern = re.compile(regex)
def run_benches(algnames, sub_space, seeker):
for algname in algnames:
bench = BaseBench(algname)
ct_space = bench.ct_workload_space(sub_space)
rt_values = bench.rt_axes_values(sub_space)
seeker(algname, ct_space, rt_values)

for algname in benchmarks:
if pattern.match(algname):
bench = BaseBench(algname)
ct_space = bench.ct_workload_space(sub_space)
rt_values = bench.rt_axes_values(sub_space)
seeker(algname, ct_space, rt_values)

def filter_benchmarks(benchmarks, args):
if args.run_shard >= args.num_shards:
raise ValueError('run-shard must be less than num-shards')

R = args.R
if args.P0:
R = '^(?!.*segmented).*(scan|reduce|select|sort).*'

pattern = re.compile(R)
algnames = list(filter(lambda x: pattern.match(x), benchmarks.keys()))
algnames.sort()

if args.num_shards > 1:
algnames = np.array_split(algnames, args.num_shards)[args.run_shard].tolist()
return algnames

return algnames
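The sharding above relies on `np.array_split`, which divides the sorted name list into nearly equal pieces (the first `len % num_shards` pieces receive one extra element). A pure-Python sketch of that assumed behavior, with illustrative benchmark names:

```python
def split_shards(names, num_shards):
    # Mimics numpy.array_split: the first (len % num_shards) shards
    # get one extra item, so shard sizes differ by at most one.
    q, r = divmod(len(names), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = q + (1 if i < r else 0)
        shards.append(names[start:start + size])
        start += size
    return shards

names = sorted(['cub.reduce', 'cub.scan', 'cub.sort',
                'thrust.scan', 'thrust.sort'])
shards = split_shards(names, 2)
# Every benchmark lands in exactly one shard, in order.
assert shards[0] + shards[1] == names
```

Because the input is sorted first, shard membership is deterministic across CI jobs that pass the same `--num-shards`.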


def search(seeker):
args = parse_arguments()
@@ -79,8 +98,8 @@ def search(seeker):
if args.list_benches:
list_benches()
return

run_benches(config.benchmarks, workload_sub_space, args.R, seeker)
run_benches(filter_benchmarks(config.benchmarks, args), workload_sub_space, seeker)


class MedianCenterEstimator:
22 changes: 20 additions & 2 deletions benchmarks/scripts/run.py
@@ -6,18 +6,36 @@
import cccl.bench


def elapsed_time_look_good(x):
def elapsed_time_looks_good(x):
if isinstance(x, float):
if math.isfinite(x):
return True
return False


def problem_size_looks_large_enough(elements):
# Small problem sizes do not utilize the entire GPU.
# Benchmarking small problem sizes in environments where we do not control
# the comparison of distributions (e.g. CI) is not useful because of stability issues.
return elements.isdigit() and int(elements) > 20


def filter_runtime_workloads_for_ci(rt_values):
for subbench in rt_values:
for axis in rt_values[subbench]:
if axis.startswith('Elements') and axis.endswith('[pow2]'):
rt_values[subbench][axis] = list(filter(problem_size_looks_large_enough, rt_values[subbench][axis]))

return rt_values
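A self-contained sketch of how this CI filter prunes the power-of-two `Elements` axes (the axis names and values below are illustrative; the values are pow2 exponents, so `> 20` keeps problem sizes above 2^20 elements):

```python
def looks_large_enough(elements):
    # Keep only pow2 exponents greater than 20 (i.e. more than 2^20 elements).
    return elements.isdigit() and int(elements) > 20

rt_values = {'base': {'Elements{io}[pow2]': ['16', '20', '24', '28'],
                      'Entropy': ['0.201', '1.000']}}

for subbench in rt_values:
    for axis in rt_values[subbench]:
        if axis.startswith('Elements') and axis.endswith('[pow2]'):
            rt_values[subbench][axis] = [v for v in rt_values[subbench][axis]
                                         if looks_large_enough(v)]

print(rt_values['base']['Elements{io}[pow2]'])  # ['24', '28']
```

Non-`Elements` axes (such as `Entropy`) pass through untouched.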


class BaseRunner:
def __init__(self):
self.estimator = cccl.bench.MedianCenterEstimator()

def __call__(self, algname, ct_workload_space, rt_values):
rt_values = filter_runtime_workloads_for_ci(rt_values)

for ct_workload in ct_workload_space:
bench = cccl.bench.BaseBench(algname)
if bench.build():
@@ -28,7 +46,7 @@ def __call__(self, algname, ct_workload_space, rt_values):
bench_name = bench_name.replace(' ', '___')
bench_name = "".join(c if c.isalnum() else "_" for c in bench_name)
elapsed_time = results[subbench][point]
if elapsed_time_look_good(elapsed_time):
if elapsed_time_looks_good(elapsed_time):
print("&&&& PERF {} {} -sec".format(bench_name, elapsed_time))
else:
print("&&&& FAILED bench")
101 changes: 101 additions & 0 deletions ci/update_version.sh
@@ -0,0 +1,101 @@
#!/bin/bash

# Usage: ./update_version.sh [--dry-run] <major> <minor> <patch>
# Example: ./update_version.sh --dry-run 2 2 1

# Run in root cccl/
cd "$(dirname "${BASH_SOURCE[0]}")/.." || exit

DRY_RUN=false

while [[ "$#" -gt 0 ]]; do
case "$1" in
--dry-run) DRY_RUN=true; ;;
*) break ;;
esac
shift
done

major="$1"
minor="$2"
patch="$3"

if [ -z "$major" ] || [ -z "$minor" ] || [ -z "$patch" ]; then
echo "Usage: $0 [--dry-run] <major> <minor> <patch>"
exit 1
fi

# Version file paths
CCCL_VERSION_FILE="libcudacxx/include/cuda/std/detail/libcxx/include/__cccl/version.h"
THRUST_VERSION_FILE="thrust/thrust/version.h"
CUB_VERSION_FILE="cub/cub/version.cuh"
CCCL_CMAKE_VERSION_FILE="lib/cmake/cccl/cccl-config-version.cmake"
CUB_CMAKE_VERSION_FILE="cub/cub/cmake/cub-config-version.cmake"
LIBCUDACXX_CMAKE_VERSION_FILE="libcudacxx/lib/cmake/libcudacxx/libcudacxx-config-version.cmake"
THRUST_CMAKE_VERSION_FILE="thrust/thrust/cmake/thrust-config-version.cmake"

# Calculated version codes
new_cccl_version=$((major * 1000000 + minor * 1000 + patch)) # MMMmmmppp
new_thrust_cub_version=$((major * 100000 + minor * 100 + patch)) # MMMmmmpp

# Fetch current version from file
current_cccl_version=$(grep -oP "define CCCL_VERSION \K[0-9]+" "$CCCL_VERSION_FILE")

# Fetch the latest tag from git and strip the 'v' prefix if present
latest_tag=$(git tag --sort=-v:refname | head -n 1 | sed 's/^v//')

# Since the tags and versions are numerically comparable, we cast them to integers
latest_tag_version=$(echo "$latest_tag" | awk -F. '{ printf("%d%03d%03d", $1,$2,$3) }')

echo "Running in $(pwd)"
echo "New MMMmmmppp version: $new_cccl_version"
echo "New MMMmmmpp version: $new_thrust_cub_version"
echo "Current CCCL version: $current_cccl_version"
echo "Latest git tag: $latest_tag"

# Check if the new version is less than the current version or the latest git tag
if (( new_cccl_version < current_cccl_version )) || (( new_cccl_version < latest_tag_version )); then
echo "Error: New version $new_cccl_version is less than current version $current_cccl_version or latest git tag version $latest_tag_version."
exit 1
fi

update_file () {
local file=$1
local pattern=$2
local new_value=$3
if [ "$DRY_RUN" = true ]; then
local temp_file=$(mktemp)
sed "s/$pattern/$new_value/g" "$file" > "$temp_file"
diff --color=auto -U 0 "$file" "$temp_file" || true
rm "$temp_file"
else
sed -i "s/$pattern/$new_value/" "$file"
fi
}

# Update version information in files
update_file "$CCCL_VERSION_FILE" "^#define CCCL_VERSION \([0-9]\+\)" "#define CCCL_VERSION $new_cccl_version"
update_file "$THRUST_VERSION_FILE" "^#define THRUST_VERSION \([0-9]\+\)" "#define THRUST_VERSION $new_thrust_cub_version"
update_file "$CUB_VERSION_FILE" "^#define CUB_VERSION \([0-9]\+\)" "#define CUB_VERSION $new_thrust_cub_version"

update_file "$CUB_CMAKE_VERSION_FILE" "set(CUB_VERSION_MAJOR \([0-9]\+\))" "set(CUB_VERSION_MAJOR $major)"
update_file "$CUB_CMAKE_VERSION_FILE" "set(CUB_VERSION_MINOR \([0-9]\+\))" "set(CUB_VERSION_MINOR $minor)"
update_file "$CUB_CMAKE_VERSION_FILE" "set(CUB_VERSION_PATCH \([0-9]\+\))" "set(CUB_VERSION_PATCH $patch)"

update_file "$LIBCUDACXX_CMAKE_VERSION_FILE" "set(libcudacxx_VERSION_MAJOR \([0-9]\+\))" "set(libcudacxx_VERSION_MAJOR $major)"
update_file "$LIBCUDACXX_CMAKE_VERSION_FILE" "set(libcudacxx_VERSION_MINOR \([0-9]\+\))" "set(libcudacxx_VERSION_MINOR $minor)"
update_file "$LIBCUDACXX_CMAKE_VERSION_FILE" "set(libcudacxx_VERSION_PATCH \([0-9]\+\))" "set(libcudacxx_VERSION_PATCH $patch)"

update_file "$THRUST_CMAKE_VERSION_FILE" "set(THRUST_VERSION_MAJOR \([0-9]\+\))" "set(THRUST_VERSION_MAJOR $major)"
update_file "$THRUST_CMAKE_VERSION_FILE" "set(THRUST_VERSION_MINOR \([0-9]\+\))" "set(THRUST_VERSION_MINOR $minor)"
update_file "$THRUST_CMAKE_VERSION_FILE" "set(THRUST_VERSION_PATCH \([0-9]\+\))" "set(THRUST_VERSION_PATCH $patch)"

update_file "$CCCL_CMAKE_VERSION_FILE" "set(CCCL_VERSION_MAJOR \([0-9]\+\))" "set(CCCL_VERSION_MAJOR $major)"
update_file "$CCCL_CMAKE_VERSION_FILE" "set(CCCL_VERSION_MINOR \([0-9]\+\))" "set(CCCL_VERSION_MINOR $minor)"
update_file "$CCCL_CMAKE_VERSION_FILE" "set(CCCL_VERSION_PATCH \([0-9]\+\))" "set(CCCL_VERSION_PATCH $patch)"

if [ "$DRY_RUN" = true ]; then
echo "Dry run completed. No changes made."
else
echo "Version updated to $major.$minor.$patch"
fi
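The tag-to-integer conversion done by the `awk` one-liner in the script above can be mirrored in Python for clarity (a sketch, not part of the script):

```python
def tag_to_int(tag):
    # 'v2.2.1' or '2.2.1' -> 2002001 (MMMmmmppp), mirroring awk's
    # printf("%d%03d%03d", major, minor, patch).
    major, minor, patch = (int(x) for x in tag.lstrip('v').split('.'))
    return major * 1000000 + minor * 1000 + patch

print(tag_to_int('v2.2.1'))  # 2002001
print(tag_to_int('2.3.0') > tag_to_int('v2.2.9'))  # True
```

Encoding both sides as `MMMmmmppp` integers is what makes the script's `(( new_cccl_version < latest_tag_version ))` comparison valid.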
2 changes: 1 addition & 1 deletion cub/benchmarks/bench/adjacent_difference/subtract_left.cu
@@ -100,7 +100,7 @@ void left(nvbench::state& state, nvbench::type_list<T, OffsetT>)
thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
std::uint8_t* d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::Dispatch(d_temp_storage,
temp_storage_bytes,
d_in,
9 changes: 4 additions & 5 deletions cub/benchmarks/bench/copy/memcpy.cu
@@ -249,7 +249,7 @@ void copy(nvbench::state &state,
thrust::device_vector<nvbench::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

state.exec(nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch | nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
dispatch_t::Dispatch(d_temp_storage,
temp_storage_bytes,
d_input_buffers,
@@ -275,8 +275,8 @@ void uniform(nvbench::state &state, nvbench::type_list<T, OffsetT> tl)
elements,
min_buffer_size,
max_buffer_size,
state.get_int64("RandomizeInput"),
state.get_int64("RandomizeOutput"));
state.get_int64("Randomize"),
state.get_int64("Randomize"));
}

template <class T, class OffsetT>
@@ -309,8 +309,7 @@ NVBENCH_BENCH_TYPES(uniform, NVBENCH_TYPE_AXES(types, u_offset_types))
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(25, 29, 2))
.add_int64_axis("MinBufferSizeRatio", {1, 99})
.add_int64_axis("MaxBufferSize", {8, 64, 256, 1024, 64 * 1024})
.add_int64_axis("RandomizeInput", {0, 1})
.add_int64_axis("RandomizeOutput", {0, 1});
.add_int64_axis("Randomize", {0, 1});

NVBENCH_BENCH_TYPES(large, NVBENCH_TYPE_AXES(types, u_offset_types))
.set_name("large")
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/even.cu
@@ -110,7 +110,7 @@ static void even(nvbench::state &state, nvbench::type_list<SampleT, CounterT, Of
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchEven(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -139,5 +139,5 @@ NVBENCH_BENCH_TYPES(even, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offset
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/multi/even.cu
@@ -121,7 +121,7 @@ static void even(nvbench::state &state, nvbench::type_list<SampleT, CounterT, Of
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchEven(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -150,5 +150,5 @@ NVBENCH_BENCH_TYPES(even, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offset
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/multi/range.cu
@@ -129,7 +129,7 @@ static void range(nvbench::state &state, nvbench::type_list<SampleT, CounterT, O
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchRange(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -157,5 +157,5 @@ NVBENCH_BENCH_TYPES(range, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offse
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/range.cu
@@ -116,7 +116,7 @@ static void range(nvbench::state &state, nvbench::type_list<SampleT, CounterT, O
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchRange(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -144,5 +144,5 @@ NVBENCH_BENCH_TYPES(range, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offse
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});