Merge remote-tracking branch 'upstream/main' into fea/vshmem
elstehle committed Nov 7, 2023
2 parents 07acf9d + 591dc78 commit fa193ad
Showing 73 changed files with 398 additions and 190 deletions.
24 changes: 13 additions & 11 deletions README.md
@@ -1,6 +1,6 @@
[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/NVIDIA/cccl?quickstart=1&devcontainer_path=.devcontainer%2Fdevcontainer.json)

|[Contributor Guide](https://github.com/NVIDIA/cccl/blob/main/CONTRIBUTING.md)|[Dev Containers](https://github.com/NVIDIA/cccl/blob/main/.devcontainer/README.md)|[Discord](https://discord.gg/nvidiadeveloper)|[Godbolt](https://godbolt.org/z/x4G73af9a)|[GitHub Project](https://github.com/orgs/NVIDIA/projects/6)|[libcudacxx Docs](https://nvidia.github.io/libcudacxx/)|[Thrust Docs](https://nvidia.github.io/thrust/)|[CUB Docs](https://nvlabs.github.io/cub/)|
|-|-|-|-|-|-|-|-|

# CUDA C++ Core Libraries (CCCL)
@@ -284,20 +284,22 @@ Anything not part of the public API may change at any time without warning.

#### API Versioning

The entirety of CCCL's public API across all components shares a common semantic version of `MAJOR.MINOR.PATCH`.
The public API of all of CCCL's components shares a unified semantic version of `MAJOR.MINOR.PATCH`.

Only the most recently released version is supported.
As a rule, features and bug fixes are not backported to previously released versions or branches.

For historical reasons, the library versions are encoded separately in each of Thrust/CUB/libcudacxx as follows:
The preferred method for querying the version is to use `CCCL_[MAJOR/MINOR/PATCH_]VERSION` as described below.
For backwards compatibility, the Thrust/CUB/libcudacxx version definitions are available and will always be consistent with `CCCL_VERSION`.
Note that Thrust/CUB use an `MMMmmmpp` scheme whereas CCCL and libcudacxx use `MMMmmmppp`.

| | libcudacxx | Thrust | CUB | Incremented when? |
|------------------------|-------------------------------------------|-------------------------------|----------------------------|--------------------------------------------------------------------|
| Header | `<cuda/std/version>` | `<thrust/version.h>` | `<cub/version.h>` | - |
| Major Version | `_LIBCUDACXX_CUDA_API_VERSION_MAJOR` | `THRUST_MAJOR_VERSION` | `CUB_MAJOR_VERSION` | Public API breaking changes (only at new CTK major release) |
| Minor Version | `_LIBCUDACXX_CUDA_API_VERSION_MINOR` | `THRUST_MINOR_VERSION` | `CUB_MINOR_VERSION` | Non-breaking feature additions |
| Patch/Subminor Version | `_LIBCUDACXX_CUDA_API_VERSION_PATCH` | `THRUST_SUBMINOR_VERSION` | `CUB_SUBMINOR_VERSION` | Minor changes not covered by major/minor versions |
| Concatenated Version | `_LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp)`| `THRUST_VERSION (MMMmmmpp)` | `CUB_VERSION (MMMmmmpp)` | - |
| | CCCL | libcudacxx | Thrust | CUB |
|------------------------|----------------------------------------|-------------------------------------------|------------------------------|---------------------------|
| Header | `<cuda/version>` | `<cuda/std/version>` | `<thrust/version.h>` | `<cub/version.h>` |
| Major Version | `CCCL_MAJOR_VERSION` | `_LIBCUDACXX_CUDA_API_VERSION_MAJOR` | `THRUST_MAJOR_VERSION` | `CUB_MAJOR_VERSION` |
| Minor Version | `CCCL_MINOR_VERSION` | `_LIBCUDACXX_CUDA_API_VERSION_MINOR` | `THRUST_MINOR_VERSION` | `CUB_MINOR_VERSION` |
| Patch/Subminor Version | `CCCL_PATCH_VERSION` | `_LIBCUDACXX_CUDA_API_VERSION_PATCH` | `THRUST_SUBMINOR_VERSION` | `CUB_SUBMINOR_VERSION` |
| Concatenated Version | `CCCL_VERSION (MMMmmmppp)` | `_LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp)`| `THRUST_VERSION (MMMmmmpp)` | `CUB_VERSION (MMMmmmpp)` |
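The two concatenated encodings in the last row can be sketched with small helpers (the function names here are illustrative, not part of any CCCL header):

```python
def encode_mmm_mmm_ppp(major, minor, patch):
    # CCCL and libcudacxx: MMMmmmppp, three digits each for minor and patch.
    return major * 1000000 + minor * 1000 + patch

def encode_mmm_mmm_pp(major, minor, patch):
    # Thrust and CUB: MMMmmmpp, only two digits for patch.
    return major * 100000 + minor * 100 + patch

# For a hypothetical release 2.2.1:
print(encode_mmm_mmm_ppp(2, 2, 1))  # 2002001
print(encode_mmm_mmm_pp(2, 2, 1))   # 200201
```

Note that a Thrust/CUB `MMMmmmpp` value is therefore not directly comparable to a CCCL `MMMmmmppp` value without rescaling.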

### Application Binary Interface (ABI)

@@ -376,7 +378,7 @@ For a detailed overview of the CI pipeline, see [ci-overview.md](ci-overview.md)
Projects that are related to CCCL's mission to make CUDA C++ more delightful:
- [cuCollections](https://github.com/NVIDIA/cuCollections) - GPU accelerated data structures like hash tables
- [NVBench](https://github.com/NVIDIA/nvbench) - Benchmarking library tailored for CUDA applications
- [stdexec](https://github.com/nvidia/stdexec) - Reference implementation for Senders asynchronous programming model
## Projects Using CCCL
8 changes: 8 additions & 0 deletions benchmarks/scripts/cccl/bench/bench.py
@@ -635,6 +635,14 @@ def do_run(self, ct_point, rt_values, timeout, is_search=True):
cmd.append("--min-samples")
cmd.append("70")

# Unlike noise, the minimal benchmarking time is not directly related to variance.
# The default minimal time is 0.5 seconds. For CI we reduce it to 0.1 seconds
# because we have a limited time budget. A smaller minimal time does not affect
# the stability of the sample distribution median in a deterministic way: for small
# problem sizes, 0.1s leads to smaller variation than 0.5s; for other workloads,
# 0.5s leads to smaller variance.
cmd.append("--min-time")
cmd.append("0.1")

# NVBench is currently broken for multiple GPUs, use `CUDA_VISIBLE_DEVICES`
cmd.append("-d")
cmd.append("0")
39 changes: 29 additions & 10 deletions benchmarks/scripts/cccl/bench/search.py
@@ -47,19 +47,38 @@ def parse_arguments():
type=str, help="Parameter in the format `Param=Value`.")
parser.add_argument(
'--list-benches', action=argparse.BooleanOptionalAction, help="Show available benchmarks.")
parser.add_argument('--num-shards', type=int, default=1, help='Split benchmarks into M pieces and only run one')
parser.add_argument('--run-shard', type=int, default=0, help='Run shard N / M of benchmarks')
parser.add_argument('-P0', action=argparse.BooleanOptionalAction, help="Run P0 benchmarks (overrides -R)")
return parser.parse_args()
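The `-P0` flag maps to the regex `'^(?!.*segmented).*(scan|reduce|select|sort).*'` used in `filter_benchmarks` below: a negative lookahead excludes segmented variants while the alternation selects the core algorithm families. A quick illustration with made-up benchmark names:

```python
import re

p0 = re.compile('^(?!.*segmented).*(scan|reduce|select|sort).*')

names = ['cub.bench.scan.exclusive.sum',
         'cub.bench.segmented_sort.keys',
         'cub.bench.histogram.even']

# Only non-segmented scan/reduce/select/sort benchmarks survive the filter.
selected = [n for n in names if p0.match(n)]
print(selected)  # ['cub.bench.scan.exclusive.sum']
```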


def run_benches(benchmarks, sub_space, regex, seeker):
pattern = re.compile(regex)
def run_benches(algnames, sub_space, seeker):
for algname in algnames:
bench = BaseBench(algname)
ct_space = bench.ct_workload_space(sub_space)
rt_values = bench.rt_axes_values(sub_space)
seeker(algname, ct_space, rt_values)

for algname in benchmarks:
if pattern.match(algname):
bench = BaseBench(algname)
ct_space = bench.ct_workload_space(sub_space)
rt_values = bench.rt_axes_values(sub_space)
seeker(algname, ct_space, rt_values)

def filter_benchmarks(benchmarks, args):
if args.run_shard >= args.num_shards:
raise ValueError('run-shard must be less than num-shards')

R = args.R
if args.P0:
R = '^(?!.*segmented).*(scan|reduce|select|sort).*'

pattern = re.compile(R)
algnames = list(filter(lambda x: pattern.match(x), benchmarks.keys()))
algnames.sort()

if args.num_shards > 1:
algnames = np.array_split(algnames, args.num_shards)[args.run_shard].tolist()
return algnames

return algnames
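The sharding above relies on `np.array_split`, which divides the sorted name list into nearly equal pieces (the first `len % num_shards` pieces receive one extra element). A pure-Python sketch of that assumed behavior, with illustrative benchmark names:

```python
def split_shards(names, num_shards):
    # Mimics numpy.array_split: the first (len % num_shards) shards
    # get one extra item, so shard sizes differ by at most one.
    q, r = divmod(len(names), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = q + (1 if i < r else 0)
        shards.append(names[start:start + size])
        start += size
    return shards

names = sorted(['cub.reduce', 'cub.scan', 'cub.sort',
                'thrust.scan', 'thrust.sort'])
shards = split_shards(names, 2)
# Every benchmark lands in exactly one shard, in order.
assert shards[0] + shards[1] == names
```

Because the input is sorted first, shard membership is deterministic across CI jobs that pass the same `--num-shards`.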


def search(seeker):
args = parse_arguments()
@@ -79,8 +98,8 @@ def search(seeker):
if args.list_benches:
list_benches()
return

run_benches(config.benchmarks, workload_sub_space, args.R, seeker)
run_benches(filter_benchmarks(config.benchmarks, args), workload_sub_space, seeker)


class MedianCenterEstimator:
22 changes: 20 additions & 2 deletions benchmarks/scripts/run.py
@@ -6,18 +6,36 @@
import cccl.bench


def elapsed_time_look_good(x):
def elapsed_time_looks_good(x):
if isinstance(x, float):
if math.isfinite(x):
return True
return False


def problem_size_looks_large_enough(elements):
# Small problem sizes do not utilize the entire GPU.
# Benchmarking small problem sizes in environments where we do not control
# the comparison of distributions (e.g. CI) is not useful because of stability issues.
return elements.isdigit() and int(elements) > 20


def filter_runtime_workloads_for_ci(rt_values):
for subbench in rt_values:
for axis in rt_values[subbench]:
if axis.startswith('Elements') and axis.endswith('[pow2]'):
rt_values[subbench][axis] = list(filter(problem_size_looks_large_enough, rt_values[subbench][axis]))

return rt_values
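A self-contained sketch of how this CI filter prunes the power-of-two `Elements` axes (the axis names and values below are illustrative; the values are pow2 exponents, so `> 20` keeps problem sizes above 2^20 elements):

```python
def looks_large_enough(elements):
    # Keep only pow2 exponents greater than 20 (i.e. more than 2^20 elements).
    return elements.isdigit() and int(elements) > 20

rt_values = {'base': {'Elements{io}[pow2]': ['16', '20', '24', '28'],
                      'Entropy': ['0.201', '1.000']}}

for subbench in rt_values:
    for axis in rt_values[subbench]:
        if axis.startswith('Elements') and axis.endswith('[pow2]'):
            rt_values[subbench][axis] = [v for v in rt_values[subbench][axis]
                                         if looks_large_enough(v)]

print(rt_values['base']['Elements{io}[pow2]'])  # ['24', '28']
```

Non-`Elements` axes (such as `Entropy`) pass through untouched.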


class BaseRunner:
def __init__(self):
self.estimator = cccl.bench.MedianCenterEstimator()

def __call__(self, algname, ct_workload_space, rt_values):
rt_values = filter_runtime_workloads_for_ci(rt_values)

for ct_workload in ct_workload_space:
bench = cccl.bench.BaseBench(algname)
if bench.build():
@@ -28,7 +46,7 @@ def __call__(self, algname, ct_workload_space, rt_values):
bench_name = bench_name.replace(' ', '___')
bench_name = "".join(c if c.isalnum() else "_" for c in bench_name)
elapsed_time = results[subbench][point]
if elapsed_time_look_good(elapsed_time):
if elapsed_time_looks_good(elapsed_time):
print("&&&& PERF {} {} -sec".format(bench_name, elapsed_time))
else:
print("&&&& FAILED bench")
101 changes: 101 additions & 0 deletions ci/update_version.sh
@@ -0,0 +1,101 @@
#!/bin/bash

# Usage: ./update_version.sh [--dry-run] <major> <minor> <patch>
# Example: ./update_version.sh --dry-run 2 2 1

# Run in root cccl/
cd "$(dirname "${BASH_SOURCE[0]}")/.." || exit

DRY_RUN=false

while [[ "$#" -gt 0 ]]; do
case "$1" in
--dry-run) DRY_RUN=true; ;;
*) break ;;
esac
shift
done

major="$1"
minor="$2"
patch="$3"

if [ -z "$major" ] || [ -z "$minor" ] || [ -z "$patch" ]; then
echo "Usage: $0 [--dry-run] <major> <minor> <patch>"
exit 1
fi

# Version file paths
CCCL_VERSION_FILE="libcudacxx/include/cuda/std/detail/libcxx/include/__cccl/version.h"
THRUST_VERSION_FILE="thrust/thrust/version.h"
CUB_VERSION_FILE="cub/cub/version.cuh"
CCCL_CMAKE_VERSION_FILE="lib/cmake/cccl/cccl-config-version.cmake"
CUB_CMAKE_VERSION_FILE="cub/cub/cmake/cub-config-version.cmake"
LIBCUDACXX_CMAKE_VERSION_FILE="libcudacxx/lib/cmake/libcudacxx/libcudacxx-config-version.cmake"
THRUST_CMAKE_VERSION_FILE="thrust/thrust/cmake/thrust-config-version.cmake"

# Calculated version codes
new_cccl_version=$((major * 1000000 + minor * 1000 + patch)) # MMMmmmppp
new_thrust_cub_version=$((major * 100000 + minor * 100 + patch)) # MMMmmmpp

# Fetch current version from file
current_cccl_version=$(grep -oP "define CCCL_VERSION \K[0-9]+" "$CCCL_VERSION_FILE")

# Fetch the latest tag from git and strip the 'v' prefix if present
latest_tag=$(git tag --sort=-v:refname | head -n 1 | sed 's/^v//')

# Since the tags and versions are numerically comparable, we cast them to integers
latest_tag_version=$(echo "$latest_tag" | awk -F. '{ printf("%d%03d%03d", $1,$2,$3) }')

echo "Running in $(pwd)"
echo "New MMMmmmppp version: $new_cccl_version"
echo "New MMMmmmpp version: $new_thrust_cub_version"
echo "Current CCCL version: $current_cccl_version"
echo "Latest git tag: $latest_tag"

# Check if the new version is less than the current version or the latest git tag
if (( new_cccl_version < current_cccl_version )) || (( new_cccl_version < latest_tag_version )); then
echo "Error: New version $new_cccl_version is less than current version $current_cccl_version or latest git tag version $latest_tag_version."
exit 1
fi

update_file () {
local file=$1
local pattern=$2
local new_value=$3
if [ "$DRY_RUN" = true ]; then
local temp_file=$(mktemp)
sed "s/$pattern/$new_value/g" "$file" > "$temp_file"
diff --color=auto -U 0 "$file" "$temp_file" || true
rm "$temp_file"
else
sed -i "s/$pattern/$new_value/" "$file"
fi
}

# Update version information in files
update_file "$CCCL_VERSION_FILE" "^#define CCCL_VERSION \([0-9]\+\)" "#define CCCL_VERSION $new_cccl_version"
update_file "$THRUST_VERSION_FILE" "^#define THRUST_VERSION \([0-9]\+\)" "#define THRUST_VERSION $new_thrust_cub_version"
update_file "$CUB_VERSION_FILE" "^#define CUB_VERSION \([0-9]\+\)" "#define CUB_VERSION $new_thrust_cub_version"

update_file "$CUB_CMAKE_VERSION_FILE" "set(CUB_VERSION_MAJOR \([0-9]\+\))" "set(CUB_VERSION_MAJOR $major)"
update_file "$CUB_CMAKE_VERSION_FILE" "set(CUB_VERSION_MINOR \([0-9]\+\))" "set(CUB_VERSION_MINOR $minor)"
update_file "$CUB_CMAKE_VERSION_FILE" "set(CUB_VERSION_PATCH \([0-9]\+\))" "set(CUB_VERSION_PATCH $patch)"

update_file "$LIBCUDACXX_CMAKE_VERSION_FILE" "set(libcudacxx_VERSION_MAJOR \([0-9]\+\))" "set(libcudacxx_VERSION_MAJOR $major)"
update_file "$LIBCUDACXX_CMAKE_VERSION_FILE" "set(libcudacxx_VERSION_MINOR \([0-9]\+\))" "set(libcudacxx_VERSION_MINOR $minor)"
update_file "$LIBCUDACXX_CMAKE_VERSION_FILE" "set(libcudacxx_VERSION_PATCH \([0-9]\+\))" "set(libcudacxx_VERSION_PATCH $patch)"

update_file "$THRUST_CMAKE_VERSION_FILE" "set(THRUST_VERSION_MAJOR \([0-9]\+\))" "set(THRUST_VERSION_MAJOR $major)"
update_file "$THRUST_CMAKE_VERSION_FILE" "set(THRUST_VERSION_MINOR \([0-9]\+\))" "set(THRUST_VERSION_MINOR $minor)"
update_file "$THRUST_CMAKE_VERSION_FILE" "set(THRUST_VERSION_PATCH \([0-9]\+\))" "set(THRUST_VERSION_PATCH $patch)"

update_file "$CCCL_CMAKE_VERSION_FILE" "set(CCCL_VERSION_MAJOR \([0-9]\+\))" "set(CCCL_VERSION_MAJOR $major)"
update_file "$CCCL_CMAKE_VERSION_FILE" "set(CCCL_VERSION_MINOR \([0-9]\+\))" "set(CCCL_VERSION_MINOR $minor)"
update_file "$CCCL_CMAKE_VERSION_FILE" "set(CCCL_VERSION_PATCH \([0-9]\+\))" "set(CCCL_VERSION_PATCH $patch)"

if [ "$DRY_RUN" = true ]; then
echo "Dry run completed. No changes made."
else
echo "Version updated to $major.$minor.$patch"
fi
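The tag-to-integer conversion done by the `awk` one-liner in the script above can be mirrored in Python for clarity (a sketch, not part of the script):

```python
def tag_to_int(tag):
    # 'v2.2.1' or '2.2.1' -> 2002001 (MMMmmmppp), mirroring awk's
    # printf("%d%03d%03d", major, minor, patch).
    major, minor, patch = (int(x) for x in tag.lstrip('v').split('.'))
    return major * 1000000 + minor * 1000 + patch

print(tag_to_int('v2.2.1'))  # 2002001
print(tag_to_int('2.3.0') > tag_to_int('v2.2.9'))  # True
```

Encoding both sides as `MMMmmmppp` integers is what makes the script's `(( new_cccl_version < latest_tag_version ))` comparison valid.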
2 changes: 1 addition & 1 deletion cub/benchmarks/bench/adjacent_difference/subtract_left.cu
@@ -100,7 +100,7 @@ void left(nvbench::state& state, nvbench::type_list<T, OffsetT>)
thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
std::uint8_t* d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::Dispatch(d_temp_storage,
temp_storage_bytes,
d_in,
9 changes: 4 additions & 5 deletions cub/benchmarks/bench/copy/memcpy.cu
@@ -249,7 +249,7 @@ void copy(nvbench::state &state,
thrust::device_vector<nvbench::uint8_t> temp_storage(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

state.exec(nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch | nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
dispatch_t::Dispatch(d_temp_storage,
temp_storage_bytes,
d_input_buffers,
@@ -275,8 +275,8 @@ void uniform(nvbench::state &state, nvbench::type_list<T, OffsetT> tl)
elements,
min_buffer_size,
max_buffer_size,
state.get_int64("RandomizeInput"),
state.get_int64("RandomizeOutput"));
state.get_int64("Randomize"),
state.get_int64("Randomize"));
}

template <class T, class OffsetT>
@@ -309,8 +309,7 @@ NVBENCH_BENCH_TYPES(uniform, NVBENCH_TYPE_AXES(types, u_offset_types))
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(25, 29, 2))
.add_int64_axis("MinBufferSizeRatio", {1, 99})
.add_int64_axis("MaxBufferSize", {8, 64, 256, 1024, 64 * 1024})
.add_int64_axis("RandomizeInput", {0, 1})
.add_int64_axis("RandomizeOutput", {0, 1});
.add_int64_axis("Randomize", {0, 1});

NVBENCH_BENCH_TYPES(large, NVBENCH_TYPE_AXES(types, u_offset_types))
.set_name("large")
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/even.cu
@@ -110,7 +110,7 @@ static void even(nvbench::state &state, nvbench::type_list<SampleT, CounterT, Of
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchEven(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -139,5 +139,5 @@ NVBENCH_BENCH_TYPES(even, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offset
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/multi/even.cu
@@ -121,7 +121,7 @@ static void even(nvbench::state &state, nvbench::type_list<SampleT, CounterT, Of
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchEven(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -150,5 +150,5 @@ NVBENCH_BENCH_TYPES(even, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offset
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/multi/range.cu
@@ -129,7 +129,7 @@ static void range(nvbench::state &state, nvbench::type_list<SampleT, CounterT, O
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchRange(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -157,5 +157,5 @@ NVBENCH_BENCH_TYPES(range, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offse
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});
6 changes: 3 additions & 3 deletions cub/benchmarks/bench/histogram/range.cu
@@ -116,7 +116,7 @@ static void range(nvbench::state &state, nvbench::type_list<SampleT, CounterT, O
thrust::device_vector<nvbench::uint8_t> tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(tmp.data());

state.exec([&](nvbench::launch &launch) {
state.exec(nvbench::exec_tag::no_batch, [&](nvbench::launch &launch) {
dispatch_t::DispatchRange(d_temp_storage,
temp_storage_bytes,
d_input,
@@ -144,5 +144,5 @@ NVBENCH_BENCH_TYPES(range, NVBENCH_TYPE_AXES(sample_types, bin_types, some_offse
.set_name("base")
.set_type_axes_names({"SampleT{ct}", "BinT{ct}", "OffsetT{ct}"})
.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
.add_int64_axis("Bins", {32, 64, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "0.544", "1.000"});
.add_int64_axis("Bins", {32, 128, 2048, 2097152})
.add_string_axis("Entropy", {"0.201", "1.000"});