diff --git a/CHANGELOG.md b/CHANGELOG.md index e73cffb0a..95980e30a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -18,7 +18,9 @@ Versioning](http://semver.org/spec/v2.0.0.html). - Experimental support for the AdaptiveCpp generic single-pass compiler (#294) - Constructor overloads to the `access::neighborhood` range mapper for reads in 3/5/7-point stencil codes (#292) - The SYCL backend now uses per-device submission threads to dispatch commands for better performance. - This new behaviour is enabled by default, and can be disabled via "CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS" (#303) + This new behaviour is enabled by default, and can be disabled via `CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS` (#303) +- Celerity now provides a mechanism for pinning its threads to CPU cores. + This behaviour can be configured via the `CELERITY_THREAD_PINNING` environment variable (#309) ### Changed @@ -30,6 +32,7 @@ Versioning](http://semver.org/spec/v2.0.0.html). This operation has a much more pronounced performance penalty than its SYCL counterpart (#283) - On systems that do not support device-to-device copies, data is now staged in linearized buffers for better performance (#287) - The `access::neighborhood` built-in range mapper now receives a `range` instead of a coordinate list (#292) +- Overhauled the [installation](docs/installation.md) and [configuration](docs/configuration.md) documentation (#309) ### Fixed diff --git a/README.md b/README.md index 6a2624f44..7350a5958 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,8 @@ # Celerity Runtime — [![CI Workflow](https://github.com/celerity/celerity-runtime/actions/workflows/celerity_ci.yml/badge.svg)](https://github.com/celerity/celerity-runtime/actions/workflows/celerity_ci.yml) [![Coverage Status](https://coveralls.io/repos/github/celerity/celerity-runtime/badge.svg?branch=master)](https://coveralls.io/github/celerity/celerity-runtime?branch=master) [![MIT
License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/celerity/celerity-runtime/blob/master/LICENSE) [![Semver 2.0](https://img.shields.io/badge/semver-2.0.0-blue)](https://semver.org/spec/v2.0.0.html) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/celerity/celerity-runtime/blob/master/CONTRIBUTING.md) The Celerity distributed runtime and API aims to bring the power and ease of -use of [SYCL](https://sycl.tech) to distributed memory clusters. +use of [SYCL](https://sycl.tech) to multi-GPU systems and distributed memory +clusters, with transparent scaling. > If you want a step-by-step introduction on how to set up dependencies and > implement your first Celerity application, check out the @@ -47,7 +48,7 @@ queue.submit([&](celerity::handler& cgh) { 2. Submit a kernel to be executed by 1024 parallel _work items_. This kernel may be split across any number of nodes. -3. Kernels can be expressed as C++11 lambda functions, just like in SYCL. In +3. Kernels can be expressed as C++ lambda functions, just like in SYCL. In fact, no changes to your existing kernels are required. 4. Access your buffers as if they reside on a single device -- even though @@ -55,17 +56,20 @@ queue.submit([&](celerity::handler& cgh) { ### Run it like any other MPI application -The kernel shown above can be run on a single GPU, just like in SYCL, or on a -whole cluster -- without having to change anything about the program itself. +The kernel shown above can be run on a single GPU, just like in SYCL, on +multiple GPUs attached to a single shared memory system, or on a whole cluster + -- without having to change anything about the program itself. -For example, if we were to run it on two GPUs using `mpirun -n 2 ./my_example`, -the first GPU might compute the range `0-512` of the kernel, while the second -one computes `512-1024`.
However, as the user, you don't have to care how +On a cluster with two nodes, each equipped with a single GPU, `mpirun -n 2 ./my_example` +might have the first GPU compute the range `0-512` of the kernel, while the +second one computes `512-1024`. However, as the user, you don't have to care how exactly your computation is being split up. +By default, a Celerity application will use all the attached GPUs, e.g. running +`./my_example` on a machine with 4 GPUs will automatically use all of them. + To see how you can use the result of your computation, look at some of our -fully-fledged [examples](examples), or follow the -[tutorial](docs/tutorial.md)! +fully-fledged [examples](examples), or follow the [tutorial](docs/tutorial.md)! ## Building Celerity @@ -79,9 +83,9 @@ installed first. - [AdaptiveCpp](https://github.com/AdaptiveCpp/AdaptiveCpp), - [DPC++](https://github.com/intel/llvm), or - [SimSYCL](https://github.com/celerity/SimSYCL) -- A MPI 2 implementation (tested with OpenMPI 4.0, MPICH 3.3 should work as well) - [CMake](https://www.cmake.org) (3.13 or newer) - A C++20 compiler +- [*optional*] An MPI 2 implementation (tested with OpenMPI 4.0, MPICH 3.3 should work as well) See the [platform support guide](docs/platform-support.md) on which library and OS versions are supported and automatically tested. @@ -107,25 +111,13 @@ function to set up the required dependencies for a target (no need to link manua ## Running a Celerity Application -Celerity is built on top of MPI, which means a Celerity application can be -executed like any other MPI application (i.e., using `mpirun` or equivalent). -There are several environment variables that you can use to influence -Celerity's runtime behavior: - -### Environment Variables - -- `CELERITY_LOG_LEVEL` controls the logging output level. One of `trace`, `debug`, - `info`, `warn`, `err`, `critical`, or `off`. -- `CELERITY_PROFILE_KERNEL` controls whether SYCL queue profiling information should be queried.
-- `CELERITY_PRINT_GRAPHS` controls whether task and command graphs are logged - at the end of execution (requires log level `info` or higher). -- `CELERITY_DRY_RUN_NODES` takes a number and simulates a run with that many nodes - without actually executing the commands. -- `CELERITY_HORIZON_STEP` and `CELERITY_HORIZON_MAX_PARALLELISM` determine the - maximum number of sequential and parallel tasks, respectively, before a new - [horizon task](https://doi.org/10.1007/s42979-024-02749-w) is introduced. -- `CELERITY_TRACY` controls the Tracy profiler integration. Set to `off` to disable, - `fast` for light integration with little runtime overhead, and `full` for - integration with extensive performance debug information included in the trace. - Only available if integration was enabled enabled at build time through the - CMake option `-DCELERITY_TRACY_SUPPORT=ON`. +For the single-node case, you can simply run your application and it will +automatically use all available GPUs -- a simple way to limit this, e.g. +for benchmarking, is to use vendor-specific environment variables such +as `CUDA_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES` or `ONEAPI_DEVICE_SELECTOR`. + +In the distributed memory cluster case, since Celerity is built on top of MPI, a Celerity +application can be executed like any other MPI application (i.e., using `mpirun` or equivalent). + +There are also [several environment variables](docs/configuration.md) that you can use to influence +Celerity's runtime behavior. diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 000000000..8ba9a031e --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,42 @@ +--- +id: configuration +title: Configuration +sidebar_label: Configuration +--- + +After successfully [installing](installation.md) Celerity, you can tune its runtime behaviour via a number of environment variables. This page lists all available options.
+ +Note that some of these runtime options require a [corresponding CMake option](installation.md#additional-configuration-options) to be enabled during the build process. +This is generally the case if the option has a negative impact on performance, or if it is not required in most use cases. + +Celerity uses [libenvpp](https://github.com/ph3at/libenvpp) for environment variable handling, +which means that typos in environment variable names or invalid values will be detected and reported. + +## Environment Variables for Debugging and Profiling + +The following environment variables can be used to control Celerity's runtime behaviour, +specifically in development, debugging, and profiling scenarios: + +| Option | Values | Description | +| --- | --- | --- | +| `CELERITY_LOG_LEVEL` | `trace`, `debug`, `info`, `warn`, `err`, `critical`, `off` | Controls the logging output level. | +| `CELERITY_PROFILE_KERNEL` | `on`, `off` | Controls whether SYCL queue profiling information should be queried. | +| `CELERITY_PRINT_GRAPHS` | `on`, `off` | Controls whether task and command graphs are logged at the end of execution (requires log level `info` or higher). | +| `CELERITY_DRY_RUN_NODES` | *number* | Simulates a run with the given number of nodes without actually executing the commands. | +| `CELERITY_TRACY` | `off`, `fast`, `full` | Controls the Tracy profiler integration. Set to `off` to disable, `fast` for light integration with little runtime overhead, and `full` for integration with extensive performance debug information included in the trace. Only available if integration was enabled at build time through the CMake option `-DCELERITY_TRACY_SUPPORT=ON`. |
+Generally, these might need to be changed depending on the specific application and hardware setup to achieve the best possible performance, but the default values should work reasonably well in most cases: + +| Option | Values | Description | +| --- | --- | --- | +| `CELERITY_HORIZON_STEP` | *number* | Determines the maximum number of sequential tasks before a new [horizon task](https://doi.org/10.1007/s42979-024-02749-w) is introduced. | +| `CELERITY_HORIZON_MAX_PARALLELISM` | *number* | Determines the maximum number of parallel tasks before a new horizon task is introduced. | +| `CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS` | `on`, `off` | Controls whether device commands are submitted in a separate backend thread for each local device. This improves performance particularly in cases where kernel runtimes are very short. (default: `on`) | +| `CELERITY_THREAD_PINNING` | `off`, `auto`, `from:#`, *core list* | Controls if and how threads are pinned to CPU cores. `off` disables pinning, `auto` lets Celerity decide, `from:#` starts pinning sequentially from the given core, and a core list specifies the exact pinning (see below). (default: `auto`) | + +### Notes on Core Pinning + +Some Celerity threads benefit greatly from rapid communication, and we have observed performance differences of up to 50% for very fine-grained applications when pinning threads to specific CPU cores. The `CELERITY_THREAD_PINNING` environment variable can be set to a list of CPU cores to which Celerity should pin its threads. + +In most cases, `auto` should provide very good results. However, if you want to manually specify the core list, you can do so by providing a comma-separated list of core numbers. This list must precisely match the number and order of the threads Celerity uses -- for detailed information consult the definition of `celerity::detail::thread_pinning::thread_type`.
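For illustration, the pinning modes listed above could be selected as follows (a sketch only; `./my_example` stands in for any Celerity application binary, and the core numbers are arbitrary):

```shell
# Let Celerity pick the pinning layout (this is the default):
CELERITY_THREAD_PINNING=auto ./my_example

# Pin threads sequentially starting from core 8:
CELERITY_THREAD_PINNING=from:8 ./my_example

# Explicit core list -- its length and order must match Celerity's threads:
CELERITY_THREAD_PINNING=2,3,4,5 ./my_example

# Disable pinning entirely:
CELERITY_THREAD_PINNING=off ./my_example
```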
diff --git a/docs/installation.md b/docs/installation.md index e8fb4f852..59a742d0a 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -8,9 +8,9 @@ Celerity can be built and installed from [source](https://github.com/celerity/celerity-runtime) using [CMake](https://cmake.org). It requires the following dependencies: -- A MPI 2 implementation (for example [OpenMPI 4](https://www.open-mpi.org)) - A C++20 compiler - A supported SYCL implementation (see below) +- [*optional*] An MPI 2 implementation (for example [OpenMPI 4](https://www.open-mpi.org)) Note that while Celerity does support compilation and execution on Windows in principle, in this documentation we focus exclusively on Linux, as it @@ -18,7 +18,7 @@ represents the de-facto standard in HPC nowadays. ## Picking a SYCL Implementation -Celerity currently supports two different SYCL implementations. If you're +Celerity currently supports three different SYCL implementations. If you're simply giving Celerity a try, the choice does not matter all that much. For more advanced use cases or specific hardware setups it might however make sense to prefer one over the other. @@ -46,9 +46,16 @@ result in mysterious segfaults in the DPC++ SYCL library!) > Celerity works with DPC++ on Linux. +### SimSYCL + +[SimSYCL](https://github.com/celerity/SimSYCL) is a SYCL implementation that +is focused on development and verification of SYCL applications. Performance +is not a goal, and only CPUs are supported. It is a great choice for developing, +debugging, and testing your Celerity applications on small data sets. + Until its discontinuation in July 2023, Celerity also supported ComputeCpp as a SYCL implementation. -## Configuring CMake +## Configuring Your Build After installing all of the aforementioned dependencies, clone (we recommend using `git clone --recurse-submodules`) or download @@ -76,24 +83,24 @@ cmake -G "Unix Makefiles" .. 
-DCMAKE_CXX_COMPILER="/path/to/dpc++/bin/clang++" - In case multiple SYCL implementations are in CMake's search path, you can disambiguate them -using `-DCELERITY_SYCL_IMPL=AdaptiveCpp|DPC++`. +using `-DCELERITY_SYCL_IMPL=AdaptiveCpp|DPC++|SimSYCL`. Note that the `CMAKE_PREFIX_PATH` parameter should only be required if you installed SYCL in a non-standard location. See the [CMake documentation](https://cmake.org/documentation/) as well as the documentation for your SYCL implementation for more information on the other parameters. -Celerity comes with several example applications that are built by default. -If you don't want to build examples, provide `-DCELERITY_BUILD_EXAMPLES=0` as -an additional parameter to your CMake configuration call. - -Celerity supports runtime and application profiling with [Tracy](https://github.com/wolfpld/tracy). -The integration is disabled by default, to enable configure the build with `-DCELERITY_TRACY_SUPPORT=1`. At runtime, -it must then be enabled with the `CELERITY_TRACY` environment variable (see [README](../README.md)). +### Additional Configuration Options -By default Celerity is built for distributed systems with MPI pre-installed. -If you intend to run on a single-node system without MPI support, specify -`-DCELERITY_ENABLE_MPI=0` at configuration time. +The following additional CMake options are available: +| Option | Values | Description | +| --- | --- | --- | +| CELERITY_ACCESS_PATTERN_DIAGNOSTICS | 0, 1 | Diagnose uninitialized reads and overlapping writes (default: 1 for debug builds, 0 for release builds) | +| CELERITY_ACCESSOR_BOUNDARY_CHECK | 0, 1 | Enable boundary checks for accessors (default: 1 for debug builds, 0 for release builds) | +| CELERITY_BUILD_EXAMPLES | 0, 1 | Build the example applications (default: 1) | +| CELERITY_ENABLE_MPI | 0, 1 | Enable MPI support (default: 1) | +| CELERITY_TRACY_SUPPORT | 0, 1 | Enable [Tracy](https://github.com/wolfpld/tracy) support. 
See [Configuration](configuration.md) for runtime options. (default: 0) | +| `CELERITY_USE_MIMALLOC` | 0, 1 | Use the [mimalloc](https://github.com/microsoft/mimalloc) memory allocator (default: 1) | ## Building and Installing @@ -109,15 +116,21 @@ If you have configured CMake to build the Celerity example applications, you can now run them from within the build directory. For example, try running: ``` -mpirun -n 2 ./examples/matmul/matmul +./examples/matmul/matmul ``` > **Tip:** You might also want to try and run the unit tests that come with Celerity. > To do so, simply run `ninja test` or `make test`. +## Runtime Configuration + +Celerity comes with a number of runtime configuration options that can be +set via environment variables. Learn more about them [here](configuration.md). + ## Bootstrap your own Application All projects in the `examples/` directory are stand-alone Celerity programs – if you like a template for getting started, just copy one of them to bootstrap on your own Celerity application. You can find out more about that [here](https://github.com/celerity/celerity-runtime/blob/master/examples). +
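Putting the steps above together, a typical configure-build-run session might look like the following. This is a sketch under assumptions: it picks SimSYCL and a single-node (MPI-free) setup, and all paths and option values are illustrative, not prescriptive.

```shell
# Configure an out-of-source build, selecting the SYCL implementation
# explicitly and disabling MPI for a single-node setup:
cmake -B build -DCELERITY_SYCL_IMPL=SimSYCL -DCELERITY_ENABLE_MPI=0 -DCELERITY_BUILD_EXAMPLES=1

# Compile the runtime and the bundled examples:
cmake --build build

# Run an example directly -- no mpirun needed in the single-node case:
./build/examples/matmul/matmul
```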