Commit

Updated documentation
PeterTh committed Nov 19, 2024
1 parent 0f4ff05 commit 749bc6d
Showing 4 changed files with 98 additions and 48 deletions.
5 changes: 4 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -18,7 +18,9 @@ Versioning](http://semver.org/spec/v2.0.0.html).
- Experimental support for the AdaptiveCpp generic single-pass compiler (#294)
- Constructor overloads to the `access::neighborhood` range mapper for reads in 3/5/7-point stencil codes (#292)
- The SYCL backend now uses per-device submission threads to dispatch commands for better performance.
This new behaviour is enabled by default, and can be disabled via "CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS" (#303)
This new behaviour is enabled by default, and can be disabled via `CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS` (#303)
- Celerity now has a thread pinning mechanism to control how threads are pinned to CPU cores.
This can be controlled via the `CELERITY_THREAD_PINNING` environment variable (#309)

### Changed

@@ -30,6 +32,7 @@ Versioning](http://semver.org/spec/v2.0.0.html).
This operation has a much more pronounced performance penalty than its SYCL counterpart (#283)
- On systems that do not support device-to-device copies, data is now staged in linearized buffers for better performance (#287)
- The `access::neighborhood` built-in range mapper now receives a `range` instead of a coordinate list (#292)
- Overhauled the [installation](docs/installation.md) and [configuration](docs/configuration.md) documentation (#309)

### Fixed

56 changes: 24 additions & 32 deletions README.md
@@ -5,7 +5,8 @@
# Celerity Runtime — [![CI Workflow](https://github.com/celerity/celerity-runtime/actions/workflows/celerity_ci.yml/badge.svg)](https://github.com/celerity/celerity-runtime/actions/workflows/celerity_ci.yml) [![Coverage Status](https://coveralls.io/repos/github/celerity/celerity-runtime/badge.svg?branch=master)](https://coveralls.io/github/celerity/celerity-runtime?branch=master) [![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/celerity/celerity-runtime/blob/master/LICENSE) [![Semver 2.0](https://img.shields.io/badge/semver-2.0.0-blue)](https://semver.org/spec/v2.0.0.html) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/celerity/celerity-runtime/blob/master/CONTRIBUTING.md)

The Celerity distributed runtime and API aims to bring the power and ease of
use of [SYCL](https://sycl.tech) to distributed memory clusters.
use of [SYCL](https://sycl.tech) to multi-GPU systems and distributed memory
clusters, with transparent scaling.

> If you want a step-by-step introduction on how to set up dependencies and
> implement your first Celerity application, check out the
@@ -47,25 +48,28 @@ queue.submit([&](celerity::handler& cgh) {
2. Submit a kernel to be executed by 1024 parallel _work items_. This kernel
may be split across any number of nodes.
3. Kernels can be expressed as C++11 lambda functions, just like in SYCL. In
3. Kernels can be expressed as C++ lambda functions, just like in SYCL. In
fact, no changes to your existing kernels are required.
4. Access your buffers as if they reside on a single device -- even though
they might be scattered throughout the cluster.
### Run it like any other MPI application
The kernel shown above can be run on a single GPU, just like in SYCL, or on a
whole cluster -- without having to change anything about the program itself.
The kernel shown above can be run on a single GPU, just like in SYCL, on
multiple GPUs attached to a single shared memory system, or on a whole cluster
-- without having to change anything about the program itself.
For example, if we were to run it on two GPUs using `mpirun -n 2 ./my_example`,
the first GPU might compute the range `0-512` of the kernel, while the second
one computes `512-1024`. However, as the user, you don't have to care how
On a cluster with 2 nodes and a single GPU each, `mpirun -n 2 ./my_example`
might have the first GPU compute the range `0-512` of the kernel, while the
second one computes `512-1024`. However, as the user, you don't have to care how
exactly your computation is being split up.
By default, a Celerity application will use all the attached GPUs, i.e. running
`./my_example` on a machine with 4 GPUs will automatically use all of them.
To see how you can use the result of your computation, look at some of our
fully-fledged [examples](examples), or follow the
[tutorial](docs/tutorial.md)!
fully-fledged [examples](examples), or follow the [tutorial](docs/tutorial.md)!
## Building Celerity
@@ -79,9 +83,9 @@ installed first.
- [AdaptiveCpp](https://github.com/AdaptiveCpp/AdaptiveCpp),
- [DPC++](https://github.com/intel/llvm), or
- [SimSYCL](https://github.com/celerity/SimSYCL)
- A MPI 2 implementation (tested with OpenMPI 4.0, MPICH 3.3 should work as well)
- [CMake](https://www.cmake.org) (3.13 or newer)
- A C++20 compiler
- [*optional*] An MPI 2 implementation (tested with OpenMPI 4.0, MPICH 3.3 should work as well)
See the [platform support guide](docs/platform-support.md) for details on which library and OS versions are supported and
automatically tested.
@@ -107,25 +111,13 @@ function to set up the required dependencies for a target (no need to link manually)
## Running a Celerity Application
Celerity is built on top of MPI, which means a Celerity application can be
executed like any other MPI application (i.e., using `mpirun` or equivalent).
There are several environment variables that you can use to influence
Celerity's runtime behavior:
### Environment Variables
- `CELERITY_LOG_LEVEL` controls the logging output level. One of `trace`, `debug`,
`info`, `warn`, `err`, `critical`, or `off`.
- `CELERITY_PROFILE_KERNEL` controls whether SYCL queue profiling information should be queried.
- `CELERITY_PRINT_GRAPHS` controls whether task and command graphs are logged
at the end of execution (requires log level `info` or higher).
- `CELERITY_DRY_RUN_NODES` takes a number and simulates a run with that many nodes
without actually executing the commands.
- `CELERITY_HORIZON_STEP` and `CELERITY_HORIZON_MAX_PARALLELISM` determine the
maximum number of sequential and parallel tasks, respectively, before a new
[horizon task](https://doi.org/10.1007/s42979-024-02749-w) is introduced.
- `CELERITY_TRACY` controls the Tracy profiler integration. Set to `off` to disable,
`fast` for light integration with little runtime overhead, and `full` for
integration with extensive performance debug information included in the trace.
  Only available if integration was enabled at build time through the
CMake option `-DCELERITY_TRACY_SUPPORT=ON`.
For the single-node case, you can simply run your application and it will
automatically use all available GPUs -- a simple way to limit this e.g.
for benchmarking is using the vendor-specific environment variables such
as `CUDA_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES` or `ONEAPI_DEVICE_SELECTOR`.
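For example, a single-node run restricted to two GPUs might look like the following sketch (`my_example` is the binary from this README; the device IDs and the Intel selector string are placeholder assumptions, so adapt them to your system):

```shell
# NVIDIA: expose only GPUs 0 and 1 to the application (IDs are examples).
export CUDA_VISIBLE_DEVICES=0,1
# AMD and Intel equivalents (pick the one matching your hardware):
# export HIP_VISIBLE_DEVICES=0,1
# export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
# Then launch as usual; Celerity uses all devices left visible:
# ./my_example
```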
In the distributed memory cluster case, since Celerity is built on top of MPI, a Celerity
application can be executed like any other MPI application (i.e., using `mpirun` or equivalent).
There are also [several environment variables](docs/configuration.md) that you can use to influence
Celerity's runtime behavior.
42 changes: 42 additions & 0 deletions docs/configuration.md
@@ -0,0 +1,42 @@
---
id: configuration
title: Configuration
sidebar_label: Configuration
---

After successfully [installing](installation.md) Celerity, you can tune its runtime behaviour via a number of environment variables. This page lists all available options.

Note that some of these runtime options require a [corresponding CMake option](installation.md#additional-configuration-options) to be enabled during the build process.
This is generally the case if the option has a negative impact on performance, or if it is not required in most use cases.

Celerity uses [libenvpp](https://github.com/ph3at/libenvpp) for environment variable handling,
which means that typos in environment variable names or invalid values will be detected and reported.

## Environment Variables for Debugging and Profiling

The following environment variables can be used to control Celerity's runtime behaviour,
particularly in development, debugging, and profiling scenarios:

| Option | Values | Description |
| --- | --- | --- |
| `CELERITY_LOG_LEVEL` | `trace`, `debug`, `info`, `warn`, `err`, `critical`, `off` | Controls the logging output level. |
| `CELERITY_PROFILE_KERNEL` | `on`, `off` | Controls whether SYCL queue profiling information should be queried. |
| `CELERITY_PRINT_GRAPHS` | `on`, `off` | Controls whether task and command graphs are logged at the end of execution (requires log level `info` or higher). |
| `CELERITY_DRY_RUN_NODES` | *number* | Simulates a run with the given number of nodes without actually executing the commands. |
| `CELERITY_TRACY` | `off`, `fast`, `full` | Controls the Tracy profiler integration. Set to `off` to disable, `fast` for light integration with little runtime overhead, and `full` for integration with extensive performance debug information included in the trace. Only available if integration was enabled at build time through the CMake option `-DCELERITY_TRACY_SUPPORT=ON`. |

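For instance, a debugging session that inspects the generated graphs might combine several of these variables as follows (a sketch; the application name is a placeholder and the node count is just an example):

```shell
# Verbose logging plus task/command graph output (requires level info or higher).
export CELERITY_LOG_LEVEL=debug
export CELERITY_PRINT_GRAPHS=on
# Simulate a 4-node run without actually executing commands:
export CELERITY_DRY_RUN_NODES=4
# ./my_application
```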
## Environment Variables for Performance Tuning

The following environment variables can be used to tune Celerity's performance.
Generally, these might need to be changed depending on the specific application and hardware setup to achieve the best possible performance, but the default values should work reasonably well in all cases:

| Option | Values | Description |
| --- | --- | --- |
| `CELERITY_HORIZON_STEP` | *number* | Determines the maximum number of sequential tasks before a new [horizon task](https://doi.org/10.1007/s42979-024-02749-w) is introduced. |
| `CELERITY_HORIZON_MAX_PARALLELISM` | *number* | Determines the maximum number of parallel tasks before a new horizon task is introduced. |
| `CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS` | `on`, `off` | Controls whether device commands are submitted in a separate backend thread for each local device. This improves performance particularly in cases where kernel runtimes are very short. (default: `on`) |
| `CELERITY_THREAD_PINNING` | `off`, `auto`, `from:#`, *core list* | Controls if and how threads are pinned to CPU cores. `off` disables pinning, `auto` lets Celerity decide, `from:#` starts pinning sequentially from the given core, and a core list specifies the exact pinning (see below). (default: `auto`) |

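As a sketch, a tuning experiment for an application with many very short kernels might start from values like these (illustrative numbers, not recommendations):

```shell
# Introduce a horizon after every 2 sequential tasks (illustrative value).
export CELERITY_HORIZON_STEP=2
# Keep per-device submission threads enabled (the default), which helps
# particularly when kernel runtimes are very short.
export CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS=on
```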
### Notes on Core Pinning

Some Celerity threads benefit greatly from rapid communication, and we have observed performance differences of up to 50% for very fine-grained applications when pinning threads to specific CPU cores. The `CELERITY_THREAD_PINNING` environment variable can be set to a list of CPU cores to which Celerity should pin its threads.

In most cases, `auto` should provide very good results. However, if you want to specify the core list manually, you can do so by providing a comma-separated list of core numbers. The length and order of this list must precisely match the threads Celerity uses -- for detailed information, consult the definition of `celerity::detail::thread_pinning::thread_type`.
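Both manual forms can be sketched as follows (the core numbers are examples and must match your machine's topology and Celerity's thread count):

```shell
# Pin Celerity's threads sequentially, starting at core 4:
export CELERITY_THREAD_PINNING=from:4
# Or give an explicit core list, one entry per Celerity thread (order matters):
# export CELERITY_THREAD_PINNING=4,5,6,7
```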
43 changes: 28 additions & 15 deletions docs/installation.md
@@ -8,17 +8,17 @@ Celerity can be built and installed from
[source](https://github.com/celerity/celerity-runtime) using
[CMake](https://cmake.org). It requires the following dependencies:

- A MPI 2 implementation (for example [OpenMPI 4](https://www.open-mpi.org))
- A C++20 compiler
- A supported SYCL implementation (see below)
- [*optional*] An MPI 2 implementation (for example [OpenMPI 4](https://www.open-mpi.org))

Note that while Celerity does support compilation and execution on Windows in
principle, in this documentation we focus exclusively on Linux, as it
represents the de-facto standard in HPC nowadays.

## Picking a SYCL Implementation

Celerity currently supports two different SYCL implementations. If you're
Celerity currently supports three different SYCL implementations. If you're
simply giving Celerity a try, the choice does not matter all that much. For
more advanced use cases or specific hardware setups it might however make
sense to prefer one over the others.
@@ -46,9 +46,16 @@ result in mysterious segfaults in the DPC++ SYCL library!)

> Celerity works with DPC++ on Linux.
### SimSYCL

[SimSYCL](https://github.com/celerity/SimSYCL) is a SYCL implementation that
is focused on development and verification of SYCL applications. Performance
is not a goal, and only CPUs are supported. It is a great choice for developing,
debugging, and testing your Celerity applications on small data sets.

Until its discontinuation in July 2023, Celerity also supported ComputeCpp as a SYCL implementation.

## Configuring CMake
## Configuring Your Build

After installing all of the aforementioned dependencies, clone (we recommend
using `git clone --recurse-submodules`) or download
@@ -76,24 +83,24 @@ cmake -G "Unix Makefiles" .. -DCMAKE_CXX_COMPILER="/path/to/dpc++/bin/clang++" -
<!--END_DOCUSAURUS_CODE_TABS-->

In case multiple SYCL implementations are in CMake's search path, you can disambiguate them
using `-DCELERITY_SYCL_IMPL=AdaptiveCpp|DPC++`.
using `-DCELERITY_SYCL_IMPL=AdaptiveCpp|DPC++|SimSYCL`.

Note that the `CMAKE_PREFIX_PATH` parameter should only be required if you
installed SYCL in a non-standard location. See the [CMake
documentation](https://cmake.org/documentation/) as well as the documentation
for your SYCL implementation for more information on the other parameters.

Celerity comes with several example applications that are built by default.
If you don't want to build examples, provide `-DCELERITY_BUILD_EXAMPLES=0` as
an additional parameter to your CMake configuration call.

Celerity supports runtime and application profiling with [Tracy](https://github.com/wolfpld/tracy).
The integration is disabled by default; to enable it, configure the build with `-DCELERITY_TRACY_SUPPORT=1`. At runtime,
it must then be enabled with the `CELERITY_TRACY` environment variable (see [README](../README.md)).
### Additional Configuration Options

By default Celerity is built for distributed systems with MPI pre-installed.
If you intend to run on a single-node system without MPI support, specify
`-DCELERITY_ENABLE_MPI=0` at configuration time.
The following additional CMake options are available:

| Option | Values | Description |
| --- | --- | --- |
| `CELERITY_ACCESS_PATTERN_DIAGNOSTICS` | 0, 1 | Diagnose uninitialized reads and overlapping writes (default: 1 for debug builds, 0 for release builds) |
| `CELERITY_ACCESSOR_BOUNDARY_CHECK` | 0, 1 | Enable boundary checks for accessors (default: 1 for debug builds, 0 for release builds) |
| `CELERITY_BUILD_EXAMPLES` | 0, 1 | Build the example applications (default: 1) |
| `CELERITY_ENABLE_MPI` | 0, 1 | Enable MPI support (default: 1) |
| `CELERITY_TRACY_SUPPORT` | 0, 1 | Enable [Tracy](https://github.com/wolfpld/tracy) support. See [Configuration](configuration.md) for runtime options. (default: 0) |
| `CELERITY_USE_MIMALLOC` | 0, 1 | Use the [mimalloc](https://github.com/microsoft/mimalloc) memory allocator (default: 1) |
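A configure call combining several of the options above might look like the following sketch (the generator, build type, and option values are illustrative choices, not recommendations):

```
cmake -G "Unix Makefiles" .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCELERITY_SYCL_IMPL=SimSYCL \
  -DCELERITY_BUILD_EXAMPLES=1 \
  -DCELERITY_ENABLE_MPI=0 \
  -DCELERITY_TRACY_SUPPORT=0
```

Here `-DCELERITY_ENABLE_MPI=0` pairs naturally with SimSYCL for single-node development and testing.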

## Building and Installing

@@ -109,15 +116,21 @@ If you have configured CMake to build the Celerity example applications, you
can now run them from within the build directory. For example, try running:

```
mpirun -n 2 ./examples/matmul/matmul
./examples/matmul/matmul
```

> **Tip:** You might also want to try and run the unit tests that come with Celerity.
> To do so, simply run `ninja test` or `make test`.
## Runtime Configuration

Celerity comes with a number of runtime configuration options that can be
set via environment variables. Learn more about them [here](configuration.md).

## Bootstrap your own Application

All projects in the `examples/` directory are stand-alone Celerity programs
– if you'd like a template for getting started, just copy one of them to
bootstrap your own Celerity application. You can find out more about that
[here](https://github.com/celerity/celerity-runtime/blob/master/examples).
