
Commit

Incorporated additional review feedback, updated benchmark results with everything pinned
PeterTh committed Nov 20, 2024
1 parent 7622cd3 commit 7aefc24
Showing 10 changed files with 471 additions and 450 deletions.
16 changes: 9 additions & 7 deletions README.md
@@ -111,13 +111,15 @@ function to set up the required dependencies for a target (no need to link manua
 ## Running a Celerity Application
-For the single-node case, you can simply run your application and it will
-automatically use all available GPUs -- a simple way to limit this e.g.
-for benchmarking is using the vendor-specific environment variables such
+When running on a single machine, simply execute your application as you
+normally would -- no special invocation required. By default, the runtime
+will then attempt to use all available GPUs. A simple way of limiting this,
+e.g. for benchmarking, is to use vendor-specific environment variables such
 as `CUDA_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES` or `ONEAPI_DEVICE_SELECTOR`.
-In the distributed memory cluster case, since Celerity is built on top of MPI, a Celerity
-application can be executed like any other MPI application (i.e., using `mpirun` or equivalent).
+When targeting distributed memory clusters, a Celerity application can be
+executed like any other MPI-based application (i.e., using `mpirun` or equivalent).
-There are also [several environment variables](docs/configuration.md) that you can use to influence
-Celerity's runtime behavior.
+There are also [several environment variables](docs/configuration.md) which influence
+Celerity's runtime behavior. Tweaking these variables can be useful to tailor
+performance to specific systems, as well as for debugging Celerity applications.
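The run instructions in this hunk can be illustrated with a short sketch. The binary name `./my_app` is hypothetical, and the right device-selection variable depends on your SYCL backend and vendor:

```sh
# Single machine: restrict the runtime to the first two CUDA devices
CUDA_VISIBLE_DEVICES=0,1 ./my_app

# Distributed memory cluster: launch like any other MPI application, e.g. with 4 ranks
mpirun -n 4 ./my_app
```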
382 changes: 191 additions & 191 deletions ci/perf/gpuc2_bench.csv

Large diffs are not rendered by default.

384 changes: 192 additions & 192 deletions ci/perf/gpuc2_bench.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/configuration.md
@@ -19,9 +19,9 @@ specifically in development, debugging, and profiling scenarios:
 | Option | Values | Description |
 | --- | --- | --- |
 | `CELERITY_LOG_LEVEL` | `trace`, `debug`, `info`, `warn`, `err`, `critical`, `off` | Controls the logging output level. |
-| `CELERITY_PROFILE_KERNEL` | `on`, `off` | Controls whether SYCL queue profiling information should be queried. |
-| `CELERITY_PRINT_GRAPHS` | `on`, `off` | Controls whether task and command graphs are logged at the end of execution (requires log level `info` or higher). |
-| `CELERITY_DRY_RUN_NODES` | *number* | Simulates a run with the given number of nodes without actually executing the commands. |
+| `CELERITY_PROFILE_KERNEL` | `on`, `off` | Controls whether SYCL queue profiling information should be queried. This typically incurs additional overhead for each kernel launch. |
+| `CELERITY_PRINT_GRAPHS` | `on`, `off` | Controls whether task, command and instruction graphs are logged in Graphviz format at the end of execution (requires log level `info` or higher). Note that these can quickly become quite large, even for small applications. |
+| `CELERITY_DRY_RUN_NODES` | *number* | Simulates a run with the given number of nodes without actually executing any instructions (allocations, kernels, host tasks, etc.). Useful for investigating performance characteristics of the runtime itself. |
 | `CELERITY_TRACY` | `off`, `fast`, `full` | Controls the Tracy profiler integration. Set to `off` to disable, `fast` for light integration with little runtime overhead, and `full` for integration with extensive performance debug information included in the trace. Only available if integration was enabled at build time through the CMake option `-DCELERITY_TRACY_SUPPORT=ON`. |

## Environment Variables for Performance Tuning
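The variables documented above combine like any other environment variables. A hypothetical debugging invocation (application binary name assumed):

```sh
# Verbose logging plus a simulated 8-node dry run in which no instructions execute
CELERITY_LOG_LEVEL=debug CELERITY_DRY_RUN_NODES=8 ./my_app
```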
14 changes: 7 additions & 7 deletions docs/installation.md
@@ -93,14 +93,14 @@ for your SYCL implementation for more information on the other parameters.
### Additional Configuration Options

The following additional CMake options are available:
-| Option | Values | Description |
+| Option | Type | Description |
 | --- | --- | --- |
-| CELERITY_ACCESS_PATTERN_DIAGNOSTICS | 0, 1 | Diagnose uninitialized reads and overlapping writes (default: 1 for debug builds, 0 for release builds) |
-| CELERITY_ACCESSOR_BOUNDARY_CHECK | 0, 1 | Enable boundary checks for accessors (default: 1 for debug builds, 0 for release builds) |
-| CELERITY_BUILD_EXAMPLES | 0, 1 | Build the example applications (default: 1) |
-| CELERITY_ENABLE_MPI | 0, 1 | Enable MPI support (default: 1) |
-| CELERITY_TRACY_SUPPORT | 0, 1 | Enable [Tracy](https://github.com/wolfpld/tracy) support. See [Configuration](configuration.md) for runtime options. (default: 0) |
-| CELERITY_USE_MIMALLOC | 0, 1 | Use the [mimalloc](https://github.com/microsoft/mimalloc) memory allocator (default: 1) |
+| CELERITY_ACCESS_PATTERN_DIAGNOSTICS | `BOOL` | Diagnose uninitialized reads and overlapping writes (default: `ON` for debug builds, `OFF` for release builds) |
+| CELERITY_ACCESSOR_BOUNDARY_CHECK | `BOOL` | Enable boundary checks for accessors (default: `ON` for debug builds, `OFF` for release builds) |
+| CELERITY_BUILD_EXAMPLES | `BOOL` | Build the example applications (default: `ON`) |
+| CELERITY_ENABLE_MPI | `BOOL` | Enable MPI support (default: `ON`) |
+| CELERITY_TRACY_SUPPORT | `BOOL` | Enable [Tracy](https://github.com/wolfpld/tracy) support. See [Configuration](configuration.md) for runtime options. (default: `OFF`) |
+| CELERITY_USE_MIMALLOC | `BOOL` | Use the [mimalloc](https://github.com/microsoft/mimalloc) memory allocator (default: `ON`) |

## Building and Installing

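A hypothetical configure step combining a few of the boolean options from the table above (build directory layout and the remaining cache variables assumed):

```sh
cmake -DCELERITY_TRACY_SUPPORT=ON -DCELERITY_BUILD_EXAMPLES=OFF ..
```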
17 changes: 10 additions & 7 deletions include/affinity.h
@@ -18,23 +18,23 @@ namespace celerity::detail::thread_pinning {

constexpr uint32_t thread_type_step = 10000;

-// The threads Celerity interacts with ("user") and creates (everything else), identified for the purpose of pinning.
+// The threads Celerity interacts with ("application") and creates (everything else), identified for the purpose of pinning.
// Note: this is not an enum class to make interactions such as specifying `first_backend_worker+i` easier
enum thread_type : uint32_t {
application = 0 * thread_type_step,
scheduler = 1 * thread_type_step,
executor = 2 * thread_type_step,
-first_backend_worker = 3 * thread_type_step,
+first_device_submitter = 3 * thread_type_step,
first_host_queue = 4 * thread_type_step,
max = 5 * thread_type_step,
};
std::string thread_type_to_string(const thread_type t_type);

// User-level configuration of the thread pinning mechanism (set by the user via environment variables)
struct environment_configuration {
-bool enabled = true; // we want thread pinning to be enabled by default
-uint32_t starting_from_core = 1;
-std::vector<uint32_t> hardcoded_core_ids;
+bool enabled = true; // we want thread pinning to be enabled by default
+uint32_t starting_from_core = 1; // we default to starting from core 1 since core 0 is frequently used by some processes
+std::vector<uint32_t> hardcoded_core_ids; // starts empty, which means no hardcoded IDs are used
};

// Parses and validates the environment variable string, returning the corresponding configuration
@@ -43,7 +43,7 @@ environment_configuration parse_validate_env(const std::string_view str);
// Configures the pinning mechanism
// For now, only "standard" threads are pinned
// these are threads that benefit from rapid communication between each other,
-// i.e. scheduler -> executor -> backend workers
+// i.e. application -> scheduler -> executor -> device submission threads
// Extensible for future use where some threads might benefit from NUMA-aware per-GPU pinning
struct runtime_configuration {
// Whether or not to perform pinning
@@ -54,7 +54,7 @@ struct runtime_configuration {
// Whether backend device submission threads are used and need to have cores allocated to them
bool use_backend_device_submission_threads = true;

-// Number of processes running in legacy mode
+// Number of processes running in legacy mode on this machine
uint32_t num_legacy_processes = 1;
// Process index of current process running in legacy mode
uint32_t legacy_process_index = 0;
@@ -78,6 +78,9 @@ class thread_pinner {
thread_pinner& operator=(const thread_pinner&) = delete;
thread_pinner(thread_pinner&&) = default;
thread_pinner& operator=(thread_pinner&&) = default;

+  private:
+	bool m_successfully_initialized = false;
};

// Pins the invoking thread of type `t_type` according to the current configuration
5 changes: 2 additions & 3 deletions src/affinity.cc
@@ -19,8 +19,8 @@ std::string thread_type_to_string(const thread_type t_type) {
case thread_type::executor: return "executor";
default: break;
}
-if(t_type >= thread_type::first_backend_worker && t_type < thread_type::first_host_queue) {
-return fmt::format("backend_worker_{}", t_type - thread_type::first_backend_worker);
+if(t_type >= thread_type::first_device_submitter && t_type < thread_type::first_host_queue) {
+return fmt::format("device_submitter_{}", t_type - thread_type::first_device_submitter);
}
if(t_type >= thread_type::first_host_queue && t_type < thread_type::max) { return fmt::format("host_queue_{}", t_type - thread_type::first_host_queue); }
return fmt::format("unknown({})", static_cast<uint32_t>(t_type));
@@ -77,7 +77,6 @@ environment_configuration parse_validate_env(const std::string_view str) {
try {
return {env::default_parser<bool>{}(str), auto_start_from_core, {}};
} catch(const env::parser_error& e) { throw env::parser_error{fmt::format(error_msg, e.what())}; }
-return {};
}

} // namespace celerity::detail::thread_pinning
2 changes: 1 addition & 1 deletion src/backend/sycl_backend.cc
@@ -175,7 +175,7 @@ sycl_backend::sycl_backend(const std::vector<sycl::device>& devices, const confi
m_impl->devices[did].submission_thread.emplace(fmt::format("cy-be-submission-{}", did.value), m_impl->config.profiling);
// no need to wait for the event -> will happen before the first task is submitted
(void)m_impl->devices[did].submission_thread->submit([did] {
-thread_pinning::pin_this_thread(thread_pinning::thread_type(thread_pinning::thread_type::first_backend_worker + did.value));
+thread_pinning::pin_this_thread(thread_pinning::thread_type(thread_pinning::thread_type::first_device_submitter + did.value));
closure_hydrator::make_available();
});
}
Expand Down
43 changes: 25 additions & 18 deletions src/platform_specific/affinity.unix.cc
@@ -18,18 +18,13 @@ using namespace celerity::detail::thread_pinning;
std::vector<uint32_t> get_available_sequential_cores(const cpu_set_t& available_cores, const uint32_t count, const uint32_t starting_from_core) {
std::vector<uint32_t> cores;
uint32_t current_core = starting_from_core;
-uint32_t assigned_cores = 0;
for(uint32_t i = 0; i < count; ++i) {
// find the next sequential core we may use
while(CPU_ISSET(current_core, &available_cores) == 0 && current_core < CPU_SETSIZE) {
current_core++;
}
-if(current_core >= CPU_SETSIZE) {
-CELERITY_WARN("Ran out of available cores for thread pinning after assigning {} cores, will disable pinning.", assigned_cores);
-return {};
-}
+if(current_core >= CPU_SETSIZE) { return {}; }
 cores.push_back(current_core++);
-++assigned_cores;
}
return cores;
}
@@ -66,42 +61,51 @@ thread_local std::optional<thread_remover> t_remover; // NOLINT(cppcoreguideline
// Initializes the thread pinning machinery
// This captures the current thread's affinity mask and sets the thread pinning machinery up
// Calls to pin_this_thread prior to this call will have no effect
-void initialize(const runtime_configuration& cfg) {
+bool initialize(const runtime_configuration& cfg) {
std::lock_guard lock(g_state.mutex);
-assert(!g_state.initialized && "Thread pinning already initialized.");
+if(g_state.initialized) {
+CELERITY_ERROR("Thread pinning already initialized. Ignoring this initialization attempt.");
+return false;
+}
assert(g_state.thread_pinning_plan.empty() && "Thread pinning plan not initially empty.");
assert(g_state.pinned_threads.empty() && "Pinned threads not initially empty.");

g_state.config = cfg;

const auto ret = sched_getaffinity(0, sizeof(cpu_set_t), &g_state.available_cores);
if(ret != 0) {
-CELERITY_WARN("Error retrieving initial process affinity mask, will disable pinning.");
+CELERITY_WARN("Error retrieving initial process affinity mask. Unable to check whether enough logical cores are available to this process.{}",
+cfg.enabled ? " Will disable thread pinning." : "");
 g_state.config.enabled = false;
-return;
+return true;
}

-// pinned threads per process: user, scheduler, executor, 1 backend worker per device if enabled
+// pinned threads per process: application, scheduler, executor, 1 device submitter per device if enabled
uint32_t pinned_threads_per_process = 3;
if(g_state.config.use_backend_device_submission_threads) { pinned_threads_per_process += g_state.config.num_devices; }
-// total number of threads to be pinned across processes (legacy mode)
+// total number of threads to be pinned across processes (legacy mode) - we assume that each process has been assigned the same number of devices
const uint32_t total_threads = pinned_threads_per_process * g_state.config.num_legacy_processes;

if(g_state.config.enabled) {
// select the core set to use
std::vector<uint32_t> selected_core_ids = {};
if(!cfg.hardcoded_core_ids.empty()) {
-// just use the provided hardcoded IDs
+// attempt to use the provided hardcoded IDs if they match the number of threads to be pinned
if(static_cast<uint32_t>(cfg.hardcoded_core_ids.size()) != total_threads) {
-CELERITY_WARN("Hardcoded core ID count ({}) does not match the number of threads to be pinned ({}), will disable pinning.",
+CELERITY_WARN("Hardcoded core ID count ({}) does not match the number of threads to be pinned ({}), downgrading to auto-pinning.",
cfg.hardcoded_core_ids.size(), total_threads);
} else {
selected_core_ids = cfg.hardcoded_core_ids;
}
-} else {
+}
+if(selected_core_ids.empty()) {
// otherwise, sequential core assignments for now; it is most important that each of the threads is "close"
// to the ones next to it in this sequence, so that communication between them is fast
selected_core_ids = get_available_sequential_cores(g_state.available_cores, total_threads, cfg.standard_core_start_id);
+if(selected_core_ids.empty()) {
+CELERITY_WARN("Insufficient available cores for thread pinning (required {}, {} available), disabling pinning.", //
+total_threads, CPU_COUNT(&g_state.available_cores));
+}
}
// build our pinning plan based on the selected core list
if(selected_core_ids.empty()) {
@@ -114,7 +118,7 @@ void initialize(const runtime_configuration& cfg) {
if(g_state.config.use_backend_device_submission_threads) {
for(uint32_t i = 0; i < g_state.config.num_devices; ++i) {
const auto device_tid =
-static_cast<thread_type>(thread_type::first_backend_worker + i); // NOLINT(clang-analyzer-optin.core.EnumCastOutOfRange)
+static_cast<thread_type>(thread_type::first_device_submitter + i); // NOLINT(clang-analyzer-optin.core.EnumCastOutOfRange)
g_state.thread_pinning_plan.emplace(device_tid, selected_core_ids[current_core_id++]);
}
}
@@ -130,6 +134,7 @@ void initialize(const runtime_configuration& cfg) {
}

g_state.initialized = true;
+return true;
}

// Tears down the thread pinning machinery
@@ -166,8 +171,10 @@ void teardown() {

namespace celerity::detail::thread_pinning {

-thread_pinner::thread_pinner(const runtime_configuration& cfg) { initialize(cfg); }
-thread_pinner::~thread_pinner() { teardown(); }
+thread_pinner::thread_pinner(const runtime_configuration& cfg) : m_successfully_initialized(initialize(cfg)) {}
+thread_pinner::~thread_pinner() {
+if(m_successfully_initialized) { teardown(); }
+}

void pin_this_thread(const thread_type t_type) {
std::lock_guard lock(g_state.mutex);
