
Commit

Incorporated additional review feedback, updated benchmark results with everything pinned
PeterTh committed Nov 20, 2024
1 parent 7622cd3 commit 7aefc24
Showing 10 changed files with 471 additions and 450 deletions.
16 changes: 9 additions & 7 deletions README.md
@@ -111,13 +111,15 @@ function to set up the required dependencies for a target (no need to link manua
 ## Running a Celerity Application
-For the single-node case, you can simply run your application and it will
-automatically use all available GPUs -- a simple way to limit this e.g.
-for benchmarking is using the vendor-specific environment variables such
+When running on a single machine, simply execute your application as you
+normally would -- no special invocation required. By default, the runtime
+will then attempt to use all available GPUs. A simple way of limiting this,
+e.g. for benchmarking, is to use vendor-specific environment variables such
 as `CUDA_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES` or `ONEAPI_DEVICE_SELECTOR`.
-In the distributed memory cluster case, since Celerity is built on top of MPI, a Celerity
-application can be executed like any other MPI application (i.e., using `mpirun` or equivalent).
+When targeting distributed memory clusters, a Celerity application can be
+executed like any other MPI-based application (i.e., using `mpirun` or equivalent).
-There are also [several environment variables](docs/configuration.md) that you can use to influence
-Celerity's runtime behavior.
+There are also [several environment variables](docs/configuration.md) which influence
+Celerity's runtime behavior. Tweaking these variables can be useful to tailor
+performance to specific systems, as well as for debugging Celerity applications.
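The run instructions in this hunk can be illustrated with a short sketch. The binary name `./my_app` is hypothetical, and the right device-selection variable depends on your SYCL backend and vendor:

```sh
# Single machine: restrict the runtime to the first two CUDA devices
CUDA_VISIBLE_DEVICES=0,1 ./my_app

# Distributed memory cluster: launch like any other MPI application, e.g. with 4 ranks
mpirun -n 4 ./my_app
```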
382 changes: 191 additions & 191 deletions ci/perf/gpuc2_bench.csv

Large diffs are not rendered by default.

384 changes: 192 additions & 192 deletions ci/perf/gpuc2_bench.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/configuration.md
@@ -19,9 +19,9 @@ specifically in development, debugging, and profiling scenarios:
 | Option | Values | Description |
 | --- | --- | --- |
 | `CELERITY_LOG_LEVEL` | `trace`, `debug`, `info`, `warn`, `err`, `critical`, `off` | Controls the logging output level. |
-| `CELERITY_PROFILE_KERNEL` | `on`, `off` | Controls whether SYCL queue profiling information should be queried. |
-| `CELERITY_PRINT_GRAPHS` | `on`, `off` | Controls whether task and command graphs are logged at the end of execution (requires log level `info` or higher). |
-| `CELERITY_DRY_RUN_NODES` | *number* | Simulates a run with the given number of nodes without actually executing the commands. |
+| `CELERITY_PROFILE_KERNEL` | `on`, `off` | Controls whether SYCL queue profiling information should be queried. This typically incurs additional overhead for each kernel launch. |
+| `CELERITY_PRINT_GRAPHS` | `on`, `off` | Controls whether task, command and instruction graphs are logged in Graphviz format at the end of execution (requires log level `info` or higher). Note that these can quickly become quite large, even for small applications. |
+| `CELERITY_DRY_RUN_NODES` | *number* | Simulates a run with the given number of nodes without actually executing any instructions (allocations, kernels, host tasks, etc.). Useful for investigating performance characteristics of the runtime itself. |
 | `CELERITY_TRACY` | `off`, `fast`, `full` | Controls the Tracy profiler integration. Set to `off` to disable, `fast` for light integration with little runtime overhead, and `full` for integration with extensive performance debug information included in the trace. Only available if integration was enabled at build time through the CMake option `-DCELERITY_TRACY_SUPPORT=ON`. |

## Environment Variables for Performance Tuning
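The variables documented above combine like any other environment variables. A hypothetical debugging invocation (application binary name assumed):

```sh
# Verbose logging plus a simulated 8-node dry run in which no instructions execute
CELERITY_LOG_LEVEL=debug CELERITY_DRY_RUN_NODES=8 ./my_app
```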
14 changes: 7 additions & 7 deletions docs/installation.md
@@ -93,14 +93,14 @@ for your SYCL implementation for more information on the other parameters.
### Additional Configuration Options

The following additional CMake options are available:
-| Option | Values | Description |
+| Option | Type | Description |
 | --- | --- | --- |
-| CELERITY_ACCESS_PATTERN_DIAGNOSTICS | 0, 1 | Diagnose uninitialized reads and overlapping writes (default: 1 for debug builds, 0 for release builds) |
-| CELERITY_ACCESSOR_BOUNDARY_CHECK | 0, 1 | Enable boundary checks for accessors (default: 1 for debug builds, 0 for release builds) |
-| CELERITY_BUILD_EXAMPLES | 0, 1 | Build the example applications (default: 1) |
-| CELERITY_ENABLE_MPI | 0, 1 | Enable MPI support (default: 1) |
-| CELERITY_TRACY_SUPPORT | 0, 1 | Enable [Tracy](https://github.com/wolfpld/tracy) support. See [Configuration](configuration.md) for runtime options. (default: 0) |
-| CELERITY_USE_MIMALLOC | 0, 1 | Use the [mimalloc](https://github.com/microsoft/mimalloc) memory allocator (default: 1) |
+| CELERITY_ACCESS_PATTERN_DIAGNOSTICS | `BOOL` | Diagnose uninitialized reads and overlapping writes (default: `ON` for debug builds, `OFF` for release builds) |
+| CELERITY_ACCESSOR_BOUNDARY_CHECK | `BOOL` | Enable boundary checks for accessors (default: `ON` for debug builds, `OFF` for release builds) |
+| CELERITY_BUILD_EXAMPLES | `BOOL` | Build the example applications (default: `ON`) |
+| CELERITY_ENABLE_MPI | `BOOL` | Enable MPI support (default: `ON`) |
+| CELERITY_TRACY_SUPPORT | `BOOL` | Enable [Tracy](https://github.com/wolfpld/tracy) support. See [Configuration](configuration.md) for runtime options. (default: `OFF`) |
+| CELERITY_USE_MIMALLOC | `BOOL` | Use the [mimalloc](https://github.com/microsoft/mimalloc) memory allocator (default: `ON`) |

## Building and Installing

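A hypothetical configure step combining a few of the boolean options from the table above (build directory layout and the remaining cache variables assumed):

```sh
cmake -DCELERITY_TRACY_SUPPORT=ON -DCELERITY_BUILD_EXAMPLES=OFF ..
```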
17 changes: 10 additions & 7 deletions include/affinity.h
@@ -18,23 +18,23 @@ namespace celerity::detail::thread_pinning {

constexpr uint32_t thread_type_step = 10000;

-// The threads Celerity interacts with ("user") and creates (everything else), identified for the purpose of pinning.
+// The threads Celerity interacts with ("application") and creates (everything else), identified for the purpose of pinning.
// Note: this is not an enum class to make interactions such as specifying `first_backend_worker+i` easier
enum thread_type : uint32_t {
application = 0 * thread_type_step,
scheduler = 1 * thread_type_step,
executor = 2 * thread_type_step,
-first_backend_worker = 3 * thread_type_step,
+first_device_submitter = 3 * thread_type_step,
first_host_queue = 4 * thread_type_step,
max = 5 * thread_type_step,
};
std::string thread_type_to_string(const thread_type t_type);

// User-level configuration of the thread pinning mechanism (set by the user via environment variables)
struct environment_configuration {
-bool enabled = true; // we want thread pinning to be enabled by default
-uint32_t starting_from_core = 1;
-std::vector<uint32_t> hardcoded_core_ids;
+bool enabled = true; // we want thread pinning to be enabled by default
+uint32_t starting_from_core = 1; // we default to starting from core 1 since core 0 is frequently used by some processes
+std::vector<uint32_t> hardcoded_core_ids; // starts empty, which means no hardcoded IDs are used
};

// Parses and validates the environment variable string, returning the corresponding configuration
@@ -43,7 +43,7 @@ environment_configuration parse_validate_env(const std::string_view str);
// Configures the pinning mechanism
// For now, only "standard" threads are pinned
// these are threads that benefit from rapid communication between each other,
-// i.e. scheduler -> executor -> backend workers
+// i.e. application -> scheduler -> executor -> device submission threads
// Extensible for future use where some threads might benefit from NUMA-aware per-GPU pinning
struct runtime_configuration {
// Whether or not to perform pinning
@@ -54,7 +54,7 @@ struct runtime_configuration {
// Whether backend device submission threads are used and need to have cores allocated to them
bool use_backend_device_submission_threads = true;

-// Number of processes running in legacy mode
+// Number of processes running in legacy mode on this machine
uint32_t num_legacy_processes = 1;
// Process index of current process running in legacy mode
uint32_t legacy_process_index = 0;
@@ -78,6 +78,9 @@ class thread_pinner {
thread_pinner& operator=(const thread_pinner&) = delete;
thread_pinner(thread_pinner&&) = default;
thread_pinner& operator=(thread_pinner&&) = default;

+  private:
+	bool m_successfully_initialized = false;
};

// Pins the invoking thread of type `t_type` according to the current configuration
5 changes: 2 additions & 3 deletions src/affinity.cc
@@ -19,8 +19,8 @@ std::string thread_type_to_string(const thread_type t_type) {
case thread_type::executor: return "executor";
default: break;
}
-if(t_type >= thread_type::first_backend_worker && t_type < thread_type::first_host_queue) {
-return fmt::format("backend_worker_{}", t_type - thread_type::first_backend_worker);
+if(t_type >= thread_type::first_device_submitter && t_type < thread_type::first_host_queue) {
+return fmt::format("device_submitter_{}", t_type - thread_type::first_device_submitter);
}
if(t_type >= thread_type::first_host_queue && t_type < thread_type::max) { return fmt::format("host_queue_{}", t_type - thread_type::first_host_queue); }
return fmt::format("unknown({})", static_cast<uint32_t>(t_type));
@@ -77,7 +77,6 @@ environment_configuration parse_validate_env(const std::string_view str) {
try {
return {env::default_parser<bool>{}(str), auto_start_from_core, {}};
} catch(const env::parser_error& e) { throw env::parser_error{fmt::format(error_msg, e.what())}; }
-return {};
}

} // namespace celerity::detail::thread_pinning
2 changes: 1 addition & 1 deletion src/backend/sycl_backend.cc
@@ -175,7 +175,7 @@ sycl_backend::sycl_backend(const std::vector<sycl::device>& devices, const confi
m_impl->devices[did].submission_thread.emplace(fmt::format("cy-be-submission-{}", did.value), m_impl->config.profiling);
// no need to wait for the event -> will happen before the first task is submitted
(void)m_impl->devices[did].submission_thread->submit([did] {
-thread_pinning::pin_this_thread(thread_pinning::thread_type(thread_pinning::thread_type::first_backend_worker + did.value));
+thread_pinning::pin_this_thread(thread_pinning::thread_type(thread_pinning::thread_type::first_device_submitter + did.value));
closure_hydrator::make_available();
});
}
Expand Down
43 changes: 25 additions & 18 deletions src/platform_specific/affinity.unix.cc
@@ -18,18 +18,13 @@ using namespace celerity::detail::thread_pinning;
std::vector<uint32_t> get_available_sequential_cores(const cpu_set_t& available_cores, const uint32_t count, const uint32_t starting_from_core) {
std::vector<uint32_t> cores;
uint32_t current_core = starting_from_core;
-uint32_t assigned_cores = 0;
for(uint32_t i = 0; i < count; ++i) {
// find the next sequential core we may use
while(CPU_ISSET(current_core, &available_cores) == 0 && current_core < CPU_SETSIZE) {
current_core++;
}
-if(current_core >= CPU_SETSIZE) {
-CELERITY_WARN("Ran out of available cores for thread pinning after assigning {} cores, will disable pinning.", assigned_cores);
-return {};
-}
+if(current_core >= CPU_SETSIZE) { return {}; }
 cores.push_back(current_core++);
-++assigned_cores;
}
return cores;
}
@@ -66,42 +61,51 @@ thread_local std::optional<thread_remover> t_remover; // NOLINT(cppcoreguideline
// Initializes the thread pinning machinery
// This captures the current thread's affinity mask and sets the thread pinning machinery up
// Calls to pin_this_thread prior to this call will have no effect
-void initialize(const runtime_configuration& cfg) {
+bool initialize(const runtime_configuration& cfg) {
std::lock_guard lock(g_state.mutex);
-assert(!g_state.initialized && "Thread pinning already initialized.");
+if(g_state.initialized) {
+CELERITY_ERROR("Thread pinning already initialized. Ignoring this initialization attempt.");
+return false;
+}
assert(g_state.thread_pinning_plan.empty() && "Thread pinning plan not initially empty.");
assert(g_state.pinned_threads.empty() && "Pinned threads not initially empty.");

g_state.config = cfg;

const auto ret = sched_getaffinity(0, sizeof(cpu_set_t), &g_state.available_cores);
if(ret != 0) {
-CELERITY_WARN("Error retrieving initial process affinity mask, will disable pinning.");
+CELERITY_WARN("Error retrieving initial process affinity mask. Unable to check whether enough logical cores are available to this process.{}",
+cfg.enabled ? " Will disable thread pinning." : "");
 g_state.config.enabled = false;
-return;
+return true;
}

-// pinned threads per process: user, scheduler, executor, 1 backend worker per device if enabled
+// pinned threads per process: application, scheduler, executor, 1 device submitter per device if enabled
uint32_t pinned_threads_per_process = 3;
if(g_state.config.use_backend_device_submission_threads) { pinned_threads_per_process += g_state.config.num_devices; }
-// total number of threads to be pinned across processes (legacy mode)
+// total number of threads to be pinned across processes (legacy mode) - we assume that each process has been assigned the same number of devices
const uint32_t total_threads = pinned_threads_per_process * g_state.config.num_legacy_processes;

if(g_state.config.enabled) {
// select the core set to use
std::vector<uint32_t> selected_core_ids = {};
if(!cfg.hardcoded_core_ids.empty()) {
-// just use the provided hardcoded IDs
+// attempt to use the provided hardcoded IDs if they match the number of threads to be pinned
if(static_cast<uint32_t>(cfg.hardcoded_core_ids.size()) != total_threads) {
-CELERITY_WARN("Hardcoded core ID count ({}) does not match the number of threads to be pinned ({}), will disable pinning.",
+CELERITY_WARN("Hardcoded core ID count ({}) does not match the number of threads to be pinned ({}), downgrading to auto-pinning.",
cfg.hardcoded_core_ids.size(), total_threads);
} else {
selected_core_ids = cfg.hardcoded_core_ids;
}
-} else {
+}
+if(selected_core_ids.empty()) {
// otherwise, sequential core assignments for now; it is most important that each of the threads is "close"
// to the ones next to it in this sequence, so that communication between them is fast
selected_core_ids = get_available_sequential_cores(g_state.available_cores, total_threads, cfg.standard_core_start_id);
+if(selected_core_ids.empty()) {
+CELERITY_WARN("Insufficient available cores for thread pinning (required {}, {} available), disabling pinning.", //
+total_threads, CPU_COUNT(&g_state.available_cores));
+}
}
// build our pinning plan based on the selected core list
if(selected_core_ids.empty()) {
@@ -114,7 +118,7 @@ void initialize(const runtime_configuration& cfg) {
if(g_state.config.use_backend_device_submission_threads) {
for(uint32_t i = 0; i < g_state.config.num_devices; ++i) {
const auto device_tid =
-static_cast<thread_type>(thread_type::first_backend_worker + i); // NOLINT(clang-analyzer-optin.core.EnumCastOutOfRange)
+static_cast<thread_type>(thread_type::first_device_submitter + i); // NOLINT(clang-analyzer-optin.core.EnumCastOutOfRange)
g_state.thread_pinning_plan.emplace(device_tid, selected_core_ids[current_core_id++]);
}
}
@@ -130,6 +134,7 @@ void initialize(const runtime_configuration& cfg) {
}

g_state.initialized = true;
+return true;
}

// Tears down the thread pinning machinery
@@ -166,8 +171,10 @@ void teardown() {

namespace celerity::detail::thread_pinning {

-thread_pinner::thread_pinner(const runtime_configuration& cfg) { initialize(cfg); }
-thread_pinner::~thread_pinner() { teardown(); }
+thread_pinner::thread_pinner(const runtime_configuration& cfg) : m_successfully_initialized(initialize(cfg)) {}
+thread_pinner::~thread_pinner() {
+if(m_successfully_initialized) { teardown(); }
+}

void pin_this_thread(const thread_type t_type) {
std::lock_guard lock(g_state.mutex);
