ggml: introduce GGML_NUMA_MIGRATE to optimize cross NUMA op computation #14232
Conversation
@ggerganov @xctan Could you please take a look at the PR?
Interesting results. But overall I am not sure what is the best way to integrate this. As it is, there are too many #ifdefs, which would make maintenance difficult.
Can we somehow split the PR into smaller steps? For example, these changes require a different tensor and buffer alignment. Maybe we should first design a mechanism to adjust these alignments dynamically, and once we have that, the NUMA changes could use this new functionality to set the proper alignment. Not sure.
@slaren Do you have thoughts on this change?
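As a rough illustration of the dynamic-alignment idea above (not existing ggml code; the function name and the runtime query are hypothetical), the CPU buffer type could pick its alignment at runtime instead of via #ifdef:

// hypothetical sketch: choose the tensor/buffer alignment dynamically
static size_t ggml_cpu_buffer_alignment(void) {
    if (ggml_numa_migrate_enabled()) {           // hypothetical runtime switch
        // page-aligned buffers so whole pages can later be migrated between NUMA nodes
        return (size_t) sysconf(_SC_PAGE_SIZE);
    }
    return 64;                                   // usual SIMD/cache-line alignment (illustrative value)
}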
@@ -152,6 +152,11 @@ set(GGML_BLAS_VENDOR ${GGML_BLAS_VENDOR_DEFAULT} CACHE STRING
    "ggml: BLAS library vendor")
option(GGML_LLAMAFILE "ggml: use LLAMAFILE" ${GGML_LLAMAFILE_DEFAULT})

option(GGML_NUMA_MIGRATE "ggml: use NUMA_MIGRATE" OFF)
I think this should become GGML_CPU_NUMA. AFAIU, these changes require linking libnuma, is that correct?
Yeah, I'm OK with the name change since it's targeted for the CPU :)
option(GGML_NUMA_MIGRATE "ggml: use NUMA_MIGRATE" OFF)
set(GGML_NUMA_MIGRATE_NODES "2" CACHE STRING
    "ggml: the number of NUMA nodes during page migration")
option(GGML_NUMA_MIGRATE_DEBUG "ggml: enable debugging of NUMA_MIGRATE" OFF)
Avoid this by checking an environment variable instead. See GGML_SCHED_DEBUG for an example.
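For reference, a minimal sketch of that pattern (mirroring how GGML_SCHED_DEBUG is read from the environment); reusing GGML_NUMA_MIGRATE_DEBUG as the environment variable name is just an assumption here:

#include <cstdlib>   // std::getenv, std::atoi

static bool ggml_numa_migrate_debug() {
    static int debug = -1;                                    // cached after the first call
    if (debug == -1) {
        const char * env = std::getenv("GGML_NUMA_MIGRATE_DEBUG");
        debug = env ? std::atoi(env) : 0;
    }
    return debug != 0;
}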
#ifdef GGML_USE_NUMA_MIGRATE
GGML_API size_t ggml_backend_get_page_size(void);
#endif
This seems like it should be moved to ggml-cpu-impl.h.
OK
I tried to move the function definition from ggml-backend.cpp to ggml-cpu.c, but there are lots of link issues, since ggml-cpu.c lives in the ggml-cpu subdirectory and the function can't be referenced by files in the parent ggml directories:
./ggml/src/{files that need to reference the function}
./ggml/src/ggml-cpu
[ 18%] Linking CXX shared library ../bin/libllama.so
/usr/bin/ld: ../../bin/libggml-base.so: undefined reference to `ggml_cpu_get_page_size'
moved here:
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -583,6 +583,14 @@ int ggml_threadpool_chunk_add(struct ggml_threadpool * tp, int value) {
}
#ifdef GGML_USE_NUMA_MIGRATE
+int ggml_cpu_page_size = 0;
+size_t ggml_cpu_get_page_size(void) {
+    if (ggml_cpu_page_size == 0) {
+        ggml_cpu_page_size = sysconf(_SC_PAGE_SIZE);
+    }
+    return ggml_cpu_page_size;
+}
+
#ifdef GGML_USE_NUMA_MIGRATE
uint8_t * work_data_numa[GGML_NUMA_MIGRATE_NODES];
#endif
We have to find a way to avoid so many #ifdefs throughout the codebase. Not sure exactly how, but in this specific case we could maybe wrap the work_data logic into a type, and the underlying implementation can decide how many buffers to create, whether they are NUMA-aware, etc.
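As a rough sketch of that direction (illustrative only, not existing ggml code; the type name is made up), the wrapper could own either a single buffer or one per NUMA node, so callers never see the #ifdef:

#include <cstddef>
#include <cstdint>
#include <vector>

// hypothetical wrapper: the implementation decides how many work buffers exist
struct ggml_work_data {
    // one buffer on a regular build, one per node on a NUMA-aware build
    std::vector<std::vector<uint8_t>> bufs;

    void reserve(size_t size, int n_numa_nodes) {
        bufs.resize(n_numa_nodes > 1 ? n_numa_nodes : 1);
        for (auto & b : bufs) {
            b.resize(size);
        }
    }

    uint8_t * get(int node = 0) {
        return bufs[node % bufs.size()].data();  // regular builds always return the single buffer
    }
};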
Yeah, it's a valid concern, I'll need to find a better way.
static void migrate_pages_with_cache(void *addr, size_t size,
                                     bool force_memset) {
    if (size >= GGML_NUMA_MIGRATE_NODES * ggml_backend_page_size) {
        numa_migrate_mapping_cache current_addr(addr, size);
        std::lock_guard<std::mutex> lock(ggml_mapping_mutex);
        auto it = ggml_mapping_cache.find(current_addr);
        if (it == ggml_mapping_cache.end()) {
            GGML_ASSERT(((uint64_t)(addr) & (ggml_backend_page_size - 1)) == 0);
            int num_pages =
                size / ggml_backend_page_size / GGML_NUMA_MIGRATE_NODES;
            if (num_pages && ((size % ggml_backend_page_size) == 0)) {
                if (force_memset) {
                    memset(addr, 0, size); // force to allocate memory
                }
                if (migrate_pages_multiple_nodes(addr, size) != 0) {
                    GGML_LOG_DEBUG("Migration to multiple nodes failed, addr: "
                                   "%p, size: %ld\n",
                                   addr, size);
                } else {
#ifdef GGML_USE_NUMA_MIGRATE_DEBUG
                    check_numa_pages_migration(addr, size);
#endif
                }
                ggml_mapping_cache.insert(current_addr);
            }
        }
    }
}
I'm not very familiar with NUMA. Could you explain in simple terms the logic that is implemented here and why it improves the performance?
Basically, in a system with multiple NUMA nodes, access to memory on the local NUMA node is faster than access to a remote node. The tensor buffers are allocated through malloc(), which is not NUMA-aware (the memory might end up on any node, or interleaved, depending on the kernel config). Since a single tensor buffer is computed by multiple threads, on a multi-NUMA system this leads to lots of cross-NUMA memory accesses (for example, if a tensor buffer is allocated entirely on NUMA node 1, cores on node 0 have to access the whole buffer across nodes, which is very bad).
The change here moves the pages of the tensor buffer evenly across the NUMA nodes and then sets the core affinity accordingly. With this, each core only needs to access memory on its local NUMA node (no remote accesses), since the MUL_MAT op splits the tensor buffer based on the ith (core index). Hence there is a significant performance improvement on systems with large cross-NUMA memory access latency (which is the case for the aarch64 Neoverse platform; x86_64 platforms also benefit from the change, though less so, since their cross-NUMA latency is better).
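To make that concrete, here is a much-simplified sketch of the even page distribution described above, using libnuma. This is not the PR's migrate_pages_multiple_nodes() (the PR migrates already-allocated pages and also pins threads per node); the sketch only sets the placement policy per slice before first touch, and the helper name is made up:

#include <numa.h>        // libnuma: numa_available(), numa_tonode_memory(); link with -lnuma
#include <unistd.h>      // sysconf
#include <cstring>       // memset
#include <cstdint>

// spread a page-aligned buffer evenly over n_nodes NUMA nodes: bind each slice's
// placement policy to one node, then touch it so its pages are faulted in there
static void spread_buffer_over_nodes(void * addr, size_t size, int n_nodes) {
    const size_t page = (size_t) sysconf(_SC_PAGE_SIZE);
    if (numa_available() < 0 || n_nodes < 2 || size < (size_t) n_nodes * page) {
        return;                                           // non-NUMA system or buffer too small to split
    }
    const size_t slice = (size / n_nodes) & ~(page - 1);  // per-node, page-aligned slice
    uint8_t * p = (uint8_t *) addr;                       // addr is assumed page-aligned (the PR asserts this)
    for (int node = 0; node < n_nodes; ++node) {
        const size_t len = (node == n_nodes - 1) ? size - node * slice : slice;
        numa_tonode_memory(p + node * slice, len, node);  // set placement policy for this slice
        memset(p + node * slice, 0, len);                 // first touch allocates the pages on that node
    }
}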
Yeah, that's a good point. Could you please take a look at whether there's a way to refactor the interfaces and then address the cross-NUMA memory access problem in a cleaner way? The change here is more or less a showcase that there is a significant performance improvement if the problem is well addressed.
This PR adds the GGML_NUMA_MIGRATE feature to optimize cross-NUMA op computation, as cross-NUMA memory access becomes the bottleneck when spawning threads across NUMA nodes.
With the feature enabled, tested on Neoverse N2 with multiple NUMA nodes (64 cores per NUMA node), there are clear performance improvements when running a Llama 3 model; see the test results below:
bench Test Results
Base data
running command:
results:
Result with the feature
running command:
results:
improvement:
batched bench Test Results
Base data
running command:
results:
Result with the feature
running command:
results:
improvement:
Tested perplexity as well; there is no regression with the feature:
Enablement
The feature is disabled by default; here is the guidance to enable it:
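For example, a plausible build invocation based on the CMake options added in this PR (the author's exact steps may differ):

cmake -B build -DGGML_NUMA_MIGRATE=ON -DGGML_NUMA_MIGRATE_NODES=2
cmake --build build --config Release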