
ggml: introduce GGML_NUMA_MIGRATE to optimize cross NUMA op computation #14232


Open · wants to merge 1 commit into master

Conversation

@wenlujon commented Jun 17, 2025

This PR adds a GGML_NUMA_MIGRATE feature to optimize cross-NUMA op computation, since cross-NUMA memory access becomes the bottleneck when threads are spawned across NUMA nodes:

  1. optimize ggml_barrier() for the cross-NUMA case by adding ggml_barrier_numa_aware()
  2. add a build option, GGML_NUMA_MIGRATE, to enable the feature
  3. add a runtime option, --numa migrate, available when GGML_NUMA_MIGRATE is enabled. It migrates pages across NUMA nodes so that the mul_mat op only computes on the local-NUMA portion of the src0 and dst tensor data (selected by the thread index ith); cores are pinned to their NUMA node when the option is set, and src1 data is quantized into the local-NUMA wdata. A minimal sketch of the page-migration idea follows this list.
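
To illustrate the page-migration part in isolation, here is a minimal, hypothetical sketch (not the PR's actual code) of spreading the pages of a page-aligned buffer evenly across two NUMA nodes with libnuma's numa_move_pages(); the helper name and the fixed NUMA_NODES constant are assumptions for the example:

```c
// Hypothetical sketch, not the PR's implementation: spread the pages of a
// page-aligned buffer evenly across NUMA_NODES nodes using libnuma.
// Build with: gcc -O2 -c numa_spread.c && link with -lnuma
#include <numa.h>      // numa_move_pages()
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUMA_NODES 2   // assumption: mirrors the default GGML_NUMA_MIGRATE_NODES=2

int spread_pages_across_nodes(void * addr, size_t size) {
    const size_t page_size = (size_t) sysconf(_SC_PAGE_SIZE);
    const size_t n_pages   = size / page_size;

    // the buffer must be page-aligned and cover whole pages
    if (n_pages == 0 || ((uintptr_t) addr % page_size) != 0 || (size % page_size) != 0) {
        return -1;
    }

    void ** pages  = malloc(n_pages * sizeof(void *));
    int   * nodes  = malloc(n_pages * sizeof(int));
    int   * status = malloc(n_pages * sizeof(int));

    // contiguous split: the first half of the pages goes to node 0, the second half to node 1
    const size_t pages_per_node = n_pages / NUMA_NODES;
    for (size_t i = 0; i < n_pages; i++) {
        pages[i] = (char *) addr + i * page_size;
        size_t node = pages_per_node ? i / pages_per_node : 0;
        nodes[i] = (int) (node < NUMA_NODES ? node : NUMA_NODES - 1);
    }

    // pid 0 means "the calling process"
    int ret = numa_move_pages(0, n_pages, pages, nodes, status, 0);
    if (ret != 0) {
        fprintf(stderr, "numa_move_pages failed or migrated pages only partially (ret=%d)\n", ret);
    }

    free(pages);
    free(nodes);
    free(status);
    return ret;
}
```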

With the feature enabled, tested on a Neoverse N2 system with multiple NUMA nodes (64 cores per NUMA node), there are clear performance improvements when running a Llama 3 model; see the test results below:

llama-bench Test Results

Base data

running command:

numactl -m 0 -N 0 $LLAMA/build/bin/llama-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -r 1 -t 64 -p 512 -n 128

results:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |      64 |           pp512 |        505.06 ± 0.00 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |      64 |           tg128 |         25.43 ± 0.00 |
 
build: e434e691 (5686)

Results with the feature

running command:

numactl -m 0,1 -N 0,1 $LLAMA/build/bin/llama-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -r 1 -t 128 -p 512 -n 128 --numa migrate

results:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |     128 |           pp512 |        761.85 ± 0.00 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |     128 |           tg128 |         33.76 ± 0.00 |
 
build: e434e691 (5686) 

improvement:


TC | Base (t/s) | OPT (t/s) | Uplift
-- | -- | -- | --
pp512 | 505.06 | 761.85 | 50.8%
tg128 | 25.43 | 33.76 | 32.7%


llama-batched-bench Test Results

Base data

running command:

numactl -m 0 -N 0 $LLAMA/build/bin/llama-batched-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --cache-type-k q8_0 --cache-type-v q8_0 -fa

results:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |    128 |    1 |    256 |    0.268 |   477.19 |    5.047 |    25.36 |    5.315 |    48.17 |
|   128 |    128 |    4 |   1024 |    0.940 |   544.73 |    5.721 |    89.50 |    6.661 |   153.74 |
|   128 |    128 |    8 |   2048 |    1.839 |   556.85 |    8.108 |   126.29 |    9.947 |   205.88 |
|   128 |    128 |   16 |   4096 |    3.689 |   555.11 |    8.973 |   228.23 |   12.663 |   323.47 |
|   128 |    128 |   32 |   8192 |    7.475 |   547.94 |   14.847 |   275.89 |   22.322 |   366.99 |


Results with the feature

running command:

numactl -m 0,1 -N 0,1 $LLAMA/build/bin/llama-batched-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 128 --numa migrate --cache-type-k q8_0 --cache-type-v q8_0  -fa

results:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |    128 |    1 |    256 |    0.165 |   777.10 |    3.628 |    35.28 |    3.793 |    67.49 |
|   128 |    128 |    4 |   1024 |    0.556 |   921.39 |    4.312 |   118.73 |    4.868 |   210.35 |
|   128 |    128 |    8 |   2048 |    1.057 |   968.92 |    5.529 |   185.20 |    6.586 |   310.96 |
|   128 |    128 |   16 |   4096 |    2.133 |   960.04 |    6.414 |   319.30 |    8.547 |   479.22 |
|   128 |    128 |   32 |   8192 |    4.311 |   950.20 |   10.013 |   409.07 |   14.324 |   571.92 |

improvement:

B | Base S_TG (t/s) | OPT S_TG (t/s) | Uplift
-- | -- | -- | --
1 | 25.36 | 35.28 | 39.1%
4 | 89.50 | 118.73 | 32.6%
8 | 126.29 | 185.20 | 46.6%
16 | 228.23 | 319.30 | 39.9%
32 | 275.89 | 409.07 | 48.2%

Perplexity was also tested; there is no regression with the feature:

$ numactl -m 0,1 -N 0,1 $LLAMA/build/bin/llama-perplexity -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -f $LLAMA/wikitext-2-raw/wiki.test.raw -t 128 --numa migrate

Final estimate: PPL = 8.7547 +/- 0.06483


Enablement

The feature is disabled by default; here is the guidance to enable it:

  1. enable GGML_NUMA_MIGRATE during cmake:
$ cmake -B build -DGGML_NUMA_MIGRATE=ON
  2. add the --numa migrate option when running, and set the thread count according to the number of NUMA nodes. The default is 2 nodes, which was the best trade-off on the Neoverse N2 platform used for testing; it can be changed via the GGML_NUMA_MIGRATE_NODES build option. It is also better to bind llama.cpp to NUMA nodes 0 and 1 via numactl -m 0,1 -N 0,1.

@github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 17, 2025
@wenlujon (Author) commented:

@ggerganov @xctan Could you please take a look at the PR?

@ggerganov (Member) left a review comment:

Interesting results. But overall I am not sure what the best way to integrate this is. As it is, there are too many ifdefs, which would make maintenance difficult.

Can we somehow split the PR into smaller steps? For example, these changes require different tensor and buffer alignment. Should we first design a mechanism to adjust these alignments dynamically, so that once we have it, the NUMA changes could use this new functionality to set the proper alignment? Not sure.

@slaren Do you have thoughts on this change?

@@ -152,6 +152,11 @@ set(GGML_BLAS_VENDOR ${GGML_BLAS_VENDOR_DEFAULT} CACHE STRING
"ggml: BLAS library vendor")
option(GGML_LLAMAFILE "ggml: use LLAMAFILE" ${GGML_LLAMAFILE_DEFAULT})

option(GGML_NUMA_MIGRATE "ggml: use NUMA_MIGRATE" OFF)
Member:

I think this should become GGML_CPU_NUMA. AFAIU, these changes require linking libnuma, is that correct?

Author:

Yeah, I'm OK with the name change since it's targeted at the CPU. :)

option(GGML_NUMA_MIGRATE "ggml: use NUMA_MIGRATE" OFF)
set(GGML_NUMA_MIGRATE_NODES "2" CACHE STRING
"ggml: the number of NUMA nodes during page migration")
option(GGML_NUMA_MIGRATE_DEBUG "ggml: enable debugging of NUMA_MIGRATE" OFF)
Member:

Avoid this by checking an environment variable instead. See GGML_SCHED_DEBUG for an example.
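
For reference, a minimal sketch of that pattern (reading the flag from a hypothetical GGML_NUMA_MIGRATE_DEBUG environment variable, by analogy with how GGML_SCHED_DEBUG is checked):

```c
// Hedged sketch of the suggested pattern: read the debug flag once from the
// environment instead of adding a CMake option. The variable name is an
// assumption for illustration.
#include <stdlib.h>

int ggml_numa_migrate_debug(void) {
    static int debug = -1;                 // -1 = not read yet
    if (debug < 0) {
        const char * env = getenv("GGML_NUMA_MIGRATE_DEBUG");
        debug = env ? atoi(env) : 0;
    }
    return debug;
}
```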

Comment on lines +351 to +353
#ifdef GGML_USE_NUMA_MIGRATE
GGML_API size_t ggml_backend_get_page_size(void);
#endif
Member:

This seems like it should be moved to ggml-cpu-impl.h.

Author:

OK

Author:

I tried to move the function definition from ggml-backend.cpp to ggml-cpu.c, but there are lots of link issues, since ggml-cpu.c lives in the ggml-cpu subdirectory and cannot be referenced by files in the parent ggml directory:
./ggml/src/{files that need to reference the function}
./ggml/src/ggml-cpu

[ 18%] Linking CXX shared library ../bin/libllama.so
/usr/bin/ld: ../../bin/libggml-base.so: undefined reference to `ggml_cpu_get_page_size'

moved here:

--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -583,6 +583,14 @@ int ggml_threadpool_chunk_add(struct ggml_threadpool * tp, int value) {
 }
 
 #ifdef GGML_USE_NUMA_MIGRATE
+int ggml_cpu_page_size = 0;
+size_t ggml_cpu_get_page_size(void) {
+    if (ggml_cpu_page_size == 0) {
+        ggml_cpu_page_size = sysconf(_SC_PAGE_SIZE);
+    }
+    return ggml_cpu_page_size;
+}
+

Comment on lines +15 to +17
#ifdef GGML_USE_NUMA_MIGRATE
uint8_t * work_data_numa[GGML_NUMA_MIGRATE_NODES];
#endif
Member:

We have to find a way to avoid so many #ifdefs throughout the codebase. Not sure exactly how, but in this specific case, we could maybe wrap the work_data logic into a type and let the underlying implementation decide how many buffers to create, whether they should be NUMA-aware, etc.
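
One possible shape for that, as a rough sketch only (the names are illustrative, not the actual ggml types): wrap the work buffers in a small struct so the #ifdef lives in exactly one place, and call sites simply ask for the buffer belonging to their NUMA node.

```c
// Hedged sketch: a wrapper type that hides whether per-NUMA work buffers
// exist, so compute code does not need #ifdef GGML_USE_NUMA_MIGRATE.
#include <stddef.h>
#include <stdint.h>

struct ggml_work_data {
    uint8_t * data;      // default single work buffer
    size_t    size;
#ifdef GGML_USE_NUMA_MIGRATE
    uint8_t * data_numa[GGML_NUMA_MIGRATE_NODES];   // one buffer per NUMA node
#endif
};

// returns the buffer a thread on the given NUMA node should use
uint8_t * ggml_work_data_get(struct ggml_work_data * wd, int numa_node) {
#ifdef GGML_USE_NUMA_MIGRATE
    if (wd->data_numa[numa_node] != NULL) {
        return wd->data_numa[numa_node];
    }
#else
    (void) numa_node;
#endif
    return wd->data;
}
```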

Author:

Yeah, it's a valid concern; I'll need to find a better way.

Comment on lines +1918 to +1945
static void migrate_pages_with_cache(void *addr, size_t size,
                                     bool force_memset) {
    if (size >= GGML_NUMA_MIGRATE_NODES * ggml_backend_page_size) {
        numa_migrate_mapping_cache current_addr(addr, size);
        std::lock_guard<std::mutex> lock(ggml_mapping_mutex);
        auto it = ggml_mapping_cache.find(current_addr);
        if (it == ggml_mapping_cache.end()) {
            GGML_ASSERT(((uint64_t)(addr) & (ggml_backend_page_size - 1)) == 0);
            int num_pages =
                size / ggml_backend_page_size / GGML_NUMA_MIGRATE_NODES;
            if (num_pages && ((size % ggml_backend_page_size) == 0)) {
                if (force_memset) {
                    memset(addr, 0, size); // force to allocate memory
                }
                if (migrate_pages_multiple_nodes(addr, size) != 0) {
                    GGML_LOG_DEBUG("Migration to multiple nodes failed, addr: "
                                   "%p, size: %ld\n",
                                   addr, size);
                } else {
#ifdef GGML_USE_NUMA_MIGRATE_DEBUG
                    check_numa_pages_migration(addr, size);
#endif
                }
                ggml_mapping_cache.insert(current_addr);
            }
        }
    }
}
Member:

I'm not very familiar with NUMA. Could you explain in simple terms the logic that is implemented here and why it improves the performance?

Author:

Basically, in multi-NUMA-node systems, access to memory on the local NUMA node is faster than access to a remote node. The tensor buffer is allocated with malloc(), which is not NUMA-aware (it might land on any NUMA node, or be interleaved, depending on the kernel config). Since a single tensor buffer is computed by multiple threads, a multi-NUMA system sees lots of cross-NUMA memory accesses (for example, if the tensor buffer is allocated entirely on NUMA node 1, cores on NUMA node 0 access the whole tensor buffer in a cross-NUMA fashion, which is very bad).

The change here moves the pages of the tensor buffer evenly across the NUMA nodes and then sets the core affinity. By doing this, cores only need to access memory on their local NUMA node (preventing remote-node memory access), since the mul_mat op splits the tensor buffer based on ith (the core index). Hence there is a significant performance improvement on systems with large cross-NUMA memory access latency (which is the case for the aarch64 Neoverse platform; x86_64 platforms would also benefit from the change, though less significantly, since their cross-NUMA memory access latency is lower).
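
As a rough illustration of the thread-side half of this (a hypothetical helper, not the PR's code, assuming the thread count is a multiple of the node count), each thread can be bound to its NUMA node with libnuma and handed only the row range that is resident on that node:

```c
// Hedged sketch: bind the calling thread to its NUMA node and compute the row
// range it should process, assuming the tensor's pages were migrated so that
// node n holds the n-th contiguous chunk of rows. Names and the split policy
// are illustrative only; assumes nth is a multiple of NUMA_NODES.
#include <numa.h>   // numa_run_on_node(); link with -lnuma

#define NUMA_NODES 2   // assumption: matches GGML_NUMA_MIGRATE_NODES=2

// ith: thread index, nth: total threads, nrows: total rows of src0/dst
void local_row_range(int ith, int nth, int nrows, int * row0, int * row1) {
    const int threads_per_node = nth / NUMA_NODES;
    const int node      = ith / threads_per_node;   // NUMA node of this thread
    const int local_ith = ith % threads_per_node;   // thread index within the node

    // keep this thread (and its memory accesses) on its local node
    numa_run_on_node(node);

    // rows resident on this node (last node takes any remainder)
    const int rows_per_node = nrows / NUMA_NODES;
    const int node_row0 = node * rows_per_node;
    const int node_row1 = (node == NUMA_NODES - 1) ? nrows : node_row0 + rows_per_node;

    // split the node-local rows among the node's threads
    const int node_rows       = node_row1 - node_row0;
    const int rows_per_thread = (node_rows + threads_per_node - 1) / threads_per_node;

    *row0 = node_row0 + local_ith * rows_per_thread;
    *row1 = *row0 + rows_per_thread;
    if (*row1 > node_row1) { *row1 = node_row1; }
    if (*row0 > node_row1) { *row0 = node_row1; }
}
```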

@slaren (Member) commented Jun 18, 2025

> @slaren Do you have thoughts on this change?

The changes to ggml-backend.cpp are not good; this is not code that should be in the common backend interface. Generally, I think that to make merging this viable, it would need a lot more work to isolate the code of this feature to a single location, possibly as part of the CPU backend. If the ggml-backend interface or the code of the CPU backend is not flexible enough to implement it in that way, then it will require changing the interfaces to factor in the new requirements.

@wenlujon (Author) replied:

Yeah, that's a good point. Could you please take a look at whether there's a way to refactor the interfaces and then address the cross-NUMA memory access problem in a cleaner way? The change here is more or less a showcase that there is a significant performance improvement if the problem is well addressed.
