
ggml: introduce GGML_NUMA_MIGRATE to optimize cross NUMA op computation #14232


Open · wants to merge 1 commit into master

Conversation

@wenlujon commented Jun 17, 2025

This PR adds a GGML_NUMA_MIGRATE feature to optimize cross-NUMA op computation, since cross-NUMA memory access becomes the bottleneck when threads are spawned across NUMA nodes:

  1. optimize ggml_barrier() for the cross-NUMA case by adding ggml_barrier_numa_aware()
  2. add a build option, GGML_NUMA_MIGRATE, to enable the feature
  3. add a runtime option, --numa migrate, available when GGML_NUMA_MIGRATE is enabled. It migrates pages across NUMA nodes so that the mul_mat op only computes on the local-NUMA portion of the src0 and dst tensor data (selected by the thread index ith); cores are pinned to their NUMA node when the option is set, and src1 data is quantized into the local-NUMA wdata. A minimal sketch of the page-migration idea follows this list.
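
To illustrate the page-migration part in isolation, here is a minimal, hypothetical sketch (not the PR's actual code) of spreading the pages of a page-aligned buffer evenly across two NUMA nodes with libnuma's numa_move_pages(); the helper name and the fixed NUMA_NODES constant are assumptions for the example:

```c
// Hypothetical sketch, not the PR's implementation: spread the pages of a
// page-aligned buffer evenly across NUMA_NODES nodes using libnuma.
// Build with: gcc -O2 -c numa_spread.c && link with -lnuma
#include <numa.h>      // numa_move_pages()
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUMA_NODES 2   // assumption: mirrors the default GGML_NUMA_MIGRATE_NODES=2

int spread_pages_across_nodes(void * addr, size_t size) {
    const size_t page_size = (size_t) sysconf(_SC_PAGE_SIZE);
    const size_t n_pages   = size / page_size;

    // the buffer must be page-aligned and cover whole pages
    if (n_pages == 0 || ((uintptr_t) addr % page_size) != 0 || (size % page_size) != 0) {
        return -1;
    }

    void ** pages  = malloc(n_pages * sizeof(void *));
    int   * nodes  = malloc(n_pages * sizeof(int));
    int   * status = malloc(n_pages * sizeof(int));

    // contiguous split: the first half of the pages goes to node 0, the second half to node 1
    const size_t pages_per_node = n_pages / NUMA_NODES;
    for (size_t i = 0; i < n_pages; i++) {
        pages[i] = (char *) addr + i * page_size;
        size_t node = pages_per_node ? i / pages_per_node : 0;
        nodes[i] = (int) (node < NUMA_NODES ? node : NUMA_NODES - 1);
    }

    // pid 0 means "the calling process"
    int ret = numa_move_pages(0, n_pages, pages, nodes, status, 0);
    if (ret != 0) {
        fprintf(stderr, "numa_move_pages failed or migrated pages only partially (ret=%d)\n", ret);
    }

    free(pages);
    free(nodes);
    free(status);
    return ret;
}
```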

With the feature enabled, tested on a Neoverse N2 system with multiple NUMA nodes (64 cores per NUMA node), there are clear performance improvements when running a Llama 3 model; see the test results below:

llama-bench Test Results

Base data

running command:

numactl -m 0 -N 0 $LLAMA/build/bin/llama-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -r 1 -t 64 -p 512 -n 128

results:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |      64 |           pp512 |        505.06 ± 0.00 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |      64 |           tg128 |         25.43 ± 0.00 |
 
build: e434e691 (5686)

Results with the feature

running command:

numactl -m 0,1 -N 0,1 $LLAMA/build/bin/llama-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -r 1 -t 128 -p 512 -n 128 --numa migrate

results:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |     128 |           pp512 |        761.85 ± 0.00 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |     128 |           tg128 |         33.76 ± 0.00 |
 
build: e434e691 (5686) 

improvement:


TC | Base (t/s) | OPT (t/s) | Uplift
-- | -- | -- | --
pp512 | 505.06 | 761.85 | 50.8%
tg128 | 25.43 | 33.76 | 32.7%


llama-batched-bench Test Results

Base data

running command:

numactl -m 0 -N 0 $LLAMA/build/bin/llama-batched-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --cache-type-k q8_0 --cache-type-v q8_0 -fa

results:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |    128 |    1 |    256 |    0.268 |   477.19 |    5.047 |    25.36 |    5.315 |    48.17 |
|   128 |    128 |    4 |   1024 |    0.940 |   544.73 |    5.721 |    89.50 |    6.661 |   153.74 |
|   128 |    128 |    8 |   2048 |    1.839 |   556.85 |    8.108 |   126.29 |    9.947 |   205.88 |
|   128 |    128 |   16 |   4096 |    3.689 |   555.11 |    8.973 |   228.23 |   12.663 |   323.47 |
|   128 |    128 |   32 |   8192 |    7.475 |   547.94 |   14.847 |   275.89 |   22.322 |   366.99 |


Results with the feature

running command:

numactl -m 0,1 -N 0,1 $LLAMA/build/bin/llama-batched-bench -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 128 --numa migrate --cache-type-k q8_0 --cache-type-v q8_0  -fa

results:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |    128 |    1 |    256 |    0.165 |   777.10 |    3.628 |    35.28 |    3.793 |    67.49 |
|   128 |    128 |    4 |   1024 |    0.556 |   921.39 |    4.312 |   118.73 |    4.868 |   210.35 |
|   128 |    128 |    8 |   2048 |    1.057 |   968.92 |    5.529 |   185.20 |    6.586 |   310.96 |
|   128 |    128 |   16 |   4096 |    2.133 |   960.04 |    6.414 |   319.30 |    8.547 |   479.22 |
|   128 |    128 |   32 |   8192 |    4.311 |   950.20 |   10.013 |   409.07 |   14.324 |   571.92 |

improvement:

B | Base S_TG (t/s) | OPT S_TG (t/s) | Uplift
-- | -- | -- | --
1 | 25.36 | 35.28 | 39.1%
4 | 89.50 | 118.73 | 32.6%
8 | 126.29 | 185.20 | 46.6%
16 | 228.23 | 319.30 | 39.9%
32 | 275.89 | 409.07 | 48.2%

Perplexity was also tested; there is no regression with the feature:

$ numactl -m 0,1 -N 0,1 $LLAMA/build/bin/llama-perplexity -m ./models/Meta-Llama-3-8B-Instruct.Q4_0.gguf -f $LLAMA/wikitext-2-raw/wiki.test.raw -t 128 --numa migrate

Final estimate: PPL = 8.7547 +/- 0.06483


Enablement

The feature is disabled by default; here is the guidance to enable it:

  1. enable GGML_NUMA_MIGRATE during cmake:
$ cmake -B build -DGGML_NUMA_MIGRATE=ON
  2. add the --numa migrate option when running, and set the thread count according to the number of NUMA nodes. The default is 2 nodes, which was the best trade-off on the Neoverse N2 platform used for testing; it can be changed via the GGML_NUMA_MIGRATE_NODES build option. It is also better to bind llama.cpp to NUMA nodes 0 and 1 via numactl -m 0,1 -N 0,1.

@github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 17, 2025
@wenlujon (Author) commented:

@ggerganov @xctan Could you please take a look at the PR?

@ggerganov (Member) left a review comment:

Interesting results. But overall I am not sure what the best way to integrate this is. As it is, there are too many ifdefs, which would make maintenance difficult.

Can we somehow split the PR into smaller steps? For example, these changes require different tensor and buffer alignment. Should we first design a mechanism to adjust these alignments dynamically, so that once we have it, the NUMA changes could use this new functionality to set the proper alignment? Not sure.

@slaren Do you have thoughts on this change?

@@ -152,6 +152,11 @@ set(GGML_BLAS_VENDOR ${GGML_BLAS_VENDOR_DEFAULT} CACHE STRING
"ggml: BLAS library vendor")
option(GGML_LLAMAFILE "ggml: use LLAMAFILE" ${GGML_LLAMAFILE_DEFAULT})

option(GGML_NUMA_MIGRATE "ggml: use NUMA_MIGRATE" OFF)
Member:

I think this should become GGML_CPU_NUMA. AFAIU, these changes require linking libnuma, is that correct?

Author:

Yeah, I'm OK with the name change since it's targeted at the CPU. :)

option(GGML_NUMA_MIGRATE "ggml: use NUMA_MIGRATE" OFF)
set(GGML_NUMA_MIGRATE_NODES "2" CACHE STRING
"ggml: the number of NUMA nodes during page migration")
option(GGML_NUMA_MIGRATE_DEBUG "ggml: enable debugging of NUMA_MIGRATE" OFF)
Member:

Avoid this by checking an environment variable instead. See GGML_SCHED_DEBUG for an example.
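
For reference, a minimal sketch of that pattern (reading the flag from a hypothetical GGML_NUMA_MIGRATE_DEBUG environment variable, by analogy with how GGML_SCHED_DEBUG is checked):

```c
// Hedged sketch of the suggested pattern: read the debug flag once from the
// environment instead of adding a CMake option. The variable name is an
// assumption for illustration.
#include <stdlib.h>

int ggml_numa_migrate_debug(void) {
    static int debug = -1;                 // -1 = not read yet
    if (debug < 0) {
        const char * env = getenv("GGML_NUMA_MIGRATE_DEBUG");
        debug = env ? atoi(env) : 0;
    }
    return debug;
}
```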

Comment on lines +351 to +353
#ifdef GGML_USE_NUMA_MIGRATE
GGML_API size_t ggml_backend_get_page_size(void);
#endif
Member:

This seems like it should be moved to ggml-cpu-impl.h.

Author:

OK

Author:

I tried to move the function definition from ggml-backend.cpp to ggml-cpu.c, but there are lots of link issues, since ggml-cpu.c lives in the ggml-cpu subdirectory and cannot be referenced by files in the parent ggml directory:
./ggml/src/{files that need to reference the function}
./ggml/src/ggml-cpu

[ 18%] Linking CXX shared library ../bin/libllama.so
/usr/bin/ld: ../../bin/libggml-base.so: undefined reference to `ggml_cpu_get_page_size'

moved here:

--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -583,6 +583,14 @@ int ggml_threadpool_chunk_add(struct ggml_threadpool * tp, int value) {
 }
 
 #ifdef GGML_USE_NUMA_MIGRATE
+int ggml_cpu_page_size = 0;
+size_t ggml_cpu_get_page_size(void) {
+    if (ggml_cpu_page_size == 0) {
+        ggml_cpu_page_size = sysconf(_SC_PAGE_SIZE);
+    }
+    return ggml_cpu_page_size;
+}
+

Comment on lines +15 to +17
#ifdef GGML_USE_NUMA_MIGRATE
uint8_t * work_data_numa[GGML_NUMA_MIGRATE_NODES];
#endif
Member:

We have to find a way to avoid so many #ifdefs throughout the codebase. Not sure exactly how, but in this specific case, we could maybe wrap the work_data logic into a type and let the underlying implementation decide how many buffers to create, whether they should be NUMA-aware, etc.
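
One possible shape for that, as a rough sketch only (the names are illustrative, not the actual ggml types): wrap the work buffers in a small struct so the #ifdef lives in exactly one place, and call sites simply ask for the buffer belonging to their NUMA node.

```c
// Hedged sketch: a wrapper type that hides whether per-NUMA work buffers
// exist, so compute code does not need #ifdef GGML_USE_NUMA_MIGRATE.
#include <stddef.h>
#include <stdint.h>

struct ggml_work_data {
    uint8_t * data;      // default single work buffer
    size_t    size;
#ifdef GGML_USE_NUMA_MIGRATE
    uint8_t * data_numa[GGML_NUMA_MIGRATE_NODES];   // one buffer per NUMA node
#endif
};

// returns the buffer a thread on the given NUMA node should use
uint8_t * ggml_work_data_get(struct ggml_work_data * wd, int numa_node) {
#ifdef GGML_USE_NUMA_MIGRATE
    if (wd->data_numa[numa_node] != NULL) {
        return wd->data_numa[numa_node];
    }
#else
    (void) numa_node;
#endif
    return wd->data;
}
```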

Author:

Yeah, it's a valid concern; I'll need to find a better way.

Comment on lines +1918 to +1945
static void migrate_pages_with_cache(void *addr, size_t size,
                                     bool force_memset) {
    if (size >= GGML_NUMA_MIGRATE_NODES * ggml_backend_page_size) {
        numa_migrate_mapping_cache current_addr(addr, size);
        std::lock_guard<std::mutex> lock(ggml_mapping_mutex);
        auto it = ggml_mapping_cache.find(current_addr);
        if (it == ggml_mapping_cache.end()) {
            GGML_ASSERT(((uint64_t)(addr) & (ggml_backend_page_size - 1)) == 0);
            int num_pages =
                size / ggml_backend_page_size / GGML_NUMA_MIGRATE_NODES;
            if (num_pages && ((size % ggml_backend_page_size) == 0)) {
                if (force_memset) {
                    memset(addr, 0, size); // force to allocate memory
                }
                if (migrate_pages_multiple_nodes(addr, size) != 0) {
                    GGML_LOG_DEBUG("Migration to multiple nodes failed, addr: "
                                   "%p, size: %ld\n",
                                   addr, size);
                } else {
#ifdef GGML_USE_NUMA_MIGRATE_DEBUG
                    check_numa_pages_migration(addr, size);
#endif
                }
                ggml_mapping_cache.insert(current_addr);
            }
        }
    }
}
Member:

I'm not very familiar with NUMA. Could you explain in simple terms the logic that is implemented here and why it improves the performance?

Author:

Basically, in multi-NUMA-node systems, access to memory on the local NUMA node is faster than access to a remote node. The tensor buffer is allocated with malloc(), which is not NUMA-aware (it might land on any NUMA node, or be interleaved, depending on the kernel config). Since a single tensor buffer is computed by multiple threads, a multi-NUMA system sees lots of cross-NUMA memory accesses (for example, if the tensor buffer is allocated entirely on NUMA node 1, cores on NUMA node 0 access the whole tensor buffer in a cross-NUMA fashion, which is very bad).

The change here moves the pages of the tensor buffer evenly across the NUMA nodes and then sets the core affinity. By doing this, cores only need to access memory on their local NUMA node (preventing remote-node memory access), since the mul_mat op splits the tensor buffer based on ith (the core index). Hence there is a significant performance improvement on systems with large cross-NUMA memory access latency (which is the case for the aarch64 Neoverse platform; x86_64 platforms would also benefit from the change, though less significantly, since their cross-NUMA memory access latency is lower).
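
As a rough illustration of the thread-side half of this (a hypothetical helper, not the PR's code, assuming the thread count is a multiple of the node count), each thread can be bound to its NUMA node with libnuma and handed only the row range that is resident on that node:

```c
// Hedged sketch: bind the calling thread to its NUMA node and compute the row
// range it should process, assuming the tensor's pages were migrated so that
// node n holds the n-th contiguous chunk of rows. Names and the split policy
// are illustrative only; assumes nth is a multiple of NUMA_NODES.
#include <numa.h>   // numa_run_on_node(); link with -lnuma

#define NUMA_NODES 2   // assumption: matches GGML_NUMA_MIGRATE_NODES=2

// ith: thread index, nth: total threads, nrows: total rows of src0/dst
void local_row_range(int ith, int nth, int nrows, int * row0, int * row1) {
    const int threads_per_node = nth / NUMA_NODES;
    const int node      = ith / threads_per_node;   // NUMA node of this thread
    const int local_ith = ith % threads_per_node;   // thread index within the node

    // keep this thread (and its memory accesses) on its local node
    numa_run_on_node(node);

    // rows resident on this node (last node takes any remainder)
    const int rows_per_node = nrows / NUMA_NODES;
    const int node_row0 = node * rows_per_node;
    const int node_row1 = (node == NUMA_NODES - 1) ? nrows : node_row0 + rows_per_node;

    // split the node-local rows among the node's threads
    const int node_rows       = node_row1 - node_row0;
    const int rows_per_thread = (node_rows + threads_per_node - 1) / threads_per_node;

    *row0 = node_row0 + local_ith * rows_per_thread;
    *row1 = *row0 + rows_per_thread;
    if (*row1 > node_row1) { *row1 = node_row1; }
    if (*row0 > node_row1) { *row0 = node_row1; }
}
```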

@slaren (Member) commented Jun 18, 2025

> @slaren Do you have thoughts on this change?

The changes to ggml-backend.cpp are not good; this is not code that should be in the common backend interface. Generally, I think that to make merging this viable, it would need a lot more work to isolate the code of this feature to a single location, possibly as part of the CPU backend. If the ggml-backend interface or the code of the CPU backend is not flexible enough to implement it in that way, then it will require changing the interfaces to factor in the new requirements.

@wenlujon (Author) replied:

Yeah, that's a good point. Could you please take a look at whether there's a way to refactor the interfaces and then address the cross-NUMA memory access problem in a cleaner way? The change here is more or less a showcase that there is a significant performance improvement if the problem is well addressed.
