
Fix seastar::resource::allocate() error on EC2 m7gd.16xlarge instance #2624

Merged

merged 1 commit into scylladb:master from cpu_topology_unavailable on Jan 30, 2025

Conversation

syuu1228 (Contributor)

On the Fedora 41 AMI, on some aarch64 instances such as m7gd.16xlarge, Seastar programs such as Scylla fail to start up with the following error message:

```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assert occurred, and found that alloc_from_node() is failing because node->total_memory is 0. This is likely due to the hwloc initialization failure described above.

To avoid the error in such environments, we should stop using hwloc in resource.cc.
The hwloc initialization function does not return an error code even when the error message is printed, so we need to check whether the "topology" directory is available under /sys. Since resource.cc already has code paths for building Seastar without libhwloc, we can call them when the "topology" directory is not available.
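A minimal sketch of the detection described above, assuming a plain std::filesystem check (hypothetical helper name; the actual patch also handles the case where cpu0 is offline, as discussed below):

```cpp
#include <filesystem>

// hwloc only prints "failed to find sysfs cpu topology directory" and then
// carries on, so the availability check has to be done on our side.
static bool cpu_topology_available() {
    // Hypothetical check; the real patch probes the first *online* cpu
    // rather than hard-coding cpu0.
    return std::filesystem::exists("/sys/devices/system/cpu/cpu0/topology");
}
```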

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797

@denesb (Contributor) left a comment

I was asked to review so I did. I'm not familiar with this code, but from a high level it looks good to me.
Someone more familiar with this should also review.

@tchaikov (Contributor) left a comment

lgtm in general, in addition to the inlined comments:

  • could you please use a more specific prefix in the title of the commit message? like: "resource: "
  • and use a more specific title, like "fall back to single io group if hwloc fails to work".

BTW, you could even split the commit into two: one for moving the non-hwloc code up, the other for using it when hwloc fails to tell the CPU topology.

```cpp
// cannot receive error from the API.
// Therefore, we have to detect cpu topology availability in our code.
static bool is_hwloc_available() {
    const std::string cpux_properties[] = {
```
Contributor

nit, std::string_view would suffice.

Contributor

and better off referencing the related hwloc function, so that posterity understands why we are using this logic to determine whether hwloc is able to identify the CPU topology.

```cpp
}

// cpu0 might be offline, try to check first online cpu.
auto online = read_first_line_as<std::string>("/sys/devices/system/cpu/online");
```
Contributor

I noticed we could potentially use read_first_line_as<unsigned>("/sys/devices/system/cpu/online")
and handle the exception. If this approach wouldn't significantly simplify the implementation, feel free to keep the current solution.

@avikivity (Member)

I think this is too extreme. Here's hwloc-ls output on m7gd.16xlarge:

```
$ hwloc-ls
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Machine
  NUMANode L#0 (P#0)
  PU L#0 (P#0)
  PU L#1 (P#1)
  PU L#2 (P#2)
  PU L#3 (P#3)
  PU L#4 (P#4)
  PU L#5 (P#5)
  PU L#6 (P#6)
  PU L#7 (P#7)
  PU L#8 (P#8)
  PU L#9 (P#9)
  PU L#10 (P#10)
  PU L#11 (P#11)
  PU L#12 (P#12)
  PU L#13 (P#13)
  PU L#14 (P#14)
  PU L#15 (P#15)
  PU L#16 (P#16)
  PU L#17 (P#17)
  PU L#18 (P#18)
  PU L#19 (P#19)
  PU L#20 (P#20)
  PU L#21 (P#21)
  PU L#22 (P#22)
  PU L#23 (P#23)
  PU L#24 (P#24)
  PU L#25 (P#25)
  PU L#26 (P#26)
  PU L#27 (P#27)
  PU L#28 (P#28)
  PU L#29 (P#29)
  PU L#30 (P#30)
  PU L#31 (P#31)
  PU L#32 (P#32)
  PU L#33 (P#33)
  PU L#34 (P#34)
  PU L#35 (P#35)
  PU L#36 (P#36)
  PU L#37 (P#37)
  PU L#38 (P#38)
  PU L#39 (P#39)
  PU L#40 (P#40)
  PU L#41 (P#41)
  PU L#42 (P#42)
  PU L#43 (P#43)
  PU L#44 (P#44)
  PU L#45 (P#45)
  PU L#46 (P#46)
  PU L#47 (P#47)
  PU L#48 (P#48)
  PU L#49 (P#49)
  PU L#50 (P#50)
  PU L#51 (P#51)
  PU L#52 (P#52)
  PU L#53 (P#53)
  PU L#54 (P#54)
  PU L#55 (P#55)
  PU L#56 (P#56)
  PU L#57 (P#57)
  PU L#58 (P#58)
  PU L#59 (P#59)
  PU L#60 (P#60)
  PU L#61 (P#61)
  PU L#62 (P#62)
  PU L#63 (P#63)
  HostBridge
    PCI 00:04.0 (NVMExp)
      Block(Disk) "nvme1n1"
    PCI 00:05.0 (Ethernet)
      Net "eth0"
    PCI 00:1e.0 (NVMExp)
      Block(Disk) "nvme0n1"
    PCI 00:1f.0 (NVMExp)
      Block(Disk) "nvme2n1"
```

So it aborted linux discovery, but was still able to keep going. Maybe hwloc-ls is telling hwloc to fall back to alternative methods if needed, and we are not.

@avikivity (Member)

I think the problem is that hwloc doesn't report the memory as belonging to any NUMA nodes (on a normal machine the NUMA nodes have memory counts).

```cpp
    // Divide local memory to cpus
    for (auto&& cs : cpu_sets()) {
        auto cpu_id = hwloc_bitmap_first(cs);
        assert(cpu_id != -1);
        auto node = cpu_to_node.at(cpu_id);
        cpu this_cpu;
        this_cpu.cpu_id = cpu_id;
        size_t remain = mem_per_proc - alloc_from_node(this_cpu, node, topo_used_mem, mem_per_proc);

        remains.emplace_back(std::move(this_cpu), remain);
    }

    // Divide the rest of the memory
    auto depth = hwloc_get_type_or_above_depth(topology, HWLOC_OBJ_NUMANODE);
    for (auto&& [this_cpu, remain] : remains) {
        auto node = cpu_to_node.at(this_cpu.cpu_id);
        auto obj = node;

        while (remain) {
            remain -= alloc_from_node(this_cpu, obj, topo_used_mem, remain);
            do {
                obj = hwloc_get_next_obj_by_depth(topology, depth, obj);
            } while (!obj);
            if (obj == node)
                break;
        }
        assert(!remain);
        ret.cpus.push_back(std::move(this_cpu));
    }
```

We need to add a third loop that allocates non-NUMA memory if the second loop fails.
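A hypothetical sketch of such a third loop, for illustration only (the fix that was ultimately merged took a different route and fixed up the reported memory sizes instead). It assumes alloc_from_node() can draw from the machine root object, and that the push_back moves here from the second loop:

```cpp
    // Third pass: memory that hwloc did not attribute to any NUMA node is
    // allocated from the machine root object instead.
    auto machine = hwloc_get_root_obj(topology);
    for (auto&& [this_cpu, remain] : remains) {
        if (remain) {
            remain -= alloc_from_node(this_cpu, machine, topo_used_mem, remain);
        }
        assert(!remain);
        ret.cpus.push_back(std::move(this_cpu));
    }
```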

@avikivity (Member)

Bad NUMA detection:

  NUMANode L#0 (P#0)

Good NUMA detection:

    NUMANode L#0 (P#0 126GB)

So we just need to detect the case where the detected NUMA memory (here 0) is less than the memory we want to allocate, and treat that case too.

@syuu1228 force-pushed the cpu_topology_unavailable branch from 7e81ac7 to 4f14ad6 on January 24, 2025 at 14:48
@syuu1228 (Contributor Author)

@avikivity I completely rewrote the patch; now it does not use the non-hwloc functions, it fixes up the memory size on the hwloc objects instead.
How about this?

4f14ad6

@avikivity (Member)

Looks good, I'll wait for @tchaikov to review too.

@syuu1228 (Contributor Author)

@tchaikov Could you review the patch again?

@tchaikov (Contributor) left a comment

i am sorry for the latency.

left a few comments.

```
@@ -305,13 +305,26 @@ size_t div_roundup(size_t num, size_t denom) {
return (num + denom - 1) / denom;
}

static size_t alloc_from_node(cpu& this_cpu, hwloc_obj_t node, std::unordered_map<hwloc_obj_t, size_t>& used_mem, size_t alloc) {
static inline hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node) {
```
Contributor

i am interested in why you added the inline specifier here? not because i am against this, just because i don't understand why it is necessary.

Contributor Author

Dropped inline since it is not necessary.

```cpp
// This code does not assume that there are multiple nodes,
// but when this 'if' condition is met, hwloc fails to detect
// the hardware configuration and is expected to operate as
// a single node configuration, so it should work correctly.
```
@tchaikov (Contributor) · Jan 27, 2025

probably could improve this comment a little bit (just for better readability, and assuming the single-node configuration does indeed apply):

```cpp
// If hwloc fails to detect the hardware topology, it falls back to treating
// the system as a single-node configuration. While this code supports
// multi-node setups, the fallback behavior is safe and will function
// correctly in this case.
```

Contributor Author

Merged the new comment.

```
@@ -305,13 +305,26 @@ size_t div_roundup(size_t num, size_t denom) {
return (num + denom - 1) / denom;
}

static size_t alloc_from_node(cpu& this_cpu, hwloc_obj_t node, std::unordered_map<hwloc_obj_t, size_t>& used_mem, size_t alloc) {
static inline hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node) {
```
Contributor

better off renaming this function to get_memory_of_node(), so that we can reuse it for retrieving the total size of memory of this machine, where the node is an object of HWLOC_OBJ_MACHINE; also, because we retrieve total_memory when HWLOC_API_VERSION >= 0x00020000.

Contributor Author

Applied this idea; common functions are now used for both machine and node.
The new function names are get_memory_from_hwloc_obj() and set_memory_from_hwloc_obj().
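A minimal sketch of what such shared helpers could look like, assuming the HWLOC_API_VERSION guard pattern visible in the hunk below (the authoritative definitions are in the merged commit):

```cpp
static hwloc_uint64_t get_memory_from_hwloc_obj(hwloc_obj_t obj) {
#if HWLOC_API_VERSION >= 0x00020000
    // hwloc 2.x keeps the accumulated memory size directly on the object
    return obj->total_memory;
#else
    // hwloc 1.x keeps it in the nested memory struct
    return obj->memory.total_memory;
#endif
}

static void set_memory_from_hwloc_obj(hwloc_obj_t obj, hwloc_uint64_t memory) {
#if HWLOC_API_VERSION >= 0x00020000
    obj->total_memory = memory;
#else
    obj->memory.total_memory = memory;
#endif
}
```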

```
@@ -579,6 +592,10 @@ resources allocate(configuration& c) {
#else
auto available_memory = machine->memory.total_memory;
#endif
if (!available_memory) {
```
Contributor

could reuse get_local_memory_from_node() for retrieving the size of memory owned by the machine.

Contributor Author

Applied

```cpp
// but when this 'if' condition is met, hwloc fails to detect
// the hardware configuration and is expected to operate as
// a single node configuration, so it should work correctly.
set_local_memory_to_node(node, available_memory);
```
@tchaikov (Contributor) · Jan 27, 2025

i am confused. you're assigning the total size of memory of the whole machine to a NUMA node. on my machine, it is 32GiB. assuming my machine has two NUMA nodes and the OS is NUMA-aware, with your algorithm, we will be splitting 64GiB of memory across all the processing units, am i right?

how do we assure that there is only a single NUMA node here ? is this guaranteed by the algorithm or is it based on our observation ? can we verify this in the code, and at least print out an error message if this assumption is violated?

Contributor Author

Because when hwloc hits the "hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery." error, it does not proceed with the rest of the HW detection code, so it has no NUMA node information, memory information, etc.
see: https://github.com/open-mpi/hwloc/blob/09f88977dd85b01b2e4915a48c95a90cea74754c/hwloc/topology-linux.c#L5987

I verified this on m8g.metal-48xl, which has 2 NUMA nodes.
On m8g.metal-48xl, the cpu topology issue does not occur, so I emulated the issue by hiding the files under the topology directory, like this:

```
for i in /sys/devices/system/cpu/cpu*/topology;do sudo mount -t ramfs none $i;done
```

Before mounting the ramfs, Seastar's output looks like this:

```
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU0 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU1 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU2 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU3 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU4 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU5 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU6 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU7 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU8 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU9 to NUMA0

...


DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU100 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU101 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU102 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU103 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU104 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU105 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU106 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU107 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU108 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU109 to NUMA1

...

DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU191 to NUMA1
```

So hwloc reported 2 NUMA nodes.

After mounting the ramfs, hwloc reports a single NUMA node, like this:

```
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU0 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU1 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU2 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU3 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU4 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU5 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU6 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU7 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU8 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU9 to NUMA0

...


DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU100 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU101 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU102 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU103 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU104 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU105 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU106 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU107 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU108 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU109 to NUMA0

...

DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU191 to NUMA0
```

So we can say that when the issue occurs, there should be only a single NUMA node.

> can we verify this in the code, and at least print out an error message if this assumption is violated?

I added assert(num_nodes == 1); for safety.
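A rough sketch of where that assertion could sit, with names assumed from the hunks above (hwloc_get_nbobjs_by_type is the stock hwloc API; this is illustrative, not the exact merged code):

```cpp
    if (!available_memory) {
        // hwloc aborted linux discovery, so it can only have synthesized a
        // single NUMA node; fail loudly if that assumption is ever violated.
        int num_nodes = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
        assert(num_nodes == 1);
        // fall back to sysconf, as the non-hwloc allocate() does
        available_memory = sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES);
    }
```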

Contributor

thank you! it helps to convince us it's indeed safe.

@syuu1228 force-pushed the cpu_topology_unavailable branch from 4f14ad6 to 662bc23 on January 30, 2025 at 10:49
…hwloc

On the Fedora 41 AMI, on some aarch64 instances such as m7gd.16xlarge,
Seastar programs such as Scylla fail to start up with the following
error message:
```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems hwloc fails to initialize because
/sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assert occurred,
and found that alloc_from_node() is failing because node->total_memory
is 0. This is likely due to the hwloc initialization failure described
above.

I also found that calculate_memory() goes wrong since
machine->total_memory is also 0.

To avoid the error in such environments, we need to fix up the memory
size in both machine->total_memory and node->total_memory. We can use
sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES) for this, just like we
do in the non-hwloc version of allocate().

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797
@syuu1228 force-pushed the cpu_topology_unavailable branch from 662bc23 to 63d95e9 on January 30, 2025 at 11:10
@tchaikov (Contributor) left a comment

lgtm

@yaronkaikov

@avikivity Can we now merge it? We need it for returning to ASG for aarch64.

@avikivity avikivity merged commit b0a9f89 into scylladb:master Jan 30, 2025
15 checks passed
@syuu1228 (Contributor Author)

@yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.

@yaronkaikov

> @yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.

I think it's because we didn't make any updates since we branched. IIUC we should now create the branch and do the backport, right @avikivity @denesb?

@avikivity (Member)

> @yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.

> I think it's because we didn't make any updates since we branched. IIUC we should now create the branch and do the backport, right @avikivity @denesb?

Right.

@yaronkaikov

@denesb @avikivity Can you create the new branch and backport this change? We need it in all releases so we can switch back to ASG for aarch64.

@yaronkaikov

We need the same for 2025.1

@denesb (Contributor) commented Jan 31, 2025

@yaronkaikov @syuu1228 this discussion doesn't really belong in the seastar repository, but since it was started here already... I created branch-2025.1, branch-6.2 and branch-6.1 in scylla-seastar. Please open backport PRs and I'll merge them.

@avikivity (Member)

I queued commits that change .gitmodules to point to scylla-seastar.git.


Successfully merging this pull request may close these issues.

Scylla fails to startup on Fedora aarch64 AMI