
Fix seastar::resource::allocate() error on EC2 m7gd.16xlarge instance #2624

Merged

merged 1 commit into scylladb:master from cpu_topology_unavailable on Jan 30, 2025

Conversation

syuu1228 (Contributor)

On the Fedora 41 AMI, on some aarch64 instances such as m7gd.16xlarge, Seastar programs such as Scylla fail to start up with the following error message:

```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assert occurred, and found that alloc_from_node() is failing because node->total_memory is 0. This is likely due to the hwloc initialization failure described above.

To avoid the error in such environments, we should stop using hwloc in resource.cc.
The hwloc initialization function does not return an error code even when the error message is printed, so we need to check whether the "topology" directory is available under /sys. Since resource.cc already has code paths for building Seastar without libhwloc, we can call them when the "topology" directory is not available.
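A minimal sketch of the detection described above, assuming a plain std::filesystem check (hypothetical helper name; the actual patch also handles the case where cpu0 is offline, as discussed below):

```cpp
#include <filesystem>

// hwloc only prints "failed to find sysfs cpu topology directory" and then
// carries on, so the availability check has to be done on our side.
static bool cpu_topology_available() {
    // Hypothetical check; the real patch probes the first *online* cpu
    // rather than hard-coding cpu0.
    return std::filesystem::exists("/sys/devices/system/cpu/cpu0/topology");
}
```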

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797

@denesb (Contributor) left a comment

I was asked to review so I did. I'm not familiar with this code, but from a high level it looks good to me.
Someone more familiar with this should also review.

@tchaikov (Contributor) left a comment

lgtm in general, in addition to the inlined comments:

  • could you please use a more specific prefix in the title of the commit message? like: "resource: "
  • and use a more specific title, like "fall back to single io group if hwloc fails to work".

BTW, you could even split the commit into two: one for moving the non-hwloc code up, the other for using it when hwloc fails to tell the CPU topology.

```cpp
// cannot receive error from the API.
// Therefore, we have to detect cpu topology availability in our code.
static bool is_hwloc_available() {
    const std::string cpux_properties[] = {
```
Contributor

nit, std::string_view would suffice.

Contributor

and better off referencing the related hwloc function, so that posterity understands why we are using this logic to determine whether hwloc is able to identify the CPU topology.

```cpp
}

// cpu0 might be offline, try to check first online cpu.
auto online = read_first_line_as<std::string>("/sys/devices/system/cpu/online");
```
Contributor

I noticed we could potentially use read_first_line_as<unsigned>("/sys/devices/system/cpu/online")
and handle the exception. If this approach wouldn't significantly simplify the implementation, feel free to keep the current solution.

@avikivity (Member)

I think this is too extreme. Here's hwloc-ls output on m7gd.16xlarge:

```
$ hwloc-ls
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Machine
  NUMANode L#0 (P#0)
  PU L#0 (P#0)
  PU L#1 (P#1)
  PU L#2 (P#2)
  PU L#3 (P#3)
  PU L#4 (P#4)
  PU L#5 (P#5)
  PU L#6 (P#6)
  PU L#7 (P#7)
  PU L#8 (P#8)
  PU L#9 (P#9)
  PU L#10 (P#10)
  PU L#11 (P#11)
  PU L#12 (P#12)
  PU L#13 (P#13)
  PU L#14 (P#14)
  PU L#15 (P#15)
  PU L#16 (P#16)
  PU L#17 (P#17)
  PU L#18 (P#18)
  PU L#19 (P#19)
  PU L#20 (P#20)
  PU L#21 (P#21)
  PU L#22 (P#22)
  PU L#23 (P#23)
  PU L#24 (P#24)
  PU L#25 (P#25)
  PU L#26 (P#26)
  PU L#27 (P#27)
  PU L#28 (P#28)
  PU L#29 (P#29)
  PU L#30 (P#30)
  PU L#31 (P#31)
  PU L#32 (P#32)
  PU L#33 (P#33)
  PU L#34 (P#34)
  PU L#35 (P#35)
  PU L#36 (P#36)
  PU L#37 (P#37)
  PU L#38 (P#38)
  PU L#39 (P#39)
  PU L#40 (P#40)
  PU L#41 (P#41)
  PU L#42 (P#42)
  PU L#43 (P#43)
  PU L#44 (P#44)
  PU L#45 (P#45)
  PU L#46 (P#46)
  PU L#47 (P#47)
  PU L#48 (P#48)
  PU L#49 (P#49)
  PU L#50 (P#50)
  PU L#51 (P#51)
  PU L#52 (P#52)
  PU L#53 (P#53)
  PU L#54 (P#54)
  PU L#55 (P#55)
  PU L#56 (P#56)
  PU L#57 (P#57)
  PU L#58 (P#58)
  PU L#59 (P#59)
  PU L#60 (P#60)
  PU L#61 (P#61)
  PU L#62 (P#62)
  PU L#63 (P#63)
  HostBridge
    PCI 00:04.0 (NVMExp)
      Block(Disk) "nvme1n1"
    PCI 00:05.0 (Ethernet)
      Net "eth0"
    PCI 00:1e.0 (NVMExp)
      Block(Disk) "nvme0n1"
    PCI 00:1f.0 (NVMExp)
      Block(Disk) "nvme2n1"
```

So it aborted linux discovery, but was still able to keep going. Maybe hwloc-ls is telling hwloc to fall back to alternative methods if needed, and we are not.

@avikivity (Member)

I think the problem is that hwloc doesn't report the memory as belonging to any NUMA nodes (on a normal machine the NUMA nodes have memory counts).

```cpp
    // Divide local memory to cpus
    for (auto&& cs : cpu_sets()) {
        auto cpu_id = hwloc_bitmap_first(cs);
        assert(cpu_id != -1);
        auto node = cpu_to_node.at(cpu_id);
        cpu this_cpu;
        this_cpu.cpu_id = cpu_id;
        size_t remain = mem_per_proc - alloc_from_node(this_cpu, node, topo_used_mem, mem_per_proc);

        remains.emplace_back(std::move(this_cpu), remain);
    }

    // Divide the rest of the memory
    auto depth = hwloc_get_type_or_above_depth(topology, HWLOC_OBJ_NUMANODE);
    for (auto&& [this_cpu, remain] : remains) {
        auto node = cpu_to_node.at(this_cpu.cpu_id);
        auto obj = node;

        while (remain) {
            remain -= alloc_from_node(this_cpu, obj, topo_used_mem, remain);
            do {
                obj = hwloc_get_next_obj_by_depth(topology, depth, obj);
            } while (!obj);
            if (obj == node)
                break;
        }
        assert(!remain);
        ret.cpus.push_back(std::move(this_cpu));
    }
```

We need to add a third loop that allocates non-NUMA memory if the second loop fails.
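A hypothetical sketch of such a third loop, for illustration only (the fix that was ultimately merged took a different route and fixed up the reported memory sizes instead). It assumes alloc_from_node() can draw from the machine root object, and that the push_back moves here from the second loop:

```cpp
    // Third pass: memory that hwloc did not attribute to any NUMA node is
    // allocated from the machine root object instead.
    auto machine = hwloc_get_root_obj(topology);
    for (auto&& [this_cpu, remain] : remains) {
        if (remain) {
            remain -= alloc_from_node(this_cpu, machine, topo_used_mem, remain);
        }
        assert(!remain);
        ret.cpus.push_back(std::move(this_cpu));
    }
```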

@avikivity (Member)

Bad NUMA detection:

  NUMANode L#0 (P#0)

Good NUMA detection:

    NUMANode L#0 (P#0 126GB)

So we just need to detect the case where the detected NUMA memory (here 0) is less than the memory we want to allocate, and treat that case too.

@syuu1228 force-pushed the cpu_topology_unavailable branch from 7e81ac7 to 4f14ad6 on January 24, 2025 at 14:48
@syuu1228 (Contributor Author)

@avikivity I completely rewrote the patch; now it does not use the non-hwloc functions, it fixes up the memory size on the hwloc objects instead.
How about this?

4f14ad6

@avikivity (Member)

Looks good, I'll wait for @tchaikov to review too.

@syuu1228 (Contributor Author)

@tchaikov Could you review the patch again?

@tchaikov (Contributor) left a comment

i am sorry for the latency.

left a few comments.

```
@@ -305,13 +305,26 @@ size_t div_roundup(size_t num, size_t denom) {
return (num + denom - 1) / denom;
}

static size_t alloc_from_node(cpu& this_cpu, hwloc_obj_t node, std::unordered_map<hwloc_obj_t, size_t>& used_mem, size_t alloc) {
static inline hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node) {
```
Contributor

i am interested in why you added the inline specifier here? not because i am against this, just because i don't understand why it is necessary.

Contributor Author

Dropped inline since it is not necessary.

```cpp
// This code does not assume that there are multiple nodes,
// but when this 'if' condition is met, hwloc fails to detect
// the hardware configuration and is expected to operate as
// a single node configuration, so it should work correctly.
```
@tchaikov (Contributor) · Jan 27, 2025

probably could improve this comment a little bit (just for better readability, and assuming the single-node configuration does indeed apply):

```cpp
// If hwloc fails to detect the hardware topology, it falls back to treating
// the system as a single-node configuration. While this code supports
// multi-node setups, the fallback behavior is safe and will function
// correctly in this case.
```

Contributor Author

Merged the new comment.

```
@@ -305,13 +305,26 @@ size_t div_roundup(size_t num, size_t denom) {
return (num + denom - 1) / denom;
}

static size_t alloc_from_node(cpu& this_cpu, hwloc_obj_t node, std::unordered_map<hwloc_obj_t, size_t>& used_mem, size_t alloc) {
static inline hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node) {
```
Contributor

better off renaming this function to get_memory_of_node(), so that we can reuse it for retrieving the total size of memory of this machine, where the node is an object of HWLOC_OBJ_MACHINE; also, because we retrieve total_memory when HWLOC_API_VERSION >= 0x00020000.

Contributor Author

Applied this idea; common functions are now used for both machine and node.
The new function names are get_memory_from_hwloc_obj() and set_memory_from_hwloc_obj().
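A minimal sketch of what such shared helpers could look like, assuming the HWLOC_API_VERSION guard pattern visible in the hunk below (the authoritative definitions are in the merged commit):

```cpp
static hwloc_uint64_t get_memory_from_hwloc_obj(hwloc_obj_t obj) {
#if HWLOC_API_VERSION >= 0x00020000
    // hwloc 2.x keeps the accumulated memory size directly on the object
    return obj->total_memory;
#else
    // hwloc 1.x keeps it in the nested memory struct
    return obj->memory.total_memory;
#endif
}

static void set_memory_from_hwloc_obj(hwloc_obj_t obj, hwloc_uint64_t memory) {
#if HWLOC_API_VERSION >= 0x00020000
    obj->total_memory = memory;
#else
    obj->memory.total_memory = memory;
#endif
}
```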

```
@@ -579,6 +592,10 @@ resources allocate(configuration& c) {
#else
auto available_memory = machine->memory.total_memory;
#endif
if (!available_memory) {
```
Contributor

could reuse get_local_memory_from_node() for retrieving the size of memory owned by the machine.

Contributor Author

Applied

```cpp
// but when this 'if' condition is met, hwloc fails to detect
// the hardware configuration and is expected to operate as
// a single node configuration, so it should work correctly.
set_local_memory_to_node(node, available_memory);
```
@tchaikov (Contributor) · Jan 27, 2025

i am confused. you're assigning the total size of memory of the whole machine to a NUMA node. on my machine, it is 32GiB. assuming my machine has two NUMA nodes and the OS is NUMA-aware, with your algorithm, we will be splitting 64GiB of memory across all the processing units, am i right?

how do we assure that there is only a single NUMA node here ? is this guaranteed by the algorithm or is it based on our observation ? can we verify this in the code, and at least print out an error message if this assumption is violated?

Contributor Author

Because when hwloc hits the "hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery." error, it does not proceed with the rest of the HW detection code, so it has no NUMA node information, memory information, etc.
see: https://github.com/open-mpi/hwloc/blob/09f88977dd85b01b2e4915a48c95a90cea74754c/hwloc/topology-linux.c#L5987

I verified this on m8g.metal-48xl, which has 2 NUMA nodes.
On m8g.metal-48xl, the cpu topology issue does not occur, so I emulated the issue by hiding the files under the topology directory, like this:

```
for i in /sys/devices/system/cpu/cpu*/topology;do sudo mount -t ramfs none $i;done
```

Before mounting the ramfs, Seastar's output looks like this:

```
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU0 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU1 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU2 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU3 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU4 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU5 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU6 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU7 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU8 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU9 to NUMA0

...


DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU100 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU101 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU102 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU103 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU104 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU105 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU106 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU107 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU108 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU109 to NUMA1

...

DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU191 to NUMA1
```

So hwloc reported 2 NUMA nodes.

After mounting the ramfs, hwloc reports a single NUMA node, like this:

```
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU0 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU1 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU2 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU3 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU4 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU5 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU6 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU7 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU8 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU9 to NUMA0

...


DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU100 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU101 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU102 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU103 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU104 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU105 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU106 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU107 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU108 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU109 to NUMA0

...

DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU191 to NUMA0
```

So we can say that when the issue occurs, there should be only a single NUMA node.

> can we verify this in the code, and at least print out an error message if this assumption is violated?

I added assert(num_nodes == 1); for safety.
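A rough sketch of where that assertion could sit, with names assumed from the hunks above (hwloc_get_nbobjs_by_type is the stock hwloc API; this is illustrative, not the exact merged code):

```cpp
    if (!available_memory) {
        // hwloc aborted linux discovery, so it can only have synthesized a
        // single NUMA node; fail loudly if that assumption is ever violated.
        int num_nodes = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
        assert(num_nodes == 1);
        // fall back to sysconf, as the non-hwloc allocate() does
        available_memory = sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES);
    }
```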

Contributor

thank you! it helps to convince us it's indeed safe.

@syuu1228 force-pushed the cpu_topology_unavailable branch from 4f14ad6 to 662bc23 on January 30, 2025 at 10:49
…hwloc

On the Fedora 41 AMI, on some aarch64 instances such as m7gd.16xlarge,
Seastar programs such as Scylla fail to start up with the following
error message:
```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems hwloc fails to initialize because
/sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assert occurred,
and found that alloc_from_node() is failing because node->total_memory
is 0. This is likely due to the hwloc initialization failure described
above.

I also found that calculate_memory() goes wrong since
machine->total_memory is also 0.

To avoid the error in such environments, we need to fix up the memory
size in both machine->total_memory and node->total_memory. We can use
sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES) for this, just like we
do in the non-hwloc version of allocate().

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797
@syuu1228 force-pushed the cpu_topology_unavailable branch from 662bc23 to 63d95e9 on January 30, 2025 at 11:10
@tchaikov (Contributor) left a comment

lgtm

@yaronkaikov

@avikivity Can we now merge it? We need it for returning to ASG for aarch64.

@avikivity avikivity merged commit b0a9f89 into scylladb:master Jan 30, 2025
15 checks passed
@syuu1228 (Contributor Author)

@yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.

@yaronkaikov

> @yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.

I think it's because we didn't make any updates since we branched. IIUC we should now create the branch and do the backport, right @avikivity @denesb?

@avikivity (Member)

> @yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.

> I think it's because we didn't make any updates since we branched. IIUC we should now create the branch and do the backport, right @avikivity @denesb?

Right.

@yaronkaikov

@denesb @avikivity Can you create the new branch and backport this change? We need it in all releases so we can switch back to ASG for aarch64.

@yaronkaikov

We need the same for 2025.1

@denesb (Contributor) commented Jan 31, 2025

@yaronkaikov @syuu1228 this discussion doesn't really belong in the seastar repository, but since it was started here already... I created branch-2025.1, branch-6.2 and branch-6.1 in scylla-seastar. Please open backport PRs and I'll merge them.

@avikivity (Member)

I queued commits that change .gitmodules to point to scylla-seastar.git.


Successfully merging this pull request may close these issues.

Scylla fails to startup on Fedora aarch64 AMI