
Update node-sizing.md #47


Merged 3 commits (Apr 22, 2025)
2 changes: 1 addition & 1 deletion docs/deployments/aws-ec2/install-caching-nodes.md
@@ -33,7 +33,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096

!!! info
To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
[Huge Pages](../deployment-planning/node-sizing.md#huge-pages).
[Huge Pages](../deployment-planning/node-sizing.md#memory-requirements).

```bash
demo@worker-1 ~> sudo systemctl restart kubelet
2 changes: 1 addition & 1 deletion docs/deployments/baremetal/install-caching-nodes.md
@@ -33,7 +33,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096

!!! info
To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
[Huge Pages](../deployment-planning/node-sizing.md#huge-pages).
[Huge Pages](../deployment-planning/node-sizing.md#memory-requirements).

```bash
demo@worker-1 ~> sudo systemctl restart kubelet
122 changes: 54 additions & 68 deletions docs/deployments/deployment-planning/node-sizing.md
@@ -11,8 +11,7 @@ requirements are elaborated below, whether deployed on a private or public cloud
The following sizing information is meant for production environments.

!!! warning
Simplyblock always recommends using physical cores over virtual and hyper-threading cores. If the sizing document
discusses virtual CPUs (vCPU), it means 0.5 physical CPUs. This corresponds to a typical hyper-threaded CPU core
If the sizing document discusses virtual CPUs (vCPU), one vCPU means 0.5 physical CPUs. This corresponds to a typical
hyper-threaded x86-64 CPU core. This also relates to how AWS EC2 cores are measured.

## Management Nodes
@@ -33,92 +32,87 @@ The following hardware sizing specifications are recommended:

## Storage Nodes

!!! warning
A storage node is not equal to a physical or virtual host. For optimal performance, at least two storage nodes
should be deployed on a two-socket system (one per NUMA socket); four storage nodes (two per socket) are even better.

A suitably sized storage node cluster is required to ensure optimal performance and scalability. Storage nodes are
responsible for handling all I/O operations and data services for logical volumes and snapshots.

The following hardware sizing specifications are recommended:

| Hardware | |
|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CPU | Minimum 8 vCPUs.<br/>3 cores are dedicated to service threads.<br/>Additionally, available cores are allocated to worker threads. Each additional core contributes about 200,000 IOPS to the node's performance profile (disregarding other limiting factors such as network bandwidth). |
| RAM | Minimum 4 GiB (for operating system) |
| Disk | Minimum 5 GiB boot volume |
| Hardware | |
|----------|-----------------------------------------------------------------------------------------------------------|
| CPU | Minimum 5 vCPU |
| RAM | Minimum 4 GiB |
| Disk | Minimum 10 GiB free space on boot volume |

### Memory Requirements

In addition to the above RAM requirements, the storage node requires additional memory based on the managed storage
capacity.

Simplyblock works with two types of memory: [huge pages memory](https://wiki.debian.org/Hugepages), which has to be
pre-allocated prior to starting the storage node services and is then exclusively assigned to simplyblock, as well as
system memory, which is required on demand.

#### Huge Pages

The exact amount of huge page memory is calculated when adding or restarting a node based on two parameters: the maximum
amount of storage available in the cluster and the maximum amount of logical volumes which can be created on the node:

| Unit | Memory Requirement |
|--------------------------------|--------------------|
| Per logical volume | 6 MiB |
| Per TB of max. cluster storage | 256 MiB |
While a certain amount of RAM is pre-reserved for [SPDK](../../important-notes/terminology.md#spdk-storage-performance-development-kit),
another part is dynamically allocated at runtime. Users should ensure that the full amount of required RAM is available
(reserved) from the system as long as simplyblock is running.

!!! recommendation
For bare metal, virtualized, or disaggregated deployments, simplyblock recommends allocating around 75% of the
available memory as huge pages, minimizing memory overhead.<br/><br/>
For hyper-converged deployments, please use the [huge pages calculator](../../reference/huge-pages-calculator.md).
The exact amount of memory is calculated when adding or restarting a node based on two parameters:

If not enough huge pages memory is available, the node will refuse to start. In this case, you may check
`/proc/meminfo` for total, reserved, and available huge page memory on a corresponding node.
- The maximum amount of storage available in the cluster
- The maximum number of logical volumes which can be created on the node

Execute the following command to allocate temporary huge pages while the system is already running. It will allocate
8 GiB in huge pages. Please adjust the number of huge pages depending on your requirements.
| Unit                            | Memory Requirement |
|---------------------------------|--------------------|
| Fixed amount                    | 2 GiB              |
| Per logical volume              | 25 MiB             |
| Max. utilized capacity on node  | 0.05 %             |
| NVMe capacity on node           | 0.025 %            |

```bash title="Allocate temporary huge pages"
sudo sysctl vm.nr_hugepages=4096
```

Since the allocation is temporary, it will disappear after a system reboot.

!!! recommendation
Simplyblock recommends pre-allocating huge pages via the bootloader command line. This prevents fragmentation of
the huge pages memory and ensures a contiguous memory area can be allocated.
```plain title="GRUB configuration change"
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} default_hugepagesz=2MB hugepagesz=2MB hugepages=4096"
```
Afterward, the change must be persisted for it to take effect.
```bash title="Persist GRUB configuration"
sudo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
```
!!! info
Example: A node has 10 NVMe devices with 8 TB each (80,000 GB of NVMe capacity). The cluster has 3 nodes and a
total capacity of 240 TB. Logical volumes are equally distributed across nodes, and it is planned to use up to
1,000 logical volumes on each node. Hence, the following calculation:
```plain
2 GB + (1,000 * 0.025 GB) + (0.05% * 240,000 GB / 3) + (0.025% * 80,000 GB)
  = 2 + 25 + 40 + 20 = 87 GB
```
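The calculation from the table can be sketched as a small helper. The function name is hypothetical, and the constants are taken from the sizing table above (25 MiB per volume approximated as 0.025 GB, matching the worked example):

```shell
# Hypothetical helper: estimate a storage node's memory requirement in GB.
# Inputs: number of logical volumes, max. utilized capacity on the node (GB),
# and NVMe capacity on the node (GB).
estimate_node_memory_gb() {
  awk -v volumes="$1" -v utilized_gb="$2" -v nvme_gb="$3" 'BEGIN {
    fixed  = 2                      # fixed amount: 2 GiB
    per_lv = volumes * 0.025        # 25 MiB (~0.025 GB) per logical volume
    util   = utilized_gb * 0.0005   # 0.05% of max. utilized capacity
    nvme   = nvme_gb * 0.00025      # 0.025% of NVMe capacity
    printf "%.1f\n", fixed + per_lv + util + nvme
  }'
}

# 1,000 volumes, 80,000 GB max. utilized, 80,000 GB NVMe capacity per node
estimate_node_memory_gb 1000 80000 80000
```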

#### Conventional Memory

In addition to huge pages, simplyblock requires dynamically allocatable conventional system memory. The required
amount depends on the utilized storage.

| Unit | Memory Requirement |
|----------------------------------------|--------------------|
| Per TiB of used local SSD storage | 256 MiB |
| Per TiB of used logical volume storage | 256 MiB |
If not enough memory is available, the node will refuse to start. In this case, `/proc/meminfo` may be checked for
total, reserved, and available system and huge page memory on the corresponding node.
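A quick way to inspect those counters (assuming a Linux host) is a simple grep over `/proc/meminfo`:

```shell
# Show huge page totals, free/reserved counts, and available system memory
grep -E 'HugePages_(Total|Free|Rsvd)|MemAvailable' /proc/meminfo
```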

!!! info
Used local SSD storage is the physically utilized capacity of the local NVMe devices on the storage node at a point
in time. Used logical volume storage is the physically utilized capacity of all logical volumes on a specific
storage node at a point in time.
Part of the memory will be allocated as huge-page memory. In case of a high degree of memory fragmentation, a system
may not be able to allocate enough huge-page memory even if enough system memory is available. If the
node fails to start up, a system reboot may free enough contiguous memory.

The following command can be executed to temporarily allocate huge pages while the system is already running. It
will allocate 8 GiB in huge pages. The number of huge pages must be adjusted depending on the requirements. The
[Huge Pages Calculator](../../reference/huge-pages-calculator.md) helps with calculating the required number of
huge pages.

```bash title="Allocate temporary huge pages"
sudo sysctl vm.nr_hugepages=4096
```

Since the allocation is temporary, it will disappear after a system reboot. It must be ensured that the setting is
either re-applied after each system reboot or persisted to be automatically applied on system boot.
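One common way to persist the setting is a sysctl drop-in file; this is a sketch under the assumption that the host runs systemd, which applies `/etc/sysctl.d` at boot (the file name is an example):

```shell
# Hypothetical sketch: persist the huge pages setting via a sysctl drop-in.
persist_hugepages() {
  # $1 = number of 2 MiB huge pages, $2 = target directory
  echo "vm.nr_hugepages = $1" > "$2/90-hugepages.conf"
}

# As root: persist_hugepages 4096 /etc/sysctl.d && sysctl --system
```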

### Storage Planning

Simplyblock storage nodes require one or more NVMe devices to provide storage capacity to the distributed storage pool
of a storage cluster.

!!! recommendation
Simplyblock requires at least three similarly sized NVMe devices per storage node.

Furthermore, simplyblock storage nodes require one additional NVMe device with less capacity as a journaling device.
The journaling device becomes part of the distributed record journal, keeping track of all changes before being
persisted into their final position. This helps with write performance and transactional behavior by using a
write-ahead log structure and replaying the journal in case of an issue.

!!! warning
Simplyblock does not work with device partitions or claimed (mounted) devices. It must be ensured that all NVMe
devices to be used by simplyblock are unmounted and not busy.

Any partition must be removed from the NVMe devices prior to installing simplyblock. Furthermore, NVMe devices must
be low-level formatted with a 4 KB block size (lbads: 12). More information can be found in [NVMe Low-Level Format](../../reference/nvme-low-level-format.md).
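A pre-flight check along these lines can catch claimed devices early. This is a sketch, not simplyblock tooling; it assumes `lsblk -n -o NAME,MOUNTPOINTS /dev/<dev>` output, where extra lines mean partitions and a second column means a mountpoint:

```shell
# Hypothetical pre-flight check: a device is only safe to hand to simplyblock
# if lsblk reports no child partitions and no mountpoint for it.
device_unclaimed() {
  awk 'NR > 1 || NF > 1 { claimed = 1 } END { exit claimed }'
}

# Usage:
#   lsblk -n -o NAME,MOUNTPOINTS /dev/nvme0n1 | device_unclaimed \
#     && echo "unclaimed" || echo "do not use"
```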

!!! info
Secondary nodes don't need NVMe storage disks.

@@ -129,15 +123,7 @@ write-through cache to a disaggregated cluster, improving access latency substantially.

| Hardware | |
|----------|------------------------------------------------------|
| CPU | Minimum 2 vCPU, better 2 physical cores. |
| RAM | Minimum 2 GiB, plus 25% of the configured huge pages |

In addition to the base conventional memory configuration, a caching node requires huge pages memory. Calculating the
required huge pages memory can be achieved using the following formula:
| CPU | Minimum 6 vCPU |
| RAM | Minimum 4 GiB |

```plain title="Huge page calculation (caching node)"
huge_pages_size=2GiB + 0.0025 * nvme_size_gib
```

With the above formula, a locally-attached NVMe device of 1.9TiB would require 6.75GiB huge pages memory
(`2 + 0.0025 * 1900`).
2 changes: 1 addition & 1 deletion docs/deployments/kubernetes/install-caching-nodes.md
@@ -33,7 +33,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096

!!! info
To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
[Huge Pages](../deployment-planning/node-sizing.md#huge-pages).
[Huge Pages](../deployment-planning/node-sizing.md#memory-requirements).

```bash
demo@worker-1 ~> sudo systemctl restart kubelet
@@ -17,7 +17,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096

!!! info
To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
[Huge Pages](../../deployment-planning/node-sizing.md#huge-pages).
[Huge Pages](../../deployment-planning/node-sizing.md#memory-requirements).


```bash
89 changes: 89 additions & 0 deletions docs/reference/nvme-low-level-format.md
@@ -0,0 +1,89 @@
---
title: "NVMe Low-Level Format"
weight: 20600
---

NVMe devices store data in blocks of configurable size. Simplyblock expects NVMe devices to provide a 4 KB internal
block size. Hence, to prevent data loss in case of a sudden power outage, NVMe devices must be formatted with a
specific LBA format.

!!! danger
Failing to format NVMe devices with the correct LBA format can lead to data loss or data corruption in the case
of a sudden power outage or other loss of power.

The `lsblk` command is the easiest way to find all NVMe devices attached to a system.

```plain title="Example output of lsblk"
[demo@demo-3 ~]# sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 30G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 29G 0 part
├─rl-root 253:0 0 26G 0 lvm /
└─rl-swap 253:1 0 3G 0 lvm [SWAP]
nvme3n1 259:0 0 6.5G 0 disk
nvme2n1 259:1 0 70G 0 disk
nvme1n1 259:2 0 70G 0 disk
nvme0n1 259:3 0 70G 0 disk
```

In the example, we see four NVMe devices: three with 70 GiB and one with 6.5 GiB of storage capacity.

To find the correct LBA format (_lbaf_) for each of the devices, the `nvme` CLI can be used.

```bash title="Show NVMe namespace information"
sudo nvme id-ns /dev/nvmeXnY
```

The output depends on the NVMe device itself, but looks something like this:

```plain title="Example output of NVMe namespace information"
[demo@demo-3 ~]# sudo nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
...
lbaf 0 : ms:0 lbads:9 rp:0
lbaf 1 : ms:8 lbads:9 rp:0
lbaf 2 : ms:16 lbads:9 rp:0
lbaf 3 : ms:64 lbads:9 rp:0
lbaf 4 : ms:0 lbads:12 rp:0 (in use)
lbaf 5 : ms:8 lbads:12 rp:0
lbaf 6 : ms:16 lbads:12 rp:0
lbaf 7 : ms:64 lbads:12 rp:0
```

From this output, the required _lbaf_ configuration can be determined. The necessary configuration must have the
following values:

| Property | Value |
|----------|-------|
| ms | 0 |
| lbads | 12 |
| rp | 0 |

In the example, the required LBA format is 4. If an NVMe device doesn't offer that exact combination, any other
combination with lbads=12 will work. However, simplyblock recommends choosing the best available combination.
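Selecting the index can be scripted; this is a hypothetical helper (the function name is an assumption) that parses the plain-text `nvme id-ns` field layout shown in the example above:

```shell
# Hypothetical helper: print the first lbaf index offering ms:0 and lbads:12
# from `nvme id-ns` output (fields: "lbaf <idx> : ms:<n> lbads:<n> rp:<n>").
pick_lbaf() {
  awk '$1 == "lbaf" && $4 == "ms:0" && $5 == "lbads:12" { print $2; exit }'
}

# Usage: sudo nvme id-ns /dev/nvme0n1 | pick_lbaf
```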

In our example, the device is already formatted with the correct _lbaf_ (note the "in use" marker). It is, however,
recommended to always format the device before use.

To format the drive, the `nvme` CLI is used again.

```bash title="Formatting the NVMe device"
sudo nvme format --lbaf=<lbaf> --ses=0 /dev/nvmeXnY
```

When executed, the command should report success, similar to the example below.

```plain title="Example output of NVMe device formatting"
[demo@demo-3 ~]# sudo nvme format --lbaf=4 --ses=0 /dev/nvme0n1
You are about to format nvme0n1, namespace 0x1.
WARNING: Format may irrevocably delete this device's data.
You have 10 seconds to press Ctrl-C to cancel this operation.

Use the force [--force] option to suppress this warning.
Sending format operation ...
Success formatting namespace:1
```

!!! warning
This operation needs to be repeated for each NVMe device that will be handled by simplyblock.
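The repetition can be scripted; this is a dry-run-guarded sketch, where the device names and the lbaf index (4) are placeholders that must be verified per device with `nvme id-ns` first:

```shell
# Hypothetical sketch: low-level format several NVMe devices in a loop.
# DESTRUCTIVE when dry_run=false; all data on the devices is lost.
dry_run=true
for dev in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1; do
  if [ "$dry_run" = true ]; then
    echo "would run: sudo nvme format --lbaf=4 --ses=0 --force $dev"
  else
    sudo nvme format --lbaf=4 --ses=0 --force "$dev"
  fi
done
```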