diff --git a/docs/deployments/aws-ec2/install-caching-nodes.md b/docs/deployments/aws-ec2/install-caching-nodes.md
index ebc3dd4..25fa2fd 100644
--- a/docs/deployments/aws-ec2/install-caching-nodes.md
+++ b/docs/deployments/aws-ec2/install-caching-nodes.md
@@ -33,7 +33,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096
 !!! info
 To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
-[Huge Pages](../deployment-planning/node-sizing.md#huge-pages).
+[Huge Pages](../deployment-planning/node-sizing.md#memory-requirements).
 
 ```bash
 demo@worker-1 ~> sudo systemctl restart kubelet
diff --git a/docs/deployments/baremetal/install-caching-nodes.md b/docs/deployments/baremetal/install-caching-nodes.md
index ebc3dd4..25fa2fd 100644
--- a/docs/deployments/baremetal/install-caching-nodes.md
+++ b/docs/deployments/baremetal/install-caching-nodes.md
@@ -33,7 +33,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096
 !!! info
 To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
-[Huge Pages](../deployment-planning/node-sizing.md#huge-pages).
+[Huge Pages](../deployment-planning/node-sizing.md#memory-requirements).
 
 ```bash
 demo@worker-1 ~> sudo systemctl restart kubelet
diff --git a/docs/deployments/deployment-planning/node-sizing.md b/docs/deployments/deployment-planning/node-sizing.md
index 9d3f3e8..f58ef94 100644
--- a/docs/deployments/deployment-planning/node-sizing.md
+++ b/docs/deployments/deployment-planning/node-sizing.md
@@ -11,8 +11,7 @@ requirements are elaborated below, whether deployed on a private or public cloud
 The following sizing information is meant for production environments.
 
 !!! warning
-    Simplyblock always recommends using physical cores over virtual and hyper-threading cores. If the sizing document
-    discusses virtual CPUs (vCPU), it means 0.5 physical CPUs. This corresponds to a typical hyper-threaded CPU core
+    If the sizing document discusses virtual CPUs (vCPU), it means 0.5 physical CPUs. This corresponds to a typical hyper-threaded CPU core
     x86-64. This also relates to how AWS EC2 cores are measured.
 
 ## Management Nodes
@@ -33,92 +32,87 @@ The following hardware sizing specifications are recommended:
 
 ## Storage Nodes
 
+!!! warning
+    A storage node is not the same as a physical or virtual host. For optimal performance, at least two storage nodes
+    should be deployed on a two-socket system (one per NUMA socket); ideally, four storage nodes (two per socket) are
+    deployed.
+
 A suitably sized storage node cluster is required to ensure optimal performance and scalability. Storage nodes are
 responsible for handling all I/O operations and data services for logical volumes and snapshots.
 
 The following hardware sizing specifications are recommended:
 
-| Hardware | |
-|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| CPU      | Minimum 8 vCPUs.<br/>3 cores are dedicated to service threads.<br/>Additionally, available cores are allocated to worker threads. Each additional core contributes about 200.000 IOPS to the node's performance profile (disregarding other limiting factors such as network bandwidth). |
-| RAM      | Minimum 4 GiB (for operating system) |
-| Disk     | Minimum 5 GiB boot volume |
+| Hardware | |
+|----------|-----------------------------------------------------------------------------------------------------------|
+| CPU      | Minimum 5 vCPU |
+| RAM      | Minimum 4 GiB |
+| Disk     | Minimum 10 GiB free space on boot volume |
 
 ### Memory Requirements
 
 In addition to the above RAM requirements, the storage node requires additional memory based on the managed storage
 capacity.
 
-Simplyblock works with two types of memory: [huge pages memory](https://wiki.debian.org/Hugepages), which has to be
-pre-allocated prior to starting the storage node services and is then exclusively assigned to simplyblock, as well as
-system memory, which is required on demand.
-
-#### Huge Pages
-
-The exact amount of huge page memory is calculated when adding or restarting a node based on two parameters: the maximum
-amount of storage available in the cluster and the maximum amount of logical volumes which can be created on the node:
-
-| Unit | Memory Requirement |
-|--------------------------------|--------------------|
-| Per logical volume | 6 MiB |
-| Per TB of max. cluster storage | 256 MiB |
+While a certain amount of RAM is pre-reserved for [SPDK](../../important-notes/terminology.md#spdk-storage-performance-development-kit),
+another part is dynamically pre-allocated. Users should ensure that the full amount of required RAM is available
+(reserved) from the system as long as simplyblock is running.
 
-!!! recommendation
-    For bare metal, virtualized, or disaggregated deployments, simplyblock recommends allocating around 75% of the
-    available memory as huge pages, minimizing memory overhead.
-
-    For hyper-converged deployments, please use the [huge pages calculator](../../reference/huge-pages-calculator.md).
+The exact amount of memory is calculated when adding or restarting a node based on two parameters:
 
-If not enough huge pages memory is available, the node will refuse to start. In this case, you may check
-`/proc/meminfo` for total, reserved, and available huge page memory on a corresponding node.
-Execute the following command to allocate temporary huge pages while the system is already running. It will allocate
-8 GiB in huge pages. Please adjust the number of huge pages depending on your requirements.
+- The maximum amount of storage available in the cluster
+- The maximum number of logical volumes which can be created on the node
+
+| Unit | Memory Requirement |
+|-------------------------------------|--------------------|
+| Fixed amount | 2 GiB |
+| Per logical volume | 25 MiB |
+| % of max. utilized capacity on node | 0.05 |
+| % of NVMe capacity on node | 0.025 |
 
-```bash title="Allocate temporary huge pages"
-sudo sysctl vm.nr_hugepages=4096
-```
-
-Since the allocation is temporary, it will disappear after a system reboot.
-
-!!! recommendation
-    Simplyblock recommends to pre-allocate huge pages via the bootloader commandline. This prevents fragmentation of
-    the huge pages memory and ensures a continuous memory area to be allocated.
-    ```plain title="GRUB configuration change"
-    GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} default_hugepagesz=2MB hugepagesz=2MB hugepages=4096"
-    ```
-    Afterward, you need to persist the change to take effect.
-    ```bash title="Persist GRUB configuration"
-    sudo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
+!!! info
+    Example: A node has 10 NVMe devices with 8 TB each. The cluster has 3 nodes and a total capacity of 240 TB.
+    Logical volumes are equally distributed across nodes, and it is planned to use up to 1,000 logical volumes on
+    each node. Hence, the following calculation applies:
+    ```plain
+    2 + (0.025 * 1,000) + (0.05 * 240,000 GB / 3) + (0.025 * 80,000 GB) = 64.5 GB
     ```
 
-#### Conventional Memory
-
-Additionally to huge pages, simplyblock requires dynamically allocatable conventional system memory. The required
-amount depends on the utilized storage.
-
-| Unit | Memory Requirement |
-|----------------------------------------|--------------------|
-| Per TiB of used local SSD storage | 256 MiB |
-| Per TiB of used logical volume storage | 256 MiB |
+If not enough memory is available, the node will refuse to start. In this case, `/proc/meminfo` may be checked for
+total, reserved, and available system and huge page memory on a corresponding node.
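+
+For example, the current counters can be inspected as follows (the exact output depends on the system):
+
+```bash title="Check system and huge page memory"
+grep -E 'MemAvailable|HugePages_(Total|Free|Rsvd)|Hugepagesize' /proc/meminfo
+```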
 
 !!! info
-    Used local SSD storage is the physically utilized capacity of the local NVMe devices on the storage node at a point
-    in time. Used logical volume storage is the physically utilized capacity of all logical volumes on a specific
-    storage node at a point in time.
+    Part of the memory will be allocated as huge-page memory. In case of a high degree of memory fragmentation, a
+    system may not be able to allocate enough huge-page memory even if there is enough system memory available. If
+    the node fails to start up, a system reboot may help to free enough memory.
+
+    The following command can be executed to temporarily allocate huge pages while the system is already running. It
+    will allocate 8 GiB in huge pages. The number of huge pages must be adjusted depending on the requirements. The
+    [Huge Pages Calculator](../../reference/huge-pages-calculator.md) helps with calculating the required number of
+    huge pages.
+
+    ```bash title="Allocate temporary huge pages"
+    sudo sysctl vm.nr_hugepages=4096
+    ```
+
+    Since the allocation is temporary, it will disappear after a system reboot. It must be ensured that either the
+    setting is re-applied after each system reboot or persisted to be automatically applied on system boot up.
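+
+    One way to persist the setting is, for example, a drop-in file under `/etc/sysctl.d/` (the file name below is
+    only an example):
+
+    ```bash title="Persist huge pages across reboots"
+    echo "vm.nr_hugepages = 4096" | sudo tee /etc/sysctl.d/90-hugepages.conf
+    sudo sysctl --system
+    ```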
 
 ### Storage Planning
 
 Simplyblock storage nodes require one or more NVMe devices to provide storage capacity to the distributed storage pool
 of a storage cluster.
 
-!!! recommendation
-    Simplyblock requires at least three similar sized NVMe devices per storage node.
-
 Furthermore, simplyblock storage nodes require one additional NVMe device with less capacity as a journaling device.
 The journaling device becomes part of the distributed record journal, keeping track of all changes before being
 persisted into their final position. This helps with write performance and transactional behavior by using a
 write-ahead log structure and replaying the journal in case of a issue.
 
+!!! warning
+    Simplyblock does not work with device partitions or claimed (mounted) devices. It must be ensured that all NVMe
+    devices to be used by simplyblock are unmounted and not busy.
+
+    Any partition must be removed from the NVMe devices prior to installing simplyblock. Furthermore, NVMe devices must
+    be low-level formatted with 4KB block size (lbaf: 12). More information can be found in
+    [NVMe Low-Level Format](../../reference/nvme-low-level-format.md).
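+
+    For example, existing partition tables and filesystem signatures can be removed with `wipefs` (assuming the
+    device is not in use; replace the device name accordingly):
+
+    ```bash title="Remove existing signatures from an NVMe device"
+    sudo wipefs --all /dev/nvmeXnY
+    ```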
+
 !!! info
     Secondary nodes don't need NVMe storage disks.
 
@@ -129,15 +123,7 @@ write-through cache to a disaggregated cluster, improving access latency substan
 
 | Hardware | |
 |----------|------------------------------------------------------|
-| CPU | Minimum 2 vCPU, better 2 physical cores. |
-| RAM | Minimum 2 GiB, plus 25% of the configured huge pages |
-
-In addition to the base conventional memory configuration, a caching node requires huge pages memory. Calculating the
-required huge pages memory can be achieved using the following formula:
+| CPU | Minimum 6 vCPU |
+| RAM | Minimum 4 GiB |
-
-```plain title="Huge page calculation (caching node)"
-huge_pages_size=2GiB + 0.0025 * nvme_size_gib
-```
-With the above formula, a locally-attached NVMe device of 1.9TiB would require 6.75GiB huge pages memory
-(`2 + 0.0025 * 1900`).
diff --git a/docs/deployments/kubernetes/install-caching-nodes.md b/docs/deployments/kubernetes/install-caching-nodes.md
index b3ffa8c..0352b9c 100644
--- a/docs/deployments/kubernetes/install-caching-nodes.md
+++ b/docs/deployments/kubernetes/install-caching-nodes.md
@@ -33,7 +33,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096
 !!! info
 To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
-[Huge Pages](../deployment-planning/node-sizing.md#huge-pages).
+[Huge Pages](../deployment-planning/node-sizing.md#memory-requirements).
 
 ```bash
 demo@worker-1 ~> sudo systemctl restart kubelet
diff --git a/docs/deployments/kubernetes/install-simplyblock/hyper-converged.md b/docs/deployments/kubernetes/install-simplyblock/hyper-converged.md
index b288079..4f10e9b 100644
--- a/docs/deployments/kubernetes/install-simplyblock/hyper-converged.md
+++ b/docs/deployments/kubernetes/install-simplyblock/hyper-converged.md
@@ -17,7 +17,7 @@ demo@worker-1 ~> sudo sysctl -w vm.nr_hugepages=4096
 !!! info
     To see how huge pages can be pre-reserved at boot time, see the node sizing documentation section on
-    [Huge Pages](../../deployment-planning/node-sizing.md#huge-pages).
+    [Huge Pages](../../deployment-planning/node-sizing.md#memory-requirements).
 
 ```bash
diff --git a/docs/reference/nvme-low-level-format.md b/docs/reference/nvme-low-level-format.md
new file mode 100644
index 0000000..fc50207
--- /dev/null
+++ b/docs/reference/nvme-low-level-format.md
@@ -0,0 +1,89 @@
+---
+title: "NVMe Low-Level Format"
+weight: 20600
+---
+
+NVMe devices store data in blocks of configurable size. Simplyblock expects NVMe devices to provide a 4 KB internal
+block size. Hence, to prevent data loss in case of a sudden power outage, NVMe devices must be formatted with a
+specific LBA format.
+
+!!! danger
+    Failing to format NVMe devices with the correct LBA format can lead to data loss or data corruption in case
+    of a sudden power outage or other loss of power.
+
+The `lsblk` command is the best way to find all NVMe devices attached to a system.
+
+```plain title="Example output of lsblk"
+[demo@demo-3 ~]# sudo lsblk
+NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
+sda           8:0    0   30G  0 disk
+├─sda1        8:1    0    1G  0 part /boot
+└─sda2        8:2    0   29G  0 part
+  ├─rl-root 253:0    0   26G  0 lvm  /
+  └─rl-swap 253:1    0    3G  0 lvm  [SWAP]
+nvme3n1     259:0    0  6.5G  0 disk
+nvme2n1     259:1    0   70G  0 disk
+nvme1n1     259:2    0   70G  0 disk
+nvme0n1     259:3    0   70G  0 disk
+```
+
+In the example, we see four NVMe devices: three devices with 70 GiB and one device with 6.5 GiB of storage capacity.
+
+To find the correct LBA format (_lbaf_) for each of the devices, the `nvme` CLI can be used.
+
+```bash title="Show NVMe namespace information"
+sudo nvme id-ns /dev/nvmeXnY
+```
+
+The output depends on the NVMe device itself, but looks something like this:
+
+```plain title="Example output of NVMe namespace information"
+[demo@demo-3 ~]# sudo nvme id-ns /dev/nvme0n1
+NVME Identify Namespace 1:
+...
+lbaf  0 : ms:0   lbads:9  rp:0
+lbaf  1 : ms:8   lbads:9  rp:0
+lbaf  2 : ms:16  lbads:9  rp:0
+lbaf  3 : ms:64  lbads:9  rp:0
+lbaf  4 : ms:0   lbads:12 rp:0 (in use)
+lbaf  5 : ms:8   lbads:12 rp:0
+lbaf  6 : ms:16  lbads:12 rp:0
+lbaf  7 : ms:64  lbads:12 rp:0
+```
+
+From this output, the required _lbaf_ configuration can be found. The necessary configuration must have the following
+values:
+
+| Property | Value |
+|----------|-------|
+| ms       | 0     |
+| lbads    | 12    |
+| rp       | 0     |
+
+In the example, the required LBA format is 4. If an NVMe device doesn't offer that exact combination, any other
+combination with lbads=12 will work. However, simplyblock recommends choosing the best available combination.
+
+In our example, the device is already formatted with the correct _lbaf_ (see the "in use" marker). It is, however,
+recommended to always format the device before use.
+
+To format the drive, the `nvme` CLI is used again.
+
+```bash title="Formatting the NVMe device"
+sudo nvme format --lbaf=<lbaf> --ses=0 /dev/nvmeXnY
+```
+
+When executed, the command should report success, similar to the example below.
+
+```plain title="Example output of NVMe device formatting"
+[demo@demo-3 ~]# sudo nvme format --lbaf=4 --ses=0 /dev/nvme0n1
+You are about to format nvme0n1, namespace 0x1.
+WARNING: Format may irrevocably delete this device's data.
+You have 10 seconds to press Ctrl-C to cancel this operation.
+
+Use the force [--force] option to suppress this warning.
+Sending format operation ...
+Success formatting namespace:1
+```
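+
+After formatting, the new logical block size can be verified, for instance with `lsblk` (column names may vary
+between versions):
+
+```bash title="Verify the logical block size"
+lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvmeXnY
+```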
+
+!!! warning
+    This operation needs to be repeated for each NVMe device that will be handled by simplyblock.
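+
+    For example, the format command can be run in a loop over all target devices. The device names below and the
+    `--lbaf=4` value are only examples taken from the output above; formatting destroys all data on the devices.
+
+    ```bash title="Example: format multiple NVMe devices"
+    # Each invocation waits 10 seconds before formatting unless --force is passed.
+    for dev in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1; do
+        sudo nvme format --lbaf=4 --ses=0 "$dev"
+    done
+    ```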