Multi-RBD performance does not scale up well as fio-rbd #939

Open · xin3liang opened this issue Nov 11, 2024 · 5 comments

xin3liang (Contributor) commented Nov 11, 2024

We ran some 4k random read/write performance tests on the testbed below and found that the NVMe-oF gateway multi-RBD performance does not scale up as well as fio-rbd.
[charts: 4k random read/write results, NVMe-oF gateway multi-RBD vs fio-rbd]

Hardware

  • Arm CPU: Kunpeng 920, 2.6GHz, 96 CPU cores, 4 numa nodes
  • X86 CPU: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz, 112 CPU cores, 2 numa nodes
  • Disk: 3 x ES3000 V6 NVMe SSD 3.2T per Arm server
  • Network: 1 x MLNX ConnectX-5 100Gb IB, 1 x 1Gb TCP

Software

  • OS: openEuler 22.03 LTS SP3, kernel 5.10.0-192.0.0.105.oe2203sp3
  • Ceph: main-nvmeof branch, with the commit "nvmeof gw monitor: disable by default" reverted
  • SPDK: 24.05
  • nvmeof: 1.3.2
  • fio: fio-3.29

Deployment and Parameters Tuning

  • Deploy the NVMe-oF gateway and the Ceph cluster with cephadm.
  • To get good backend Ceph performance for testing the NVMe-oF gateway, we set a larger pg_num, set the replica size to 1, and bind each OSD to 4 cores (see the command sketch after this list).
  • Rebuild the nvmeof image without the "--enable-debug" option, i.e. as a release-type build.
  • Tune the CPU cores/mask of the SPDK NVMe-oF target and the Ceph client so that their threads are bound to the same NUMA node as the 100Gb NIC.
  • Make sure there are enough CPU cores for the NVMe-oF target and Ceph client threads so that they do not hit a CPU bottleneck.
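For reference, a minimal sketch of the pool tuning described above, assuming the pool name nvmeof used in the fio config further down; on recent Ceph releases size=1 additionally needs mon_allow_pool_size_one, and the OSD-to-4-cores pinning was done separately (method not shown here):

# benchmark-only pool tuning (do not use size=1 in production)
ceph osd pool create nvmeof 16384 16384 replicated
ceph config set mon mon_allow_pool_size_one true
ceph osd pool set nvmeof size 1 --yes-i-really-mean-it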
Each OSD bound to 4 cores
Ceph pool: size=1, pg_num=16384

# Note: increase the nvmeof gateway msg and io enqueue threads; bind the Ceph client and the SPDK target to the same NUMA node as the NIC
ms_async_op_threads = 9     # 3 -> 9
librados_thread_count = 10  # 2 -> 10

# x86 (4 SPDK cores, all threads in NUMA node 1, NIC in NUMA node 1)
tgt_cmd_extra_args = "-m 0xF0000000"
librbd_core_mask = 0xFFFFFFF0000000FFFFFF00000000

# arm (6 SPDK cores, all threads in NUMA nodes 2-3, NIC in NUMA node 2)
tgt_cmd_extra_args = "-m 0x3F000000000000"
librbd_core_mask = 0xFFFFFFFFFFC0000000000000
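Roughly, the first two knobs are Ceph client options while the last two belong to the gateway configuration; a sketch of where they might be set, using the x86 values (the section names are my assumption, check your ceph.conf and ceph-nvmeof.conf):

# ceph.conf seen by the gateway's librbd/librados client (assumed [client] section)
[client]
ms_async_op_threads = 9
librados_thread_count = 10

# ceph-nvmeof.conf (assumed [spdk] section), x86 values
[spdk]
tgt_cmd_extra_args = -m 0xF0000000
librbd_core_mask = 0xFFFFFFF0000000FFFFFF00000000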

FYI, in case someone is interested in the details of the hybrid x86 and Arm Ceph NVMe-oF gateway cluster deployment, please refer to the attached PDF:
Ceph SPDK NVMe-oF Gateway Evaluation on openEuler on openEuler (1).pdf

Fio Running Commands and Configs
We run the fio tests on the client node with commands of the form RW=randwrite BS=4k IODEPTH=128 fio ./[fio_test-rbd.conf|fio_test-nvmeof.conf] --numjobs=1.
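Spelled out, the two invocations presumably look like the following; the exported variables fill the ${RW}/${BS}/${IODEPTH} placeholders in the job files below (the exact shell wrapping is my assumption):

# rbd engine run: one job per RBD image, straight to the Ceph pool
RW=randwrite BS=4k IODEPTH=128 fio ./fio_test-rbd.conf --numjobs=1

# io_uring run against the NVMe namespaces exported by the gateway
RW=randwrite BS=4k IODEPTH=128 fio ./fio_test-nvmeof.conf --numjobs=1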

(.venv) [root@client1 spdktest]# cat fio_test-rbd.conf
[global]
#stonewall
description="Run ${RW} ${BS} rbd test"
bs=${BS}
ioengine=rbd
clientname=admin
pool=nvmeof
#pool=test-pool
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=60m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
rbdname=fio_test_image1

[test-job2]
rbdname=fio_test_image2

[test-job3]
rbdname=fio_test_image3

[test-job4]
rbdname=fio_test_image4

[test-job5]
rbdname=fio_test_image5

(.venv) [root@client1 spdktest]# cat fio_test-nvmeof.conf
[global]
#stonewall
description="Run ${RW} ${BS} NVMe ssd test"
bs=${BS}
#ioengine=libaio
ioengine=io_uring
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=1m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
#filename=/dev/nvme2n1
filename=/dev/nvme2n2

[test-job2]
#filename=/dev/nvme2n3
#filename=/dev/nvme2n4
filename=/dev/nvme4n1
#filename=/dev/nvme4n2

#[test-job3]
#filename=/dev/nvme2n5
##filename=/dev/nvme2n6
#
#[test-job4]
#filename=/dev/nvme2n7
##filename=/dev/nvme2n8
#
#[test-job5]
#filename=/dev/nvme2n9
##filename=/dev/nvme2n10
xin3liang (Contributor, Author) commented:
We notice that currently one ceph-nvmeof gateway creates only one Ceph IO context (RADOS connection), whereas fio creates one Ceph IO context per running job.

According to the two performance tuning guides below, a single Ceph IO context cannot handle read/write access to many RBD images well, so the RBD grouping strategy (one Ceph IO context per group of images) might help multi-RBD performance scale up.

See P9-10 of:
https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf
Rbd Grouping Strategy:
https://www.intel.com/content/www/us/en/developer/articles/technical/performance-tuning-of-ceph-rbd.html
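For illustration, the grouping idea maps onto SPDK's rbd bdev RPCs roughly as below: register several Rados cluster contexts and spread the image bdevs across them. The RPC names (bdev_rbd_register_cluster, bdev_rbd_create) exist in SPDK, but the cluster names are hypothetical and the exact argument spelling may differ between SPDK versions:

# register two independent Rados cluster contexts (names are hypothetical)
scripts/rpc.py bdev_rbd_register_cluster cluster0
scripts/rpc.py bdev_rbd_register_cluster cluster1

# spread the RBD-backed bdevs across the cluster contexts (pool/image names from the fio config above)
scripts/rpc.py bdev_rbd_create nvmeof fio_test_image1 4096 -c cluster0
scripts/rpc.py bdev_rbd_create nvmeof fio_test_image2 4096 -c cluster1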

@xin3liang xin3liang changed the title Multi-rbd performance does not scale well as fio-rbd Multi-RBD performance does not scale well as fio-rbd Nov 11, 2024
@xin3liang xin3liang changed the title Multi-RBD performance does not scale well as fio-rbd Multi-RBD performance does not scale up well as fio-rbd Nov 11, 2024
caroav (Collaborator) commented Nov 11, 2024

We currently create a cluster context for every X images. This is configurable via the "bdevs_per_cluster" parameter in ceph-nvmeof.conf. Note that currently this is done per ANA group (for reasons related to failback and blocklisting), but we are going to make it flat again. So you can set this to 1 if you want one Ceph IO context per image, or to a larger value.
FYI @oritwas @leonidc @baum
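For anyone else tuning this, a minimal sketch of the setting; I am assuming it lives in the [spdk] section of ceph-nvmeof.conf alongside the other SPDK options, so check your sample conf for the exact section:

[spdk]
# one Rados cluster context (Ceph IO context) per RBD-backed bdev
bdevs_per_cluster = 1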

xin3liang (Contributor, Author) commented:

> We currently create a cluster context for every X images. This is configurable via the "bdevs_per_cluster" parameter in ceph-nvmeof.conf. [...] So you can set this to 1 if you want one Ceph IO context per image, or to a larger value.

Sounds cool, thanks @caroav. Will give it a try.
BTW, regarding the configurable parameters in ceph-nvmeof.conf, I think we need to document all of them somewhere.

caroav (Collaborator) commented Nov 11, 2024

> BTW, regarding the configurable parameters in ceph-nvmeof.conf, I think we need to document all of them somewhere.

Yes, I need to update the entire upstream nvmeof documentation. I will do it soon.

xin3liang (Contributor, Author) commented:

After setting bdevs_per_cluster = 1, it scales now. Thanks.
Note: in the test data below, SPDK uses 16 CPU cores.

[charts: multi-RBD scaling results with bdevs_per_cluster = 1 and 16 SPDK CPU cores]
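As a side note, one way to double-check how many Rados cluster contexts the gateway actually registered is SPDK's bdev_rbd_get_clusters_info RPC, assuming it is exposed in your SPDK build; the socket path below is the SPDK default and may differ for the gateway container:

# list the registered rbd cluster contexts and their settings
scripts/rpc.py -s /var/tmp/spdk.sock bdev_rbd_get_clusters_info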
