
perf: px driver enhancements #216

Open · wants to merge 16 commits into base: ln/iouring
Conversation

@sulakshm (Contributor) commented Jul 6, 2021

What this PR does / why we need it:
Second part of the master branch performance improvement.
This patch focuses on improving IO performance between the px driver and userspace, in both directions.

Performance has improved severalfold across the board.

Which issue(s) this PR fixes (optional)
Closes #
https://portworx.atlassian.net/browse/PWX-20971

Special notes for your reviewer:
BVT btrfs https://jenkins.portworx.dev/job/DEV/job/Porx-02/758/ PASS


Lakshmi Narasimhan Sundararajan added 4 commits July 2, 2021 12:25
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
@sulakshm changed the title from Ln/fuseopt to perf: px driver enhancements on Jul 6, 2021
@maxkozlovsky left a comment

Using extra threads for this is overkill. While it may increase performance, it does not justify the extra CPU cost of spinning kernel threads (which essentially removes those cores from processing any other work).

The amount of work done here is pretty minimal: copy the data and call bio_done() for a read request, and call bio_done() for a write request. Eight threads is way too much overhead for the work performed.

What happens if the response is done directly in the original user thread context, without going through the response queue, as in the original code?

We should get rid of this driver-specific response queue anyway while we can, to avoid backward-compatibility issues. Any work done there should be merged with the iouring queue. The read response would eventually be sent directly from the network code for remote reads, and from the IO completion for local requests.
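[Editor's note] For illustration, a minimal sketch of the direct-completion idea being suggested. pxd_request, copy_reply_to_bio and resp_queue are hypothetical names, and bio_endio() stands in for the bio_done() referenced above; this is not code from the PR.

#include <linux/bio.h>

/* Sketch: complete the request in the replying thread's own context
 * rather than bouncing it through the driver-specific response queue. */
static void pxd_reply_direct(struct pxd_request *req, const void *data, size_t len)
{
	if (bio_data_dir(req->bio) == READ)
		copy_reply_to_bio(req->bio, data, len); /* reads: copy payload first */
	bio_endio(req->bio);                            /* then signal completion */
	/* queued variant: list_add_tail(&req->lh, &resp_queue) + wake a worker */
}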

@sulakshm (Contributor, Author) commented Jul 7, 2021

1/ We are already looking at a perf regression in the IO path, and it does not scale. This changeset tries to remove the bottlenecks.
The added resources are active only when IOs are being serviced.

2/ The amount of work may be minimal, but the number of requests will scale as more devices get attached.
Also, px need not be involved in this phase of completion: this change takes ownership of completing requests entirely away from the px userspace process.

> What happens if the response is done directly in the original user thread context without going through the response queue as in the original code?

3/ I had already tried that approach. It still does not scale as expected; with 2.8 as the baseline and "all ctrl reply" as the target, see below.
Taking randwrites out, "all ctrl reply" has 3 profiles above par and 3 below,
while the posted patch has 5 above par and 1 below (1-vol sequential write).

KIOPS

nvol=1    | 2.8 | master | all ctrl reply | perf patch
----------|-----|--------|----------------|---------------------
write     | 172 | 115    | 125            | 113 (-par)
read      | 246 | 246    | 328            | 325 (+par)
randwrite | 87  | 68     | 79             | 67  (-par, marginal)
randread  | 137 | 93     | 85             | 278 (--par)

nvol=40   | 2.8 | master | all ctrl reply | perf patch
----------|-----|--------|----------------|---------------------
write     | 234 | 312    | 325            | 422 (+par)
read      | 420 | 184    | 314            | 515 (--par)
randwrite | 95  | 81     | 70             | 82  (-par, marginal)
randread  | 62  | 73     | 111            | 506 (+par)

4/ Any merging of the response path with the iouring queue is beyond the scope of this patch. Also, my understanding is that read responses carry more than just data (e.g., checksums and timestamps), so direct completion would have to be conditional, and more userspace logic would need to be exposed before that path is ready. For writes, likewise, there may be replicated volumes whose response/quorum logic would need to move into driver code before responding. That is not the focus of this PR.

@maxkozlovsky commented

> I had already tried that approach. It still does not scale as expected

Could you please explain why the same code scales on 2.8 but does not scale on master? What is the difference?

@sulakshm (Contributor, Author) commented Jul 8, 2021

> > I had already tried that approach. It still does not scale as expected
>
> Could you please explain why the same code scales on 2.8 but does not scale on master? What is the difference?

There are two significant reasons why 2.8 does better on some profiles.
1/ A higher percentage of merges happens at the physical drive end. This more than compensates for the slow path in the driver- and target-side IO paths.

2/ master, as is, tries to service all requests from userspace in a single thread running a tight loop (run_queue).
This does not scale. The same holds at the driver end as well.
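[Editor's note] For illustration, a schematic of the single-thread pattern described in 2/. conn, fetch_next_request and service_request are hypothetical names, not code from this PR.

/* Sketch of a run_queue-style userspace service loop: requests from every
 * attached device funnel through one thread, so throughput caps at one core. */
static void *run_queue(void *arg)
{
	struct conn *c = arg;

	for (;;) {
		struct request *req = fetch_next_request(c); /* blocks when idle */
		if (req)
			service_request(c, req);
	}
	return NULL;
}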

Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
@maxkozlovsky commented

The question is: why does replying to requests over the same code path as 2.8 not scale on master?

@sulakshm (Contributor, Author) commented Jul 9, 2021

How is it the same code path?

In 2.8, the complete IO path (still simplified) is:
a) ctrl device fetch,
b) IO scheduling at the coordinator,
c) reach the target through the messenger,
d) kaio module for IO at the target,
e) ctrl device response to complete the IO.

vs in master:
a) memory-mapped IO fetch,
b) IO scheduling at the coordinator,
c) reach the target through the messenger,
d) uring module for IO at the target,
e) memory-mapped IO completion.

Even when master completes IO the way 2.8 does, it still does not match 2.8.
Additionally, master has direct IO for random reads.

Also, in 2.8 new requests are fetched only when the threadpool has no pending requests; in master, new requests are fetched all the time.
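[Editor's note] For illustration, the two fetch policies contrasted above, sketched with hypothetical names (threadpool_has_pending, fetch_new_requests, dispatch_pending); not code from this PR.

/* Sketch: 2.8 gates the fetch on an idle threadpool; master fetches
 * unconditionally on every pass. */
for (;;) {
	if (!threadpool_has_pending(tp))     /* 2.8-style gate */
		fetch_new_requests(conn, tp);
	/* master: the fetch above runs unconditionally, with no gate */

	dispatch_pending(tp);
}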

Lakshmi Narasimhan Sundararajan added 9 commits July 20, 2021 12:53
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
@sulakshm (Contributor, Author) left a comment

added comment.


}

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 11, 0)
@sulakshm: Wrappers to control userspace locking.
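[Editor's note] The hunk above ends inside an unpark loop; for illustration, a sketch of what a matching park/unpark wrapper pair might look like. num_io_workers is a hypothetical field name; only io_worker_thread[] is visible in the hunk.

#include <linux/kthread.h>

/* Sketch: quiesce and resume the kernel IO workers, e.g. around sections
 * where userspace holds a lock the workers must not contend with. */
static void fuse_io_workers_park(struct fuse_conn *fc)
{
	uint32_t i;

	for (i = 0; i < fc->num_io_workers; i++)
		if (fc->io_worker_thread[i])
			kthread_park(fc->io_worker_thread[i]);
}

static void fuse_io_workers_unpark(struct fuse_conn *fc)
{
	uint32_t i;

	for (i = 0; i < fc->num_io_workers; i++)
		if (fc->io_worker_thread[i])
			kthread_unpark(fc->io_worker_thread[i]);
}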

if (fc->io_worker_thread[i]) kthread_unpark(fc->io_worker_thread[i]);
}

int fuse_conn_init(struct fuse_conn *fc, uint32_t max_workers)
@sulakshm: The max_workers argument is passed as part of initialization. It is driven through a module param; the default is zero.
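[Editor's note] For illustration, a sketch of how max_workers might drive worker creation, assuming parked kthreads that the unpark loop shown earlier later releases. pxd_io_worker is a hypothetical name and the error handling is trimmed; this is not the PR's actual implementation.

#include <linux/kthread.h>
#include <linux/err.h>

/* Worker body: parks itself until work arrives, per the kthread contract. */
static int pxd_io_worker(void *arg)
{
	struct fuse_conn *fc = arg;

	while (!kthread_should_stop()) {
		if (kthread_should_park())
			kthread_parkme();       /* sleeps until kthread_unpark() */
		/* drain and complete pending requests for fc ... */
	}
	return 0;
}

int fuse_conn_init(struct fuse_conn *fc, uint32_t max_workers)
{
	uint32_t i;

	for (i = 0; i < max_workers; i++) {
		struct task_struct *t = kthread_create(pxd_io_worker, fc,
						       "pxd-io/%u", i);
		if (IS_ERR(t))
			return PTR_ERR(t);
		kthread_park(t);                /* idle until IO arrives */
		fc->io_worker_thread[i] = t;
	}
	/* max_workers == 0 (the default) creates no threads: offload disabled */
	return 0;
}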


module_param(pxd_num_contexts_exported, uint, 0644);
module_param(pxd_num_contexts, uint, 0644);
module_param(pxd_detect_zero_writes, uint, 0644);
/// specify number of threads for bgio processing
module_param(pxd_offload, uint, 0644);
@sulakshm: The above controls the number of kernel threads for offloading request processing in the kernel; disabled by default.
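[Editor's note] For illustration, a plausible shape for the pxd_offload_threads() accessor used in the next hunk. PXD_MAX_IO_WORKERS is a hypothetical cap; min_t comes from the kernel headers.

/* Sketch: clamp the module param to a sane upper bound. */
static inline uint32_t pxd_offload_threads(void)
{
	return min_t(uint32_t, pxd_offload, PXD_MAX_IO_WORKERS);
}

Since pxd_offload defaults to zero, offload stays disabled unless the module is loaded with the param set, e.g. pxd_offload=8.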

static inline
int pxd_supported_features(void)
{
	int features = 0;
#ifdef __PX_FASTPATH__
	features |= PXD_FEATURE_FASTPATH;
#endif
	if (pxd_offload_threads()) features |= PXD_FEATURE_BGIO;
@sulakshm: Kernel offloading support is exported to userspace as a feature flag.
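[Editor's note] For illustration, the userspace side might gate offload use on the advertised bit like this; how the features word reaches userspace (the driver's init handshake) is not shown in this excerpt.

#include <stdbool.h>
#include <stdint.h>

/* Sketch: check the advertised feature bit before relying on kernel offload. */
static bool pxd_bgio_supported(uint32_t features)
{
	return (features & PXD_FEATURE_BGIO) != 0;
}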
