perf: px driver enhancements #216
base: ln/iouring
Conversation
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Using extra threads for this is overkill. While it may increase performance, it does not justify the extra CPU cost of spinning kernel threads (which essentially removes a core from processing any other work).
The amount of work done here is pretty minimal: copy data and call bio_done() for a read request, and call bio_done() for a write request. 8 threads is way too much overhead for the work performed.
What happens if the response is completed directly in the original user thread context, without going through the response queue, as in the original code?
We should get rid of this driver-specific response queue anyway, while we can, to avoid backward-compatibility issues. Any work done there should be merged with the iouring queue. The read response would eventually be sent directly from the network code for remote reads, and from the IO completion for local requests.
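For orientation, here is a minimal sketch of the response-queue-plus-worker pattern being debated. All names (pxd_resp, pxd_resp_worker, the field layout) are hypothetical, not the driver's actual code, and it assumes a 4.13+ kernel (bi_status) with responses queued from process context:

#include <linux/bio.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Hypothetical response entry; field names are illustrative only. */
struct pxd_resp {
	struct list_head node;
	struct bio *bio;
	blk_status_t status;
};

struct pxd_resp_queue {
	struct list_head head;
	spinlock_t lock;
	wait_queue_head_t waitq;
};

/* Worker loop: drain queued responses and complete the original bios.
 * Per the discussion, the real read path also copies the payload first. */
static int pxd_resp_worker(void *arg)
{
	struct pxd_resp_queue *q = arg;

	while (!kthread_should_stop()) {
		struct pxd_resp *r = NULL;

		wait_event_interruptible(q->waitq,
			!list_empty(&q->head) || kthread_should_stop());

		spin_lock(&q->lock);
		if (!list_empty(&q->head)) {
			r = list_first_entry(&q->head, struct pxd_resp, node);
			list_del(&r->node);
		}
		spin_unlock(&q->lock);

		if (r) {
			r->bio->bi_status = r->status;
			bio_endio(r->bio);	/* the bio_done() step */
			kfree(r);
		}
	}
	return 0;
}

The alternative raised above would skip the queue entirely and invoke the completion directly in the user thread that delivers the response.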
1/ We are already looking at a perf regression in the IO path, and it does not scale. This changeset tries to remove the bottlenecks. 2/ The amount of work may be minimal, but the number of requests will scale as more devices get attached.
3/ I had already tried that approach. It still does not scale as expected, with 2.8 as baseline and ctrl reply as target; see below.
4/ Any merging of the response path with iouring is beyond the scope of this patch. Also, my understanding is that read responses carry more information than just data (such as checksums and timestamps), so the direct path has to be conditional, and more logic from userspace needs to be exposed to make it ready. For writes again, there may be replicated volumes whose response/quorum logic needs to come into the driver code before responding. This is not the focus of this PR.
Could you please explain why the same code scales on 2.8 but does not scale on master? What is the difference?
There are two significant reasons why 2.8 does better in some profiles. 2/ master, as is, tries to service all requests from userspace in a single thread in a tight loop (run_queue).
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
The question is: why does taking the same code path as 2.8 to reply to requests not scale on master?
How is it the same code path? Compare the complete IO path in 2.8 (still simplified) with the path now: even completing IO from master the way 2.8 does still does not match 2.8. Also, in 2.8, new requests are fetched only when the thread pool has no pending requests.
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
added comment.
}

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 11, 0)
Wrappers to control userspace locking.
		if (fc->io_worker_thread[i]) kthread_unpark(fc->io_worker_thread[i]);
}

int fuse_conn_init(struct fuse_conn *fc, uint32_t max_workers)
The 'max_workers' argument gets passed as part of initialization. It is driven through a module param; the default is zero.
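A hypothetical sketch of how the initialization might create max_workers parked IO workers. The struct, worker function, and cap are assumptions inferred from the diff fragments (the real code stores the threads in fc->io_worker_thread[] and unparks them later, as shown above):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/types.h>

#define PXD_MAX_IO_WORKERS 8	/* assumed cap, not from the source */

struct fuse_conn_workers {	/* stand-in for the relevant fuse_conn fields */
	struct task_struct *io_worker_thread[PXD_MAX_IO_WORKERS];
	uint32_t num_workers;
};

static int pxd_io_worker(void *arg)
{
	while (!kthread_should_stop()) {
		if (kthread_should_park())
			kthread_parkme();	/* sleep here while parked */
		/* ... drain the response queue, complete bios ... */
		schedule_timeout_interruptible(1);	/* placeholder pacing;
							 * a real worker would
							 * block on its queue */
	}
	return 0;
}

static int init_io_workers(struct fuse_conn_workers *fc, uint32_t max_workers)
{
	uint32_t i;

	for (i = 0; i < max_workers && i < PXD_MAX_IO_WORKERS; i++) {
		struct task_struct *t;

		t = kthread_create(pxd_io_worker, fc, "pxd-io/%u", i);
		if (IS_ERR(t))
			return PTR_ERR(t);

		kthread_park(t);	/* created parked; unparked on demand */
		fc->io_worker_thread[i] = t;
		fc->num_workers = i + 1;
	}
	return 0;
}

Starting the threads parked means they cost nothing until userspace explicitly enables offload, at which point the unpark loop shown earlier wakes them.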
module_param(pxd_num_contexts_exported, uint, 0644);
module_param(pxd_num_contexts, uint, 0644);
module_param(pxd_detect_zero_writes, uint, 0644);
/// specify number of threads for bgio processing
module_param(pxd_offload, uint, 0644);
The above controls the number of kernel threads for offloading request processing in the kernel. Disabled by default.
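A minimal sketch of a helper over that param. pxd_offload_threads() appears in the feature code below, but its body here, including the clamp to the online CPU count, is an assumption rather than the driver's actual policy:

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/types.h>

extern uint pxd_offload;	/* the module param above; 0 = offload disabled */

static inline uint pxd_offload_threads(void)
{
	/* Assumed policy: cap offload threads at the online CPU count. */
	return min_t(uint, pxd_offload, num_online_cpus());
}

At load time the param would be set the usual way, e.g. insmod px.ko pxd_offload=4 (module name assumed), and the 0644 permissions also allow changing it at runtime via /sys/module/<name>/parameters/pxd_offload.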
static inline
int pxd_supported_features(void)
{
	int features = 0;
#ifdef __PX_FASTPATH__
	features |= PXD_FEATURE_FASTPATH;
#endif
	if (pxd_offload_threads()) features |= PXD_FEATURE_BGIO;
Kernel offloading support is exported to userspace as a feature flag.
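For illustration, a hypothetical userspace check of that flag. The bit values and how 'features' is obtained (presumably from the driver handshake) are assumptions, not the driver's actual ABI:

#include <stdbool.h>

/* Assumed bit positions -- the real values live in the shared pxd header. */
#define PXD_FEATURE_FASTPATH	(1u << 0)
#define PXD_FEATURE_BGIO	(1u << 1)

static bool bgio_supported(unsigned int features)
{
	return (features & PXD_FEATURE_BGIO) != 0;
}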
What this PR does / why we need it:
Second part of the master branch performance improvement work.
This patch focuses on improving IO performance between the px driver and userspace, in both directions.
Performance has improved many-fold across the board.
Which issue(s) this PR fixes (optional)
Closes #
https://portworx.atlassian.net/browse/PWX-20971
Special notes for your reviewer:
BVT btrfs https://jenkins.portworx.dev/job/DEV/job/Porx-02/758/ PASS