
perf: px driver enhancements #216

Open · wants to merge 16 commits into base: ln/iouring
Conversation

@sulakshm (Contributor) commented Jul 6, 2021

What this PR does / why we need it:
Second part of the master branch performance improvement.
This patch focuses on improving IO performance between the px driver and userspace, in both directions.

Performance has improved severalfold across the board.

Which issue(s) this PR fixes (optional)
Closes #
https://portworx.atlassian.net/browse/PWX-20971

Special notes for your reviewer:
BVT btrfs https://jenkins.portworx.dev/job/DEV/job/Porx-02/758/ PASS


Lakshmi Narasimhan Sundararajan added 4 commits July 2, 2021 12:25
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
@sulakshm changed the title from Ln/fuseopt to perf: px driver enhancements on Jul 6, 2021
@maxkozlovsky left a comment

Using extra threads for this is overkill. While it may increase performance, it does not justify the extra CPU cost of spinning kernel threads (which essentially removes those cores from processing any other work).

The amount of work done here is pretty minimal: copy the data and call bio_done() for a read request, and call bio_done() for a write request. Eight threads is way too much overhead for the work performed.

What happens if the response is done directly in the original user thread context, without going through the response queue, as in the original code?

We should get rid of this driver-specific response queue anyway while we can, to avoid backward-compatibility issues. Any work done there should be merged with the iouring queue. The read response would eventually be sent directly from the network code for remote reads, and from the IO completion for local requests.
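[Editor's note] For illustration, a minimal sketch of the direct-completion idea being suggested. pxd_request, copy_reply_to_bio and resp_queue are hypothetical names, and bio_endio() stands in for the bio_done() referenced above; this is not code from the PR.

#include <linux/bio.h>

/* Sketch: complete the request in the replying thread's own context
 * rather than bouncing it through the driver-specific response queue. */
static void pxd_reply_direct(struct pxd_request *req, const void *data, size_t len)
{
	if (bio_data_dir(req->bio) == READ)
		copy_reply_to_bio(req->bio, data, len); /* reads: copy payload first */
	bio_endio(req->bio);                            /* then signal completion */
	/* queued variant: list_add_tail(&req->lh, &resp_queue) + wake a worker */
}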

@sulakshm (Contributor, Author) commented Jul 7, 2021

1/ We are already looking at a perf regression in the IO path, and it does not scale. This changeset tries to remove the bottlenecks.
The added resources are active only when IOs are being serviced.

2/ The amount of work may be minimal, but the number of requests will scale as more devices get attached.
Also, px need not be involved in this phase of completion: this change takes ownership of completing requests entirely away from the px userspace process.

> What happens if the response is done directly in the original user thread context without going through the response queue as in the original code?

3/ I had already tried that approach. It still does not scale as expected; with 2.8 as the baseline and "all ctrl reply" as the target, see below.
Taking randwrites out, "all ctrl reply" has 3 profiles above par and 3 below,
while the posted patch has 5 above par and 1 below (1-vol sequential write).

KIOPS

nvol=1    | 2.8 | master | all ctrl reply | perf patch
----------|-----|--------|----------------|---------------------
write     | 172 | 115    | 125            | 113 (-par)
read      | 246 | 246    | 328            | 325 (+par)
randwrite | 87  | 68     | 79             | 67  (-par, marginal)
randread  | 137 | 93     | 85             | 278 (--par)

nvol=40   | 2.8 | master | all ctrl reply | perf patch
----------|-----|--------|----------------|---------------------
write     | 234 | 312    | 325            | 422 (+par)
read      | 420 | 184    | 314            | 515 (--par)
randwrite | 95  | 81     | 70             | 82  (-par, marginal)
randread  | 62  | 73     | 111            | 506 (+par)

4/ Any merging of the response path with the iouring queue is beyond the scope of this patch. Also, my understanding is that read responses carry more than just data (e.g., checksums and timestamps), so direct completion would have to be conditional, and more userspace logic would need to be exposed before that path is ready. For writes, likewise, there may be replicated volumes whose response/quorum logic would need to move into driver code before responding. That is not the focus of this PR.

@maxkozlovsky commented

> I had already tried that approach. It still does not scale as expected

Could you please explain why the same code scales on 2.8 but does not scale on master? What is the difference?

@sulakshm (Contributor, Author) commented Jul 8, 2021

> > I had already tried that approach. It still does not scale as expected
>
> Could you please explain why the same code scales on 2.8 but does not scale on master? What is the difference?

There are two significant reasons why 2.8 does better on some profiles.
1/ A higher percentage of merges happens at the physical drive end. This more than compensates for the slow path in the driver- and target-side IO paths.

2/ master, as is, tries to service all requests from userspace in a single thread running a tight loop (run_queue).
This does not scale. The same holds at the driver end as well.
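[Editor's note] For illustration, a schematic of the single-thread pattern described in 2/. conn, fetch_next_request and service_request are hypothetical names, not code from this PR.

/* Sketch of a run_queue-style userspace service loop: requests from every
 * attached device funnel through one thread, so throughput caps at one core. */
static void *run_queue(void *arg)
{
	struct conn *c = arg;

	for (;;) {
		struct request *req = fetch_next_request(c); /* blocks when idle */
		if (req)
			service_request(c, req);
	}
	return NULL;
}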

Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
@maxkozlovsky commented

The question is: why does replying to requests over the same code path as 2.8 not scale on master?

@sulakshm (Contributor, Author) commented Jul 9, 2021

How is it the same code path?

In 2.8, the complete IO path (still simplified) is:
a) ctrl device fetch,
b) IO scheduling at the coordinator,
c) reach the target through the messenger,
d) kaio module for IO at the target,
e) ctrl device response to complete the IO.

vs in master:
a) memory-mapped IO fetch,
b) IO scheduling at the coordinator,
c) reach the target through the messenger,
d) uring module for IO at the target,
e) memory-mapped IO completion.

Even when master completes IO the way 2.8 does, it still does not match 2.8.
Additionally, master has direct IO for random reads.

Also, in 2.8 new requests are fetched only when the threadpool has no pending requests; in master, new requests are fetched all the time.
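[Editor's note] For illustration, the two fetch policies contrasted above, sketched with hypothetical names (threadpool_has_pending, fetch_new_requests, dispatch_pending); not code from this PR.

/* Sketch: 2.8 gates the fetch on an idle threadpool; master fetches
 * unconditionally on every pass. */
for (;;) {
	if (!threadpool_has_pending(tp))     /* 2.8-style gate */
		fetch_new_requests(conn, tp);
	/* master: the fetch above runs unconditionally, with no gate */

	dispatch_pending(tp);
}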

Lakshmi Narasimhan Sundararajan added 9 commits July 20, 2021 12:53
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
Signed-off-by: Lakshmi Narasimhan Sundararajan <[email protected]>
@sulakshm (Contributor, Author) left a comment

added comment.


}

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 11, 0)
@sulakshm: Wrappers to control userspace locking.
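[Editor's note] The hunk above ends inside an unpark loop; for illustration, a sketch of what a matching park/unpark wrapper pair might look like. num_io_workers is a hypothetical field name; only io_worker_thread[] is visible in the hunk.

#include <linux/kthread.h>

/* Sketch: quiesce and resume the kernel IO workers, e.g. around sections
 * where userspace holds a lock the workers must not contend with. */
static void fuse_io_workers_park(struct fuse_conn *fc)
{
	uint32_t i;

	for (i = 0; i < fc->num_io_workers; i++)
		if (fc->io_worker_thread[i])
			kthread_park(fc->io_worker_thread[i]);
}

static void fuse_io_workers_unpark(struct fuse_conn *fc)
{
	uint32_t i;

	for (i = 0; i < fc->num_io_workers; i++)
		if (fc->io_worker_thread[i])
			kthread_unpark(fc->io_worker_thread[i]);
}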

if (fc->io_worker_thread[i]) kthread_unpark(fc->io_worker_thread[i]);
}

int fuse_conn_init(struct fuse_conn *fc, uint32_t max_workers)
@sulakshm: The max_workers argument is passed as part of initialization. It is driven through a module param; the default is zero.
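[Editor's note] For illustration, a sketch of how max_workers might drive worker creation, assuming parked kthreads that the unpark loop shown earlier later releases. pxd_io_worker is a hypothetical name and the error handling is trimmed; this is not the PR's actual implementation.

#include <linux/kthread.h>
#include <linux/err.h>

/* Worker body: parks itself until work arrives, per the kthread contract. */
static int pxd_io_worker(void *arg)
{
	struct fuse_conn *fc = arg;

	while (!kthread_should_stop()) {
		if (kthread_should_park())
			kthread_parkme();       /* sleeps until kthread_unpark() */
		/* drain and complete pending requests for fc ... */
	}
	return 0;
}

int fuse_conn_init(struct fuse_conn *fc, uint32_t max_workers)
{
	uint32_t i;

	for (i = 0; i < max_workers; i++) {
		struct task_struct *t = kthread_create(pxd_io_worker, fc,
						       "pxd-io/%u", i);
		if (IS_ERR(t))
			return PTR_ERR(t);
		kthread_park(t);                /* idle until IO arrives */
		fc->io_worker_thread[i] = t;
	}
	/* max_workers == 0 (the default) creates no threads: offload disabled */
	return 0;
}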


module_param(pxd_num_contexts_exported, uint, 0644);
module_param(pxd_num_contexts, uint, 0644);
module_param(pxd_detect_zero_writes, uint, 0644);
/// specify number of threads for bgio processing
module_param(pxd_offload, uint, 0644);
@sulakshm: The above controls the number of kernel threads for offloading request processing in the kernel; disabled by default.
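[Editor's note] For illustration, a plausible shape for the pxd_offload_threads() accessor used in the next hunk. PXD_MAX_IO_WORKERS is a hypothetical cap; min_t comes from the kernel headers.

/* Sketch: clamp the module param to a sane upper bound. */
static inline uint32_t pxd_offload_threads(void)
{
	return min_t(uint32_t, pxd_offload, PXD_MAX_IO_WORKERS);
}

Since pxd_offload defaults to zero, offload stays disabled unless the module is loaded with the param set, e.g. pxd_offload=8.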

static inline
int pxd_supported_features(void)
{
	int features = 0;
#ifdef __PX_FASTPATH__
	features |= PXD_FEATURE_FASTPATH;
#endif
	if (pxd_offload_threads()) features |= PXD_FEATURE_BGIO;
@sulakshm: Kernel offloading support is exported to userspace as a feature flag.
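[Editor's note] For illustration, the userspace side might gate offload use on the advertised bit like this; how the features word reaches userspace (the driver's init handshake) is not shown in this excerpt.

#include <stdbool.h>
#include <stdint.h>

/* Sketch: check the advertised feature bit before relying on kernel offload. */
static bool pxd_bgio_supported(uint32_t features)
{
	return (features & PXD_FEATURE_BGIO) != 0;
}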
