-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Direct IO Support #10018
Direct IO Support #10018
Conversation
21eddef
to
3464559
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to understand the use cases for the various property values. Could we do something simpler like:
directio=standard | always | disabled
where standard
means: if you request DIRECTIO, we’ll do it directly if we think it’s a good idea (e.g. writes are recordsize-aligned), and otherwise we'll do the i/o non-directly (we won't fail it for poor alignment). This is the default.
always
means act like DIRECTIO was always requested (may be actually direct or indirect depending on i/o alignment, won't fail for poor alignment).
disabled
means act like DIRECTIO was never requested (which is the current behavior).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple quick comments.
865bfc2
to
428962a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #10018 +/- ##
===========================================
- Coverage 75.17% 61.94% -13.24%
===========================================
Files 402 260 -142
Lines 128071 73582 -54489
===========================================
- Hits 96283 45578 -50705
+ Misses 31788 28004 -3784
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
f405fed
to
a6894d1
Compare
module/zfs/dmu.c
Outdated
zio = zio_write(pio, os->os_spa, txg, bp, data, | ||
db->db.db_size, db->db.db_size, &zp, | ||
dmu_write_direct_ready, NULL, NULL, dmu_write_direct_done, dsa, | ||
ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, &zb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned about bypassing zio_dva_throttle()
. Background: slides 11-16 and video from my talk at BSDCAN 2016.
This means that DIRECTIO writes will be spread out among the vdevs using the old round-robin algorithm. This could potentially result in poor performance due to allocating from the slowest / most fragmented vdev, and we could potentially make the vdevs even more imbalanced (at least in terms of performance/fragmentation). @grwilson do you have any thoughts on this? How big the impact could be, and potential ways to mitigate? Could we make this use the throttle?
a6894d1
to
04e3a35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is such a large reorganization really needed? Couldn't be things solved by more prototypes at the beginning/in the header files? I'm just asking, because this will make debuging by git blame more difficult.
Overall, I have to say, thanks for taking this one on! This looks like it wasn't trivial to figure out. With regards to I'm excited about this PR, it looks to be a solid basis for support of |
7240ecf
to
7089709
Compare
f6391b6
to
9ddf0d5
Compare
c4e226f
to
b085f0e
Compare
b085f0e
to
72f674a
Compare
static inline boolean_t | ||
zfs_dio_page_aligned(void *buf) | ||
{ | ||
return ((((unsigned long)(buf) & (PAGESIZE - 1)) == 0) ? | ||
B_TRUE : B_FALSE); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is under #ifdef _KERNEL in include/sys, since PAGESIZE may be unavailable in user-space, but still present here in libspl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wound up doing a deep dive not this today, as I forgot why this was even the case of not putting zfs_dio_page_aligned()
under a _KERNEL
guard originally in uio_impl.h
. Way back in the day, I updated zfs_context.h
to account for including uio_impl.h
for the kernel and uio.h
for the user-space code. I just completely forgot about that. I went back and removed including uio_impl.h
where it was not necessary in the latest commit. Also, there is user-space code in ztest that uses zfs_dio_page_aligned()
. Calls to dmu_read()
->dmu_read_impl()
. This just exercises part of the Direct I/O code for reads outside of the kernel pinning user pages. If you still believe there is places that are not covered by the guards though, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original point was that user-space is not supposed to have compile-time defines for PAGESIZE
, etc, since the same user-space binary may run on systems (of just kernel configs) with different page sizes. If you look into lib/libzpool/abd_os.c
, there are completely arbitrary defines for ABD_PAGESIZE
and ABD_PAGEMASK
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. I am open to any suggestions on how to resolve the issue ASAP. Do you have an idea how to get this resolved in a beter way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hoped your answer will be: "Sure, we don't need it, since we do not implement Direct I/O in user-space, lets just stub it." But I see that you randomly set DMU_DIRECTIO
in few places ztest
, so the question looks like: "what part of Direct I/O functionality we can and want to reasonably implement in user-space and how can we give it coherent notion of PAGE*
macros in all places that really need it".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how many platforms with PAGESIZE != 4096 in user-space do we have to cause troubles. From ASAP perspective I guess it might be not a show-stopper, if we do not care about ztest
there right now.
Hello everyone, I tested the direct io branch, and I found that there was no significant improvement in the scenario of 4K IOPS 100% random write. We used 5 enterprise-class NVMe hard drives with 6.4T U-2 interface, created a RAIDZ1, and then created a 1T zvol (4K block size, compressed off, compressed off). checksum off), tested by fio, IOPS in the 50,000-60,000 direct. I would like to ask if this situation is normal |
@xuyufenghz 5-wide RAIDZ1 is not a configuration to measure IOPS performance, neither to store 4KB blocks efficiently. Still 50-60K IOPS sounds low to me, but I can't say if it is an attribute of direct I/O. With such a small data blocks obviously direct I/O can't save much on memory traffic, so the question is what is your real bottleneck. |
@xuyufenghz just wanted to double check:
|
} else if (os->os_direct == ZFS_DIRECT_ALWAYS && (ioflag & O_DIRECT)) { | ||
/* | ||
* Direct I/O was requested through the direct=always, but it | ||
* is not properly PAGE_SIZE aligned. The request will be | ||
* directed through the ARC. | ||
*/ | ||
ioflag &= ~O_DIRECT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be I miss what else may set O_DIRECT
except user and this function, but I think this section should be removed. I can't guess why the fact of direct
set to always
should mean an amnesty for applications using O_DIRECT explicitly but with incorrect alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is an interesting case. With the direct
property setting being set to always
, it is a best attempt at doing acting as though O_DIRECT
was pass on every open
call. However, the main thing with the always
property setting is knowing that alignment restrictions are relaxed. By relaxed, that means that unaligned PAGE_SIZE
I/O just happily gets passed off to the ARC. This was the entire idea behind always
. EINVAL
could never be returned for misalignment. If the users want that kind of strict enforcement of alignment checks, we happily accommodate that. That is the direct
property being set to standard
. In my opinion, the semantics of the dataset property supersede the fact that O_DIRECT
is passed as flag. Otherwise, there would be no reason to even have the always
property value. It is a best attempt, but assurance that it will not fail under misalignment.
module/zfs/zfs_vnops.c
Outdated
uio->uio_extflg |= UIO_DIRECT; | ||
|
||
if (error != 0) | ||
n -= dio_remaining_resid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n -= dio_remaining_resid; | |
n += dio_remaining_resid; |
You've already subtracted dio_remaining_resid
from n
above when calculating it. So in case of error you should add it here, saying that there is more left to read, not subtract again.
Also this error handling does not replicate the error == ECKSUM
case from above. I wonder if instead of code duplication we could repeat the loop once more with little dirt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... That is was dumb logic on my part. I changed this over to +=
. I was looking at the while loop above, but I feel like the code as it stands is fine. However, if you have a better way of refactoring, I am happy to replace the code with anything you could provide.
@bwatkinson Any updates on this? IIRC BRT has per-TXG accounting (see brt_pending_tree), which allows tracking of changes done in each different TXG. DDT AFAIK has no such mechanism, relying on all changes done only in syncing context. |
@bwatkinson @amotin I thought about interaction between O_DIRECT and dedup. So yes, as currently constructed, its assumed that I think the answer might be pretty simple though: just disable dedup for O_DIRECT blocks. This is exactly what we do for ZIL blocks, which are also written outside syncing context. I'd do the same as we do there: turn off dedup, but still use a dedup-capable checksum. See As justification, I think you simply sell it it as a performance feature: the ultimate goal of O_DIRECT is to be fast? From cold, dedup requires a disk read, or a cache hit, to see if its already in the table, and then the writing block has to wait for that lookup, and then there's another write at end of txg to get the updated dedup entry to disk (DDT-ZAP or FDT-log). Performance gets worse on smaller block sizes too, which (as I understand it) is when you're likely to get most benefit from O_DIRECT? If there is a case where you want dedup and O_DIRECT to do the right thing, the workaround would just be to set I can't think of a simple way to extend dedup as it stands to support changes at arbitrary times. Probably it is some sort of per-txg list, but like I say, its pretty deeply baked in and I think it would be effectively a redesign of the entire facility to support it. I'd be surprised if there's any real need or interest. |
@robn thank you for clarifying this. Also, I just noticed I am already passing |
72f674a
to
2735552
Compare
Thank you for your reply. I confirm that the direct property is turned on. I also found that I could not see Direct IO stats increasing when testing with volumes, and I could see Direct IO stats statistics when testing with the same configuration of the file system. Does the volume support the Direct IO property |
Hi,I retested it several times and the IOPS was still very low. I generated a FlameGraph statistic, let's see what went wrong。 |
@xuyufenghz I don't know if it is the case, but I have practice to specify Also for a note: PS: Your svg link is broken. |
2735552
to
6c92501
Compare
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write request the offset and requeset sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that contols the if a O_DIRECT writes that can occur to a top-level VDEV before a checksum verify is run before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issues as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup is not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module paramter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module paramter to 0, it mimics as if the direct dataset property is set to disabled. Signed-off-by: Brian Atkinson <[email protected]> Co-authored-by: Mark Maybee <[email protected]> Co-authored-by: Matt Macy <[email protected]> Co-authored-by: Brian Behlendorf <[email protected]>
6c92501
to
d7b861e
Compare
Happy to learn that this has been merged. But is this compatible with a pool that has L2ARC cache? I do not understand how it would work since the working principles of L2ARC is the caching of data in ARC that is about to be evicted. |
Thrilled to see this finally merged! Can't wait to test on my nvme pool. Thanks everyone for the great work. |
how this work in a mix pool? lets say I have a zpool with hhd, but nvme as special device. If I turn on O_DIRECT, what to happen? can it optimise write to special vdev and raidz hhd separately? |
@liyimeng mirror/stripe of NVMe and raidz of HDDs are two different worlds. You should optimize for the slowest one, i.e. allow caching as much as possible, since it will be the bottleneck, with or without direct I/O to NVMe. BTW, if special vdev is used mostly for metadata, then O_DIRECT just does not apply to it, since metadata are (almost) never written by user directly. |
Adding O_DIRECT support to ZFS.
Motivation and Context
By adding Direct IO support to ZFS, the ARC can be bypassed when issuing reads/writes.
There are certain cases where caching data in the ARC can decrease overall performance.
In particular the performance of ZPool's composed of NVMe devices displayed poor read/write
performance due to the extra overhead of memcpy's issued to the ARC.
There are also cases where caching in the ARC may not make sense such as when data
will not be referenced later. By using the O_DIRECT flag, unnecessary data copies to the
ARC can be avoided.
Closes Issue: #8381
Description
O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests.
This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just
as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will
not be synced until the associated TXG is synced.
For both O_DIRECT read and write request the offset and requeset sizes, at a minimum, must be PAGE_SIZE aligned.
In the event they are not, then
EINVAL
is returned except for in the event the direct property is set to always.For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path.
In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded
from the ARC forcing all further reads to retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered
(in the ARC) it will just be copied from the ARC into the user buffer.
To ensure data integrity for all data written using O_DIRECT, all user pages are made stable in the event one
of the following is required:
Checksum
Compression
Encryption
Parity
By making the user pages stable, we make sure the contents of the user provided buffer can not be changed after
any of the above operations have taken place.
A new dataset property
direct
has been added with the following 3allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed O_DIRECT. In the event the request is not page aligned, it will be redirected through the ARC. All other alignment restrictions are followed.
Direct IO does not bypass the ZIO pipeline, so all checksums, compression, etc. are still
all supported with Direct IO.
Some issues that still need to be addressed:
How Has This Been Tested?
Testing was primarily done using FIO and XDD with striping, mirror, raidz, and dRAID VDEV ZPool's.
Tests were performed on CentOS using various kernel's ranging from 3.10, 4.18, and 4.20.
Types of changes
Checklist:
Signed-off-by
.