RAIDZ Expansion feature #12225
Conversation
Congrats on the progress!
Also congrats on moving this out of alpha. I have a question regarding the section, and it might help to do a little example here: If I have an 8x1TB RAIDZ2 and load 2 TB into it, it would be 33% full. In comparison, if I start with a 4x1TB RAIDZ2 which is full and expand it to 8x1TB RAIDZ2, it would be 50% full. Is that understanding correct? If so, it would mean that one should always start an expansion with as empty a vdev as possible. As this will not always be an option, is there any possibility (planned) to rewrite the old data into the new parity? Would moving the data off the vdev and then back again do the job?
Moving the data off and back on again should do it, I think, since it rewrites the data (snapshots might get in the way of it).
@cornim I think that math is right - that's one of the worst cases (you'd be better off starting with mirrors than a 4-wide RAIDZ2; and if you're doubling your number of drives you might be better off adding a new RAIDZ group). To take another example, if you have a 5-wide RAIDZ1 and add a disk, you'll still be using 1/5th (20%) of the space as parity, whereas newly written blocks will use 1/6th (17%) of the space as parity - a difference of 3%. @GurliGebis Rewriting the blocks would cause them to be updated to the new data:parity ratio (e.g. saving 3% of space in the 5-wide example). Assuming there are no snapshots, copying every file, or dirtying every block (e.g. reading the first byte and then writing the same byte back), would do the trick. If there are snapshots (and you want to preserve their block sharing), you could use
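For anyone looking for a concrete starting point, here's a rough sketch of the send/receive approach (the dataset and snapshot names are made up, this is untested here, and it assumes enough free space in the pool for a full second copy):

```sh
# Hypothetical sketch: rewrite a dataset so its blocks are re-allocated with
# the post-expansion data:parity ratio, while keeping snapshots.
zfs snapshot -r tank/data@rebalance
zfs send -R tank/data@rebalance | zfs recv tank/data-rebalanced
# After verifying the copy, retire the old dataset and rename the new one:
zfs destroy -r tank/data
zfs rename tank/data-rebalanced tank/data
```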
It might be helpful to state explicitly whether extra space becomes available during expansion or only after expansion is completed.
@stuartthebruce Good point. This is mentioned in the commit message and PR writeup:
If it is not too pedantic, how about, "additional space...only after the expansion completes." The current wording leaves open the possibility that space might become available incrementally during expansion.
One question: if I add 5 more drives to a 10-drive pool with this system, the 5 new ones have the new parity, and after that the old 10 are replaced step by step with the replace command. At the end, when the 10 old ones have all been replaced, will the whole raid have the new parity? When replacing the old ones, is the old parity kept or is the new parity used, so we can recover the extra space?
@felisucoibi I'm not sure I totally understand your question, but here's an example that may be related to yours: If you start with a 10-wide RAIDZ1 vdev, and then do 5
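If the elided step above refers to attaching five more disks one at a time (my reading of it), here's a sketch of that scenario with made-up pool and device names:

```sh
# Hypothetical sketch: growing a 10-wide raidz1 by five disks, one at a time.
# Attach one disk, let the expansion finish (watch `zpool status`), then
# attach the next.
zpool attach tank raidz1-0 sdk
zpool status tank        # wait until the expansion completes
zpool attach tank raidz1-0 sdl
zpool status tank
# ...and so on for sdm, sdn, sdo, ending with a 15-wide raidz1
```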
Thanks for the answer. So the only way to recalculate the old blocks is to rewrite the data, like you suggested.
First of all, this is an awesome feature, thank you. If I may ask: why aren't the old blocks rewritten to reclaim some extra space? I can imagine that redistributing data only affects a smaller portion of all data and thus is faster, but the user then still has to rewrite data to reclaim storage space. It would be nice if this could be done as part of the expansion process, as an extra option, for people willing to accept the extra time required. For what it's worth.
ZFS has a philosophy of “don’t mess with what’s already on disk if you can avoid it. If need be go to extremes to not mess with what’s been written (memory mapping removed disks in pools of mirrors for example)”. Someone who wants old data rewritten can make that choice and send/recv, which is an easy operation. I like the way this works now. |
@louwrentius I'm of the same mind - coming from the other direction though, and given the code complexity involved, I was wondering if perhaps the data redistribution component was maybe going to be listed as a subsequent PR...? I'd looked for an existing one in the event it was already out there, but it could be something that's already thought of/planned and I just wasn't able to locate it. Given the number of components involved and the complexity of the operations that'd be necessary, especially as it'd pertain to memory and snapshots, I could see it making sense to split the tasks up. I'm imagining something like -
To me at least, the more I think about this, the more sense it'd make to have a 'pool level' rebalance/redistribution, as all existing data within a pool's vdevs is typically reliant upon one another. It'd certainly seem to simplify things compared to what I'm describing above I'd think. It also helps to solve other issues which've been longstanding, especially as it relates to performance of long lived pools, which may've had multiple vdevs added over time as the existing pool became full. Anyway, I don't want to ramble too much - I could just see how it'd make sense at least at some level to have data redistribution be another PR. |
Having an easily accessible command to rewrite all the old blocks, or preferably an option to do so as part of the expansion process would be greatly appreciated. |
@louwrentius @teambvd @yorickdowne @mufunyo I think y'all are getting at a few different questions:
All that said, I'd be happy to be proven wrong about the difficulty of this! Such a facility could be used for many other tasks, e.g. recompressing existing data to save more space (lz4 -> zstd). If anyone has ideas on how this could be implemented, maybe we can discuss them on the discussion forum or in a new feature request. Another area that folks could help with is automating the rewrite in restricted use cases (by touching all blocks or
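As a very rough illustration of the "touch all blocks" idea for the simplest case (no snapshots or clones referencing the old blocks), here's an untested sketch; the path is a placeholder and it temporarily needs space for one extra copy of each file:

```sh
# Hypothetical sketch: rewrite every file in place so its blocks are
# re-allocated at the current (post-expansion) data:parity ratio.
# Note: this breaks hard links and skips special files; test on
# non-critical data first.
find /tank/data -type f -print0 | while IFS= read -r -d '' f; do
    cp -p -- "$f" "$f.rewrite.tmp" && mv -- "$f.rewrite.tmp" "$f"
done
```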
I'd just like to state that I greatly, GREATLY appreciate the work you're doing. Frankly, this is one of the things holding lots of people back from using ZFS, and having the ability to grow a ZFS pool without having to add vdevs will be literally magical. I will be able to easily switch back to ZFS after this is complete. Again, THANK YOU VERY MUCH!!
@ahrens As someone silently following the progress since the original PR, I also want to note that I really appreciate all the effort and commitment you have put and are putting into this feature! I believe once this lands, it'll be a really valuable addition to ZFS :) Thank you
@kellerkindt @Jerkysan @Evernow Thanks for the kind words! It really makes my day to know that this work will be useful and appreciated! ❤️ 😄 |
Thanks for your work on this feature. It's exciting to finally see some progress in this area, and it will be useful for many people once released. Do these changes lay any groundwork for future support for adding a parity disk (instead of a data disk - i.e., increasing the RAID-Z level)? Meaningfully growing the number of disks in an existing array would likely trigger a desire to increase the fault tolerance level as well. Since the existing data is just redistributed, I understand that the old data would not have the increased redundancy unless rewritten. But I am still curious if your work that allows supporting old/new data+parity layouts simultaneously in a pool could also apply to increasing the number of parity disks (and algorithm) for future writes. |
I'm also incredibly stoked for this issue, and understand the decision to separate reallocation of existing data into another FR. Thanks so much for all of the hard work that went into this. I can't wait to take advantage of it. After performing a raidz expansion, is there at least an accurate mechanism to determine which objects (files, snapshots, datasets... I admit I haven't fully wrapped my head around how this impacts things, so apologies for not using the correct terms) map to blocks with the "old" data-to-parity ratio, and possibly calculate the resulting space increase? I imagine many administrators want a balance between maximizing space, benefitting from everything ZFS offers (checksumming, deduplication, etc), and the flexibility of expanding storage (yes, we want to eat all the cakes), and will naturally compare this feature to other technologies, such as md raid, where growing a raid array triggers a recalculation of all parity. As such, these administrators will want to be able to plan out how to do the same on a zpool with an expanded raidz vdev without just blindly rewriting all files.
Yes, that's right. This work allows adding a disk, and future work could be to increase the parity. As you mentioned, the variable, time-based geometry scheme of RAIDZ Expansion could be leveraged to know that old blocks have the old amount of parity, and new blocks have the new amount of parity. That work would be pretty straightforward. However, the fact that the old blocks remain with the old amount of failure tolerance means that overall you would not be able to tolerate an increased number of failures, until all the old blocks have been freed. So I think that practically, it would be important to have a mechanism to at least observe the amount of old blocks, and probably also to reallocate the old blocks. Otherwise you won't actually be able to tolerate any more failures without losing [the old] data. This would be challenging to implement in the general case but as mentioned above there are some OK solutions for special cases. |
There isn't an easy way to do this, but you could get this information out of |
@ahrens I also wish to thank you for the awesome work! I have some questions:
Thanks. |
curious what filesystem or pooling / RAID setup these imaginary people went with instead; presumably it also checks most of the boxes that ZFS does? volume manager, encryption provider, compressing filesystem, with snapshots etc.. |
The answer is "we settled"... I settled for unraid though until that point I had run ZFS for years and years. I could no longer afford to buy enough drives all at once to build a ZFS box outright. I had life obligations that I had to meet while wanting to continue to my hobby that seemingly never stops expanding in cost frankly. I needed something that could grow with me. It just simply doesn't give me the compressing file system, snapshots, and yada yada. Basically, I had to make sacrifices to continue my hobby without literally breaking the bank. This functionality will allow me to go back to ZFS. I don't care if I have to move all the files to get them to use the new drives I'm adding and such. It's not "production critical" but I do want all the "nice things" that ZFS offers. This project will literally "give me my cake and let me eat it to". I've been waiting since it was first announced years ago and hovering over the searches waiting to see a new update. I'm already trying to figure out how I'm going to slide into this. |
@kocoman1 Is that on Linux, or did you get it to compile on macOS? 😄 |
To answer this question directly - no. More work and testing is required. I'd say we're at least 12 months out. Come back in March 2024 and see how things are going. |
How hard is it to jump into the project for new devs?
I didn't try again on macOS (it was on Linux) after I couldn't even get an ls of the files I lost. The SAS2008 does work with an external kext on Ventura. I added extfs/apfs but it sometimes reboots during load; not sure where the bug is. Also, unmounting takes so long that it panics when I shut the machine down on macOS, and I have a bunch of STDM3000 and 2.5-inch 4000-series SMR drives that are dying.
So I would love RaidZ expansion for a personal project, and am hoping to try it. I pulled the PR branch and tried to build from source, but the branch was too out-of-date to build against my current Ubuntu version (most recent public release). So instead I am considering the following:
But I am missing the context to know if this is a good idea. So a few questions:
I don't have the spare time to pick up a new codebase to help with rebasing the branch, so I might as well make myself a guinea pig next time I've got some free time |
The level of risk is directly proportionate to your ability to restore the data from a backup. |
I am a novice user of FreeBSD and not so long ago I started using ZFS for my experiments. I am looking forward to the appearance of this functionality! Many thanks for your work! |
I had data loss after expanding it about 5 times (expand, copy some files from the drives to be erased and expanded onto, etc.). Then, when I read the data, I get a ZFS panic and any ls just hangs. I was able to mount read-only, but doing ls in some directories just results in an error, and copying data back gets checksum and I/O abort errors... so it's not ready, I think, unless they've fixed something.
@kocoman1 Could you provide some more detailed error description? This might help pinpoint the issue. At minimum:
It's hidden in "kocoman1 commented on Sep 29, 2021":
VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
[Wed Sep 29 22:13:28 2021] Call Trace:
[Wed Sep 29 22:13:28 2021] show_stack+0x52/0x58
Is there anything specific volunteers could do to aid in testing this feature? What would help the most? Also, would it be possible to rebase this branch on master? There is a great number of merge conflicts right now. This makes it a lot more difficult to test this branch. Also, I am suspecting that it will render any tests that are being conducted more meaningless, since features already present in upstream won't be tested together with RAIDZ expansion. In other words, the more outdated this branch becomes, the less sense it makes to use it or to rely on it. |
You can try what I did that caused the error: have a bunch of data that can be lost, initially create a raidz-expansion-capable pool with raidz2 (for mine, anyway) with the minimum 3(?) drives, then add another drive, copy more data in until full (most important), and rinse and repeat until you reach 10 drives to see if any errors occur during the process. You can also scrub after each expansion.
Just to point out for anyone that would like to test this, but doesn't have much spare hardware... testing this in virtual machines is a workable idea. eg virtual machine with virtual disks. Use them to simulate adding more disks, pulling the plug on some while it's doing stuff etc. My problem is lack of time currently. 👼 |
One of the great powers of ZFS has been the ability to supply fake file-backed disks, used in testing. Might still bring down the kernel though if not in a VM. |
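For anyone who wants to try that without spare hardware, here's a rough, untested sketch using file-backed vdevs (paths, sizes, and the pool name are arbitrary, and this is for throwaway test data only):

```sh
# Hypothetical test loop with sparse file-backed "disks"; everything is disposable.
truncate -s 2G /var/tmp/zxp0 /var/tmp/zxp1 /var/tmp/zxp2 /var/tmp/zxp3
zpool create testpool raidz2 /var/tmp/zxp0 /var/tmp/zxp1 /var/tmp/zxp2
# ...copy throwaway data in until the pool is nearly full...
zpool attach testpool raidz2-0 /var/tmp/zxp3   # expand by one "disk"
zpool status testpool                          # watch expansion progress
zpool scrub testpool                           # verify once it completes
```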
Realistically I don't think the bottleneck is the lack of testing, but rather the lack of code reviews (AFAIK, except for a couple of PRs, there isn't any other kind of feedback to address) and mostly @ahrens himself being busy with other stuff (the previously mentioned PRs have been pending for more than a year). So I suggest either waiting until his priorities align with ours or being prepared to contribute in some meaningful way. This feature was teased as possibly being addressed in the next major version of ZFS (I don't remember which company was interested in it), but honestly I'm starting to highly doubt it.
Is there a tutorial for that? I want to combine a 1TB and a 2TB drive to make 3TB to replace a failing 3TB drive.
You can supply the two drives as top-level vdevs and ZFS will essentially stripe across them. If either drive (1TB or 2TB) fails you will lose all the data, though, and any errors will be uncorrectable in this configuration. Please ask support questions unrelated to this feature on a public forum like https://www.reddit.com/r/openzfs/
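To make that warning concrete, a minimal sketch of the layout (pool and device names are placeholders):

```sh
# Hypothetical: stripe a 1TB and a 2TB disk into roughly 3TB of space,
# with no redundancy at all.
zpool create scratch /dev/sdb /dev/sdc
zpool list scratch    # SIZE is roughly the sum of both disks
```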
I've been waiting for this feature to be merged since the beginning. It seems the FreeBSD folks announced this as finished (one year ago).
Please read the thread before replying. As per the comment three comments before yours:
This work is functional but requires both cleanup and code reviews. Other people have also experienced bugs during recent testing. In order for this to be done there has to be time for @/ahrens and others to work on it, but they are very busy with other priorities. There are a lot of people subscribed to this issue and it is a waste of everyone's time to post about things which have already been answered. |
I try to believe in open-source projects with all my will, but situations like this PR create great demotivation. Many projects, and sometimes important features, depend on one specific person who does not continue what they started, creating false hope for users. I feel very frustrated and anxious (maybe not only me). The author last committed code on 25 Feb 2022, more than one year ago; then we got some changes from @fuporovvStack, and the last activity was on 16 Nov 2022, almost 6 months ago. Also, the maintainers and developers do not communicate anything about intentions or status. It looks to me like there is no intention to continue it. I feel sad that this project does not seem sustainable, committed, and respectful, even with hundreds of people interested in it. For me this feature is stale/abandoned, and I will better invest my time finding an alternative FS that fulfills my requirements. If someone else feels the same way as me, we can try to find a better solution together, since I don't have enough knowledge of the internals of this project to fix the ZFS code (and it seems the people who do don't really want to).
May I suggest adding this to the agenda then on the leadership meeting on the 25th of April (5 days from now)? Details here: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit |
Could someone try and find a solution to limit spam in this PR? Most of the comments are just noise. (including mine, sorry!) |
Recording of said leadership meeting: https://youtu.be/sZJMFvjqXvE?t=1490 |
Thanks for giving us an update in that meeting @behlendorf Here's the transcript from youtube:
Is it happening?
Pleased to announce that iXsystems is sponsoring the efforts by @don-brady to get this finalized and merged. Thanks to @don-brady and @ahrens for discussing this on the OpenZFS leadership meeting today. Looking forward to an updated PR soon.
@don-brady has taken over this work and opened #15022. Thanks Don, and thanks iXsystems for sponsoring his work! See the June 2023 OpenZFS Leadership Meeting for a brief discussion of the transition. |
Motivation and Context
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful for
small pools (typically with only one RAID-Z group), where there isn't
sufficient hardware to add capacity by adding a whole new RAID-Z group
(typically doubling the number of disks).
For additional context as well as a design overview, see my talk at the 2021 FreeBSD Developer Summit (video) (slides), and a news article from Ars Technica.
Description
Initiating expansion
A new device (disk) can be attached to an existing RAIDZ vdev by running
`zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`.
The new device will become part of the RAIDZ group. A "raidz expansion" will
be initiated, and the new device will contribute additional space to the RAIDZ
group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool. In
other words, pools with expanded RAIDZ vdevs can not be imported by older
releases of the ZFS software.
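As a hedged example (the pool name is illustrative, and on freshly created pools feature flags are typically enabled by default), the flag can be enabled and checked with:

```sh
# Enable the raidz_expansion feature on an existing pool and verify its state.
zpool set feature@raidz_expansion=enabled tank
zpool get feature@raidz_expansion tank   # "enabled", then "active" after an expansion
```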
During expansion
The expansion entails reading all allocated space from existing disks in the
RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including
the newly added device).
The expansion progress can be monitored with `zpool status`.
.Data redundancy is maintained during (and after) the expansion. If a disk
fails while the expansion is in progress, the expansion pauses until the health
of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting
for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
After expansion
When the expansion completes, the additional space is available for use, and is
reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc).
, etc).Expansion does not change the number of failures that can be tolerated without
data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old data-to-parity
ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but distributed among the
larger set of disks. New blocks will be written with the new data-to-parity
ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide has 4 data
to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not
change, so slightly less space than expected may be reported for newly-written
blocks, according to `zfs list`, `df`, `ls -s`, and similar tools.
Manpage changes
zpool-attach.8:
Status
This feature is believed to be complete. However, like all PRs, it is subject
to change as part of the code review process. Since this PR includes on-disk
changes, it shouldn't be used on production systems before it is integrated to
the OpenZFS codebase. Tasks that still need to be done before integration:
Acknowledgments
Thank you to the FreeBSD Foundation for
commissioning this work in 2017 and continuing to sponsor it well past our
original time estimates!
Thanks also to contributors @FedorUporovVstack, @stuartmaybee, @thorsteneb, and @Fmstrat for portions
of the implementation.
Sponsored-by: The FreeBSD Foundation
Contributions-by: Stuart Maybee [email protected]
Contributions-by: Fedor Uporov [email protected]
Contributions-by: Thorsten Behrens [email protected]
Contributions-by: Fmstrat [email protected]
How Has This Been Tested?
Tests added to the ZFS Test Suite, in addition to manual testing.
Types of changes
Checklist:
`Signed-off-by`.