Fast Dedup #15896

allanjude · 2024-02-14T14:28:04Z

allanjude
Feb 14, 2024
Collaborator

Fast Dedup Review Guide

Hello, dear reviewer, and welcome to the “Fast Dedup” project, brought to you by Klara and iXsystems.

This discussion stands as an overview of the entire project and as a kind of guide for reviewers. We hope it’s useful!

Overview

“Fast Dedup” is an umbrella project for a significant upgrade of the original OpenZFS block deduplication system. It’s composed of multiple logical changes:

Cleanup and documentation of the existing dedup code, creating a good base to build from. (Fast Dedup: Cleanup and documentation ahead of integrating Fast Dedup #15887)
ZAP shrinking: allows ZAPs of all kinds (dedup and others) to reclaim some of their space after a large number of entries are deleted. (Fast Dedup: ZAP Shrinking #15888)
Dedup quota: allows the operator to set a maximum size on dedup tables, which when reached will stop creating new entries, converting dedup writes for new blocks into regular writes. (Fast Dedup: Dedup Quota #15889)
Dedup prefetch: adds a new zpool prefetch command that loads dedup tables into the ARC, improving performance from cold. (Fast Dedup: DDT Prefetch #15890)
Table container format: allows a dedup table to be composed of multiple different kinds of objects (rather than simply ZAPs), and be self-configuring, allowing some kinds of new features to be added in the future without needing to discard existing dedup tables. (Fast Dedup: Introduce the FDT on-disk format and feature flag #15892)
“Flat” entry format: a new smaller data format for in-memory and on-disk table entries. (Fast Dedup: “flat” DDT entry format #15893)
FDT storage class: ensures the fast dedup data is able to use the dedup vdev class. (Fast Dedup: dnode: allow storage class to be overridden by object type #15894)
Dedup log: adds a journal to a dedup table, allowing fast updates and vastly reducing IO and memory overhead of the overall dedup system. (Fast Dedup: FDT-log feature #15895)
Dedup prune: adds the ability to remove older entries from the UNIQUE dedup table to allow continued use of dedup under the dedup quota feature. (Fast Dedup: prune unique entries #16277)

These features are designed to peacefully co-exist with the original dedup system. A pool with an existing dedup table will continue to work exactly the same as it always has with a Fast Dedup-capable build of OpenZFS. If the fast_dedup feature is enabled on such a pool, new dedup tables will be created with all Fast Dedup features available, but the old ones will continue to work as they always have.

Trying it

Warning

Do not use this code on a production pool. The on-disk format changes are not yet finalized or stable, and are not compatible with stable OpenZFS releases, and the compatibility code for traditional dedup tables may not be stable.

The whole combined FDT code is on the fdt-rel branch of the KlaraSystems/zfs repository:

$ git clone -b fdt-rel https://github.com/KlaraSystems/zfs.git

Once built and running, enable the fast_dedup feature to use it.

Using it should be exactly the same: enable the dedup= option on a new dataset, and you get transparent block-level deduplication as before. It should just be more efficient.

Standard dedup-related inspection tools like zpool status -D... and zdb -D... should work the same as before, just show more kinds of dedup objects, and different sizings.

New tools are available to invoke the prefetch and prune features:

zpool prefetch -t ddt <pool>
zpool ddtprune <pool>

These are documented in zpool-prefetch(8) and zpool-ddtprune(8).

There are some new pool properties:

dedupcached
dedup_table_quota
dedup_table_size

These are documented in zpoolprops(7).

There’s a collection of new kstats in the pool kstats, eg /proc/spl/kstat/zfs/tank/ddt_stats_sha256:

21 1 0x01 17 4624 6674871945 616054344464
name                            type data
lookup                          4    39
lookup_new                      4    2
lookup_existing                 4    7
lookup_live_hit                 4    30
lookup_live_wait                4    0
lookup_live_miss                4    9
lookup_log_hit                  4    7
lookup_log_active_hit           4    7
lookup_log_flushing_hit         4    0
lookup_log_miss                 4    2
lookup_stored_hit               4    0
lookup_stored_miss              4    2
log_active_entries              4    0
log_flushing_entries            4    0
log_ingest_rate                 4    0
log_flush_rate                  4    0
log_flush_time_rate             4    0
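As a rough illustration of how these counters could be read, here is a minimal Python sketch that parses the kstat text and computes a hit ratio for each lookup stage. The sample is an abridged copy of the listing above; the helper names `parse_kstat` and `hit_ratio` are made up for this example and are not part of any ZFS tooling. In this particular sample the numbers happen to cascade (live misses equal log hits plus log misses, and log misses equal stored hits plus stored misses), which suggests lookups fall through from live entries to the log to the stored table, but that reading is an inference from one sample.

```python
# Abridged copy of the kstat listing above (lookup counters only).
SAMPLE = """\
21 1 0x01 17 4624 6674871945 616054344464
name                            type data
lookup                          4    39
lookup_live_hit                 4    30
lookup_live_miss                4    9
lookup_log_hit                  4    7
lookup_log_miss                 4    2
lookup_stored_hit               4    0
lookup_stored_miss              4    2
"""


def parse_kstat(text):
    """Parse named-kstat text into a {name: int} dict."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) == 3:
            name, _type, value = fields
            stats[name] = int(value)
    return stats


def hit_ratio(stats, stage):
    """Fraction of lookups at one stage ('live', 'log', 'stored') that hit."""
    hits = stats[f"lookup_{stage}_hit"]
    misses = stats[f"lookup_{stage}_miss"]
    total = hits + misses
    return hits / total if total else 0.0


stats = parse_kstat(SAMPLE)
for stage in ("live", "log", "stored"):
    print(f"{stage:7s} {hit_ratio(stats, stage):.2f}")
```

On a running system the same parser could be fed the contents of the kstat file named above instead of `SAMPLE`.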

There’s also a collection of new tuneables:

dmu_ddt_copies
zfs_dedup_log_flush_rounds_max
zfs_dedup_log_flush_min_time_ms
zfs_dedup_log_flush_entries_min
zfs_dedup_log_flush_flow_rate_txgs
zfs_dedup_log_txg_max
zfs_dedup_log_mem_max
zfs_dedup_log_mem_max_percent
zap_shrink_enabled

These are documented in zfs(4).

Review guide

All these changes are interconnected but not all are directly related. This would make reviewing them as a single mass extremely difficult, which in turn increases the likelihood that the changes will either be waved through with bugs, or languish in the issue tracker forever.

To make review easier, we’ve tried to layer the patch stack into a logical series of changes, each building on the previous ones. The intent is that they can be reviewed in order, with the reviewer’s understanding of the changes and the system as a whole growing with each commit.

Our intent is that as the earlier PRs are reviewed and updated based on review feedback, the later ones will be rebased and pushed to match, and the fdt-rel combined branch updated too. Some of the earlier PRs that do not affect the on-disk format could be merged as they are approved, while the later ones we expect will be approved and “locked in place”, and once everything above them is approved, the whole lot can be merged.

(Unfortunately, GitHub can’t easily handle a stack of PRs and show only the changes between each one, so the later ones all show the commits for the earlier ones.)

It’s worth noting that at time of writing, ZTS coverage is still limited. There has been testing of course, both for performance and for function, but not enough to cover everything. We are fully expecting and intending that more tests will be created before these PRs are merged, and that work will happen within the scope of each PR.

PR list in review order:

#15887 dedup: cleanup and document

This is a collection of cleanups, refactors and documenting the existing dedup system. There should be no functional changes here at all, and we expect that this PR could be merged almost immediately without controversy.

#15888 zap: add shrinking support

This is a standalone PR that allows ZAPs to be shrunk, by collapsing empty sibling leaf blocks. It could provide a nice space improvement for high-churn ZAPs (eg ZPL directories), and is a prerequisite for the quota and prune features, as there’s no point pruning entries if we can’t reclaim the space they would use.

#15889 dedup: quota

This allows a quota to be set for the on-disk dedup table, with dedup effectively “disabled” for new entries once the quota is reached. This is the first time it’s ever been possible for a block created in a dedup-enabled dataset to not be deduplicated and not have a D bit set, and internally, the first time ddt_lookup() has ever been able to return NULL, so it represents something of a departure and is important to understand.

This is positioned in the stack before the on-disk format changes because this does not require a format change, and can work just fine on traditional dedup tables.
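To make the quota behaviour concrete, here is a toy Python model of the write path described above. It is only a sketch: `ToyDedupTable`, `write_block`, and the entry-count quota are hypothetical stand-ins (the real quota is a size limit on the on-disk table, and the real lookup is `ddt_lookup()` in C). The point it illustrates is that a lookup can now come back empty, in which case the block is written as a regular, non-dedup block, while blocks that already have an entry keep deduplicating.

```python
class ToyDedupTable:
    """Toy model only; not the OpenZFS data structures."""

    def __init__(self, max_entries):
        # Assumption: quota expressed as an entry count for simplicity.
        self.max_entries = max_entries
        self.refcounts = {}  # checksum -> reference count

    def lookup(self, checksum):
        """Return the entry key, creating it if there is room; None if over quota."""
        if checksum in self.refcounts:
            return checksum
        if len(self.refcounts) >= self.max_entries:
            return None  # analogous to ddt_lookup() returning NULL
        self.refcounts[checksum] = 0
        return checksum


def write_block(table, checksum):
    """Describe how a block with this checksum would be written."""
    entry = table.lookup(checksum)
    if entry is None:
        return "regular write (no D bit)"
    table.refcounts[entry] += 1
    if table.refcounts[entry] == 1:
        return "dedup write (new entry)"
    return "dedup write (existing entry)"


table = ToyDedupTable(max_entries=2)
results = [write_block(table, c) for c in ["a", "b", "a", "c"]]
for line in results:
    print(line)
```

With room for two entries, "a" and "b" get new entries, the second "a" is deduplicated, and "c" falls back to a regular write because the table is full.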

#15890 dedup: prefetch

This invokes the regular DMU prefetch code to get all dedup tables into the ARC, to try to reduce the time after importing the pool that performance suffers because most of the dedup tables are not in memory.

#15892 dedup: add fast_dedup feature and support for traditional and new on-disk formats

This adds the core of the fast_dedup feature itself: the “container” object for the table, which includes its config, and any objects that form the table as a whole. It’s designed to be extendable separately from the pool feature flag, that is, new FDT “subfeatures” could be added to an individual DDT without needing to break backward compatibility for all DDTs in the pool.

Reviewing this here means understanding the basic structure that the “flat entry” and “log” PRs fit into, since they each add a Fast Dedup “subfeature”.

#15893 dedup: “flat” entry format

This adds the “flat phys” subfeature, which reduces the in-memory and on-disk size of an individual entry by reducing the number of blocks stored in a single dedup entry from 4 to 1, and by removing and reorganising things that aren’t needed in every entry.

This is where you will see the gymnastics required to retain compatibility with traditional dedup tables.

A significant chunk of this is in the IO pipeline in zio_ddt_write(), which is a quite involved rewrite to allow a single dedup entry to be “extended” with new DVAs.

#15894 dnode: allow storage class to be overridden by object type

This is a small standalone PR that provides a mechanism needed by the log feature. It’s included as a separate PR because the method may require a specific discussion.

#15895 dedup: log feature

This adds the “log” subfeature, which is a fast append-only on-disk object intended to buffer changes to the dedup ZAPs to allow them to be updated in batches, over multiple transactions, without competing with true user IO. This feature is very involved, mostly in the flushing machinery. Hopefully by now it will be clear how it fits with everything else.
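The general idea of buffering changes in a log and applying them in bounded batches can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the FDT-log implementation: `ToyLoggedTable` and its methods are invented for illustration, the "log" here keeps only the latest value per key, and the real feature has separate active and flushing logs plus rate control that this sketch ignores.

```python
from collections import OrderedDict


class ToyLoggedTable:
    """Toy model of a table fronted by a change log; illustration only."""

    def __init__(self):
        self.stored = {}          # stands in for the dedup ZAPs
        self.log = OrderedDict()  # stands in for the log (oldest change first)

    def update(self, checksum, refcount):
        """Record a change in the log instead of touching the stored table."""
        self.log[checksum] = refcount
        self.log.move_to_end(checksum)

    def lookup(self, checksum):
        """Check the log first, then fall back to the stored table."""
        if checksum in self.log:
            return self.log[checksum]
        return self.stored.get(checksum)

    def flush(self, max_entries):
        """Apply at most max_entries logged changes to the stored table."""
        flushed = 0
        while self.log and flushed < max_entries:
            checksum, refcount = self.log.popitem(last=False)
            self.stored[checksum] = refcount
            flushed += 1
        return flushed


demo = ToyLoggedTable()
for checksum in ("a", "b", "c"):
    demo.update(checksum, 1)

print(demo.flush(max_entries=2))   # number of changes applied in this batch
print(sorted(demo.stored), list(demo.log))
```

Each call to `flush` would correspond to the limited amount of work done per transaction group, so a burst of updates is spread over several transactions instead of hitting the stored table all at once.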

#16277 dedup: prune

This adds a facility to remove unused unique entries from the dedup table, shrinking it down to make it more efficient to update. It is positioned last in the patch stack because it requires an on-disk format change to the dedup entry, subtly changes the meaning of the D block pointer flag, and requires some delicate interactions with the dedup log.

Discussion

General discussion about the OpenZFS dedup feature as a whole and feedback on the above can be included on this issue below. Specific feedback and review of the individual PRs should go to those PRs, to make sure we don’t miss anything.

leelists · 2024-08-23T07:35:07Z

leelists
Aug 23, 2024

I gave it a try, and it behaved very well until I deduped 7 TB on a 32 GB system, and the DDT ate all the system RAM :-(
Is there any tunable to make it swap out memory?

1 reply

allanjude Aug 25, 2024
Collaborator Author

There isn't a way to swap out kernel memory, however, one of the core features of Fast Dedup is the dedup quota mechanism.

This allows you to limit the size of the DDT so that it will not take all of your memory. Note however that what this does is stop deduping new blocks once you reach the configured maximum size of the DDT. This ensures the DDT will fit in memory so that it will always be fast.

Just run:

zpool set dedup_table_quota=16G $poolname

and ZFS will not allow the DDT to grow beyond 16GB.

You can also see the current size with:

zpool get dedup_table_size

The last pull request in this series, which has not been merged yet, also adds the ability to prune the DDT to remove entries for which there is only a single copy (not deduped). This can be used to reduce the size of an existing DDT, or to keep the DDT under the quota limit, replacing "old" unused dedup entries with new ones that might have a better chance of finding duplicates.

hanneskasp · 2024-10-09T08:39:51Z

hanneskasp
Oct 9, 2024

Hello,
I gave it a try with 2.3.0-RC1 and it feels like it's not working in "real life". If I copy a file, then I get 100% dedupe ratio. So it's technically working I would say. But if I do full backups of the same machine twice, then I get no space savings from deduplication.

If I compare it with commercial dedupe appliances, then the dedupe ratio is close to 100% for the second full backup. For the initial backup, it's around 22GB vs. 19GB. That difference is okay I would say. The commercial appliance is around 15% better. That is fair for me.

The test machine that I'm backing up is a Windows Domain Controller that idles (lab environment).

This is the space usage after one backup

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup 992G  22.3G   970G        -         -     1%     2%  1.00x    ONLINE  -

This is the space usage after two backups

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup 992G  44.4G   948G        -         -     1%     4%  1.00x    ONLINE  -

when I then copy one of the backup files (one backup file = one machine in my case), then the dedupe ratio goes up

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup   992G  44.5G   948G        -         -     1%     4%  1.50x    ONLINE  -

the DDT has entries, so I think it should do something

# zpool status -D
  pool: backup
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          sdb       ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 533500, size 265M on disk, 191M in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     260K   32.6G   21.6G   21.6G     260K   32.6G   21.6G   21.6G
     2     260K   32.5G   21.6G   21.6G     520K   65.0G   43.2G   43.2G
     4      717   89.6M   74.4M   74.4M    2.91K    372M    305M    305M
     8       72      9M   5.96M   5.96M      800    100M   66.1M   66.1M
    16       43   5.38M   3.76M   3.76M      916    114M   80.0M   80.0M
    32       15   1.88M   1.20M   1.20M      638   79.8M   50.8M   50.8M
    64        7    896K    608K    608K      531   66.4M   45.0M   45.0M
 Total     521K   65.1G   43.3G   43.3G     786K   98.2G   65.4G   65.4G
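The Total row can be related to the DEDUP column of zpool list with a little arithmetic: the ratio is, roughly, referenced DSIZE divided by allocated DSIZE. The Python sketch below redoes that calculation from the rounded figures in the table, so it only approximately matches the 1.50x shown earlier; the function name `dedup_ratio` is just for this example.

```python
def dedup_ratio(allocated_dsize, referenced_dsize):
    """Ratio of data referenced through the DDT to data actually allocated."""
    return referenced_dsize / allocated_dsize


# Rounded values from the "Total" row of the histogram above, in G.
total_allocated = 43.3
total_referenced = 65.4

ratio = dedup_ratio(total_allocated, total_referenced)
saved = total_referenced - total_allocated

print(f"dedup ratio ~{ratio:.2f}x, ~{saved:.1f}G saved")
```

The refcnt 2 bucket shows the same relationship per bucket: 260K allocated blocks are referenced 520K times.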

Am I doing something wrong, or was my expectation too high?

Best regards,
Hannes

3 replies

hanneskasp Oct 9, 2024

I did some more tests. Now with a powered off VM (a VMware template). Here I see 55% dedup while it should be 100% dedup because it's 100% the same data.

first backup

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup 992G  14.6G   977G        -         -     2%     1%  1.00x    ONLINE  -

second backup

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup 992G  21.3G   971G        -         -     1%     2%  1.55x    ONLINE  -

amotin Oct 9, 2024
Collaborator

one backup file = one machine in my case

If you are concatenating all files from a system into one backup file, then a minimal difference in the size of one file may change the data offsets within the backup file relative to ZFS block boundaries, which makes the file blocks no longer identical and so not dedupable.
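The alignment effect described here is easy to reproduce outside ZFS. The following Python sketch splits a byte stream into fixed-size blocks, hashes each block, and counts how many block hashes two streams have in common. The 4096-byte block size and the helper names are assumptions for the demo (a real dataset's recordsize may differ), and SHA-256 merely stands in for whatever checksum the pool uses.

```python
import hashlib

BLOCK_SIZE = 4096  # demo value only; not necessarily the dataset's recordsize


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of data, as block-level dedup would see it."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def shared_blocks(a, b):
    """Number of distinct block hashes that appear in both streams."""
    return len(set(block_hashes(a)) & set(block_hashes(b)))


# Deterministic, non-repeating payload: 4096 digests * 32 bytes = 32 blocks.
payload = b"".join(
    hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(4096)
)

identical_copy = payload
shifted_copy = b"X" + payload  # same content, moved by one byte

print(shared_blocks(payload, identical_copy))
print(shared_blocks(payload, shifted_copy))
```

An identical copy shares every block with the original, while the copy shifted by a single byte shares none, even though almost all of its content is the same. That is consistent with a straight file copy deduplicating well while a second backup stream does not.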

hanneskasp Oct 9, 2024

okay, thanks for confirmation. Then I guess it does not work for our backup use-case.

satmandu · 2024-12-06T14:25:33Z

satmandu
Dec 6, 2024

Has ram usage changed for Fast Dedup vs the previous dedup implementation?

Is the recommendation still somewhere in the range of 1-5 GB of RAM per TB of dedup-enabled pool?

5 replies

amotin Dec 6, 2024
Collaborator

The memory consumption is expected to drop a bit, but not dramatically enough to change the recommendations. It actually depends on the number of blocks, and therefore on block/file sizes, so the recommendation is just a rule of thumb. But now, with the introduction of quotas and pruning, you might be able to keep it in better shape even if you mispredicted.
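A back-of-the-envelope calculation shows why block size matters so much for any such rule of thumb. In the Python sketch below, the number of table entries is the amount of unique data divided by the block size, and memory is that count times a per-entry cost. The per-entry cost of 400 bytes is purely a placeholder assumption for the arithmetic, not a measured or documented figure for Fast Dedup, and the function names are made up for this example.

```python
KIB = 1024
TIB = 1024 ** 4

# Placeholder assumption for illustration; not a documented DDT entry size.
ASSUMED_BYTES_PER_ENTRY = 400


def ddt_entries(unique_data_bytes, block_size_bytes):
    """One entry per unique block (ceiling division)."""
    return -(-unique_data_bytes // block_size_bytes)


def ddt_memory_bytes(unique_data_bytes, block_size_bytes, bytes_per_entry):
    """Estimated table size for the given data volume and block size."""
    return ddt_entries(unique_data_bytes, block_size_bytes) * bytes_per_entry


for block_size in (128 * KIB, 16 * KIB):
    entries = ddt_entries(TIB, block_size)
    memory = ddt_memory_bytes(TIB, block_size, ASSUMED_BYTES_PER_ENTRY)
    print(f"{block_size // KIB:4d}K blocks: {entries:,} entries, "
          f"{memory / 1024 ** 3:.1f} GiB")
```

Whatever the real per-entry cost is, going from 128K blocks to 16K blocks multiplies the entry count, and so the memory estimate, by eight for the same amount of data.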

allanjude Dec 8, 2024
Collaborator Author

Expanding on what amotin said, you can set the dedup quota (the dedup_table_quota pool property) to the amount of memory you want dedup to be able to use, and it will operate within that. Dedup will be limited when you reach the quota, but the new dedup prune feature can remove older entries to allow new entries to continue to dedup.

allanjude Dec 8, 2024
Collaborator Author

https://klarasystems.com/articles/introducing-openzfs-fast-dedup/

IvanVolosyuk Dec 9, 2024

What is unclear to me is the effect of pruning. It will reduce the efficiency of dedup, but I have no intuition in what way. Will a file copy be detected? I mean, if something is not in the dedup table but it is in the ARC, will it cause deduplication to happen? What if both fast dedup and BRT are enabled?

robn Dec 9, 2024
Collaborator

@IvanVolosyuk pruning removes entries from the dedup table that have never been deduplicated (ie refcount = 1). The block is still on disk as it always was, just "detached" from the dedup system. So, if a duplicate of that block is written in the future, it won't be found in the dedup table, and the block will be written as normal and a new entry created for it.

Everything else works the same. The ARC has never been involved in dedup, and continues not to be. The interaction between dedup and block cloning remains the same: if a cloned block is in the dedup table, the refcount is bumped, if not, the entry(s) in the BRT are bumped.
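The consequence of pruning described above can be shown with a toy Python model. Here the dedup table is just a dict from checksum to refcount, and `prune_unique` and `write_block` are hypothetical helpers for this example. The sketch ignores entry age (the real prune targets older unique entries) and everything about on-disk layout; it only shows that a pruned block is no longer found, so a later duplicate is written out again and gets a fresh entry.

```python
def write_block(ddt, checksum):
    """Write a block through the toy dedup table and report what happened."""
    if checksum in ddt:
        ddt[checksum] += 1
        return "deduplicated"
    ddt[checksum] = 1
    return "written, new entry"


def prune_unique(ddt):
    """Remove entries that were never deduplicated (refcount == 1)."""
    unique = [checksum for checksum, refcount in ddt.items() if refcount == 1]
    for checksum in unique:
        del ddt[checksum]
    return len(unique)


ddt = {}
for checksum in ("a", "b", "b"):
    write_block(ddt, checksum)       # ddt is now {"a": 1, "b": 2}

removed = prune_unique(ddt)          # drops "a", keeps "b"
after_prune = write_block(ddt, "a")  # duplicate of a pruned block
third_copy = write_block(ddt, "a")   # matches the entry created just above

print(removed, after_prune, third_copy)
```

The first duplicate written after the prune costs a full block on disk; only later copies deduplicate, and they do so against the new entry, not the original pruned block.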

ipepe · 2024-12-10T03:29:32Z

ipepe
Dec 10, 2024

Where can I read more about the Fast Dedup feature? My questions are:

Is Fast Dedup an extension on top of normal dedup or a separate functionality?
If I create a pool with dedup on ZFS 2.1, upgrade to 2.3 and enable fast dedup, would I see benefits, or do I need to recreate the whole pool with Fast Dedup on?

4 replies

ipepe Dec 10, 2024

Additionally, if I purge records, is there a command to reverse this operation? Basically, read the blocks saved on the pool that are not referenced in the DDT and put them back in there? It would be amazing if that command could accept a threshold of >=2. As far as I understand, consider this scenario:

Save the unique file to the pool as file1
Run purge
Save the same file from point 1 to the pool as file2
Run purge

In this scenario, running the reverse-purge with a dedup threshold x>1 would find the file and dedup it. Alternatively, you could run dedup and purge again, but that would be inefficient, although probably a much safer and interruptible operation.

amotin Dec 10, 2024
Collaborator

There is no such command. Pruning is a way to save constrained resources, so "unsaving" them sounds weird. But if the block pointers of file1 have the "D" bit set, they can theoretically be re-added into the DDT. But file2 in your case already has separate blocks allocated, so you would effectively need to rewrite it, changing the block pointers to the ones from file1 and freeing the old blocks of file2. That is what block cloning of file1 into file2 would actually do if file1's blocks are somehow reinstated in the DDT first. Or, if not reinstated, a record would just be added into the BRT instead. But if you know from the beginning that file2 is a copy of file1, just use cheaper block cloning instead of dedup.

ipepe Dec 10, 2024

Pruning is a way to save constrained resources, so "unsaving" them sounds weird.

If I had a daily / weekly / monthly pruning schedule, but no control over what users copy onto a networked file share, then this is a very typical use case.

amotin Dec 10, 2024
Collaborator

If you have no control over the data and expect that anything can be deduplicated, then you had better have enough resources to not require pruning.
