Question: Are there hidden costs to enabling dedup on a rarely-used dataset? #10676
-
I understand (and have verified with testing) that using dedup is typically a bad idea except in specific cases. I have a case for an archival dataset that's rarely written or read (a big metadata crawl and ~120MB of writes daily in a burst); I believe there's a fair amount of redundancy across files in the dataset and I don't care hugely about read/write performance. I'd like to enable dedup on this dataset, but I was wondering: is there any resource usage overhead added for the non-dedup datasets in the same pool? Like, anything persistent in memory or extra indirection that affects those non-dedup datasets whose performance I do care about? The net has conflicting opinions, as usual. Cheers.
Replies: 5 comments
-
Dedup will affect only the datasets where it is (or has ever been) activated on disk.
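For what it's worth, dedup is an ordinary per-dataset property, so you can turn it on for just the archival dataset and leave everything else in the pool alone. A minimal sketch, with `tank/archive` standing in for your actual pool/dataset names:

```sh
# Enable dedup only on the archival dataset (names are hypothetical).
zfs set dedup=on tank/archive

# Confirm it's on for that dataset and still off everywhere else in the pool.
zfs get -r dedup tank
```

Only blocks written to `tank/archive` while the property is on will get DDT entries; writes to the other datasets are untouched.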
-
The dedup table is pool-wide, so any destroy operation on the pool has to traverse it, as I understand it.
-
The DDT is pool-wide, but it has to be consulted only when a block that is marked as deduplicated is freed.
-
And how does it know? By traversing the DDT.
-
No. The block pointer contains a flag indicating whether the target block is deduplicated.
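If you want to poke at this yourself, zdb can dump individual objects with full block pointer detail; deduplicated blocks should be marked as such in the block pointer dump (non-dedup blocks print as unique, if I remember the output right). Dataset name and object ID below are just placeholders:

```sh
# Dump object 42 of the dataset, including its block pointers.
zdb -ddddd tank/archive 42
```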
-
File deletes and snapshot destroys might force DDT reads; this will reduce the overall available bandwidth of the pool and can increase I/O latency for non-dedup filesystems. The DDT will also occupy part of the memory available to the ARC, which can impact caching efficiency for other datasets.
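If you do enable it and want to keep an eye on the cost, the DDT's footprint (entry count, on-disk size, in-core size) is easy to check; the pool name below is hypothetical:

```sh
# One-line dedup summary: number of DDT entries and their on-disk / in-core size.
zpool status -D tank

# Full DDT histograms, if you want more detail.
zdb -DD tank
```

At roughly 120MB/day of new data and the default 128K recordsize, that's on the order of a thousand new DDT entries per day, so for a workload like yours the table (and whatever slice of ARC it occupies) should stay tiny.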