StoragePaperNote/Deduplication/Post-Dedup/MigratoryCompression-FAST'14.md
---
typora-copy-images-to: ../paper_figure
---
# Migratory Compression: Coarse-grained Data Reordering to Improve Compressibility

| Venue | Category |
| :------------------------: | :------------------: |
| FAST'14 | Post Deduplication, Compression |

[TOC]

## 1. Summary
### Motivation of this paper

- motivation
  - compression finds redundancy among strings within a **limited** distance (window size)
    - this bounds the match-finding overhead
    - gzip: 64 KiB sliding window; 7z: up to 1 GiB
  - window sizes are small, so similarity across a large distance is not identified
    - traditional compressors cannot exploit redundancy across a large range of data (e.g., many GB)

 

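To see the window limit concretely, here is a small self-contained Python demo (mine, not the paper's): the same 4 KiB block appears twice, and zlib's DEFLATE (32 KiB sliding window) only exploits the repeat when the gap between the two copies fits inside the window.

```python
import os
import zlib

block = os.urandom(4096)                 # incompressible 4 KiB block
for gap_kib in (16, 64):
    gap = os.urandom(gap_kib * 1024)
    data = block + gap + block           # same block twice, gap_kib KiB apart
    out = zlib.compress(data, 9)         # DEFLATE: 32 KiB sliding window
    print(f"gap={gap_kib:3d} KiB  in={len(data):6d} B  out={len(out):6d} B")
```

With a 16 KiB gap the second copy lands inside the window and the output shrinks by roughly one block; with a 64 KiB gap it is out of reach and everything compresses as if it were unique. MC's reordering moves such distant duplicates next to each other before the compressor runs.
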
### Migratory Compression (MC)

- main idea
  - coarse-grained reorganization to **group similar blocks** together, improving compressibility
  - a generic **pre-processing** stage for standard compressors
  - reorder chunks to store similar chunks sequentially, increasing the compressor's opportunity to detect redundant strings and leading to better compression
- two use cases
  - mzip: using MC to compress a single file, integrating MC with a traditional compressor (e.g., gzip)
  - archival: data migration from backup storage systems to archive tiers
  - 
- design considerations
  - partition the data into blocks and calculate similarity features
  - group by content and **identify duplicate and similar blocks**
  - output *migrate* and *restore* recipes (sketched in code after this list)
    - migrate recipe: the block order after rearrangement
    - restore recipe: the original order expressed against the rearranged layout

- rearrange the input file
  - a large number of I/Os is necessary to reorganize the original data
  - block-level
    - random I/Os
    - fine for memory and SSD
  - multi-pass (HDD)
    - convert random I/Os into sequential scans
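
Below is a toy sketch (mine, not the paper's code) of the MC planning step, under simplifying assumptions: fixed-size blocks, and a plain min-hash over shingles standing in for the paper's similarity super-features; blocks are grouped only when their whole feature tuples match, whereas MC matches on individual super-features. It emits the migrate recipe (the new block order) and the restore recipe (its inverse).

```python
import zlib
from collections import defaultdict

CHUNK_SIZE = 8 * 1024                        # assumed fixed-size blocks
SEEDS = (0x1234, 0x9ABC, 0x0F0F, 0x5A5A)     # arbitrary seeds for the min-hash stand-in

def split_blocks(data: bytes):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def features(block: bytes):
    # min-hash over 8-byte shingles: a generic stand-in for similarity features
    shingles = [block[i:i + 8] for i in range(0, max(1, len(block) - 7), 4)]
    return tuple(min(zlib.crc32(s) ^ seed for s in shingles) for seed in SEEDS)

def plan_migration(blocks):
    groups = defaultdict(list)
    for idx, blk in enumerate(blocks):
        groups[features(blk)].append(idx)    # duplicate/similar blocks collide here
    # migrate recipe: similar blocks become neighbors in the new order
    migrate = [idx for key in sorted(groups) for idx in groups[key]]
    # restore recipe: where each original block ended up after rearrangement
    restore = [0] * len(migrate)
    for new_pos, old_pos in enumerate(migrate):
        restore[old_pos] = new_pos
    return migrate, restore

blocks = split_blocks(open("input.bin", "rb").read())   # hypothetical input file
migrate, restore = plan_migration(blocks)
reordered = b"".join(blocks[i] for i in migrate)        # then compress (mzip)
```

Restoring original position `i` means reading block `restore[i]` of the reordered layout; on HDD, MC performs this scattered access pattern as several sequential passes instead of per-block random reads.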

### Implementation and Evaluation

- implementation
  - uses xdelta for delta encoding; the chunk earliest in the file is selected as the base for each group of similar chunks (a toy delta-encoding sketch follows this list)
  - based on DDFS
    - an active tier for backups
    - a long-term retention tier for archival
  - in-memory, SSD, and HDD configurations
- evaluation
  - datasets
    - private backup workloads (6 GiB - 28 GiB)
  - compression effectiveness and performance trade-off
    - combined with different compression algorithms
  - data reorganization throughput
    - tested in-memory, on HDD, and on SSD
  - delta compression
    - compared with delta compression (DC), the improvement is very small (0.81%)
  - sensitivity to different parameters
    - chunk size
    - chunking algorithm
    - compression window
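
MC uses xdelta; as a rough illustration of why a similar base chunk yields a tiny delta, here is a toy XOR-and-DEFLATE delta codec (an assumed stand-in, not the paper's encoder):

```python
import zlib

def delta_encode(base: bytes, target: bytes) -> bytes:
    # XOR against the base, then DEFLATE: similar chunks XOR to mostly
    # zero bytes, which compress extremely well
    n = max(len(base), len(target))
    diff = bytes(a ^ b for a, b in zip(base.ljust(n, b"\0"), target.ljust(n, b"\0")))
    return zlib.compress(diff, 9)

def delta_decode(base: bytes, delta: bytes, target_len: int) -> bytes:
    diff = zlib.decompress(delta)
    padded = base.ljust(len(diff), b"\0")
    return bytes(a ^ b for a, b in zip(padded, diff))[:target_len]
```

Because every member of a similarity group deltas against the group's earliest chunk, a restore needs at most one extra chunk read per delta.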

## 2. Strength (Contributions of the paper)

- the idea is very simple and easy to follow
  - improves both the compression factor (CF) and compression throughput via **deduplication** and **re-organization**
- very extensive experiments
  - tunes every plausible parameter and explains the underlying reasons behind the results

## 3. Weakness (Limitations of the paper)

- novelty: the idea is, in a sense, a coarse-grained BWT (Burrows-Wheeler transform) over **a large range** (tens of GBs or more), which is not very novel
- compared with delta compression, the improvement is very limited

## 4. Some Insights (Future work)

- the ways to improve compressibility
  - increasing the look-back window
  - reordering data
---
typora-copy-images-to: ../paper_figure
---
# Building a High-performance Fine-grained Deduplication Framework for Backup Storage with High Deduplication Ratio

| Venue | Category |
| :------------------------: | :------------------: |
| ATC'22 | Deduplication |

[TOC]


## 1. Summary
### Motivation of this paper

- motivation: fine-grained deduplication suffers from poor backup/restore performance
  - it adds delta compression to exploit more compressibility, so workloads share more data, which decreases locality and increases I/O overhead
  - this paper addresses the different forms of **poor locality** in fine-grained deduplication
- problems
  - **reading base issue**: reading base chunks for delta encoding (in the backup process)
    - inefficient I/O when reading base chunks
  - **fragmentation issue**: caused by a new kind of reference relationship between delta and base chunks (breaks **spatial locality**)
    - delta-base relationships lead to more complex fragmentation than deduplication alone
  - **repeatedly accessing issue**: containers are accessed repeatedly to gather delta-base pairs (breaks **temporal locality**)
    - delta-base dependencies cause poor temporal locality

### MeGA

- selective delta compression
  - insight: base chunks are not distributed evenly across containers -> some containers are base-sparse
  - skip delta compression for chunks whose base chunks are located in "base-sparse containers" (a sketch follows the figure below)
    - avoids reading "inefficient" containers
- delta-friendly data layout
  - change from an order-based data layout to a lifecycle-based data layout
  - classify chunks into categories according to whether they are always referenced by the same set of consecutive backup workloads
  - two-level reference: **directly** referenced chunks and their **indirectly** referenced chunks
  - to simplify the implementation, only deduplicate redundancy between **adjacent backups**, so that chunks' lifecycles are always consecutive (similar to MFDedup)
- forward reference and delta prewriting
  - when performing a restore, delta-encoded chunks are always accessed **before** their base chunks
    - ensures all restore-involved containers need to be read only once
  - user space and backup space are **asymmetric**
    - user space: SSDs or NVMs
    - backup space: HDDs
  - prewrite the delta chunks of the to-be-restored backup workload (in user space)
- 

### Implementation and Evaluation

- baselines
  - Greedy, FGD (fine-grained deduplication with Capping), CLD (chunk-level deduplication with Capping), and MFD (FAST'21)
- traces: WEB, CHM, SYN, and VMS
- metrics: backup speed, restore speed, and deduplication ratio
- I/O overhead of maintaining the data layout
  - maintenance costs vs. GC costs

## 2. Strength (Contributions of the paper)

- analyzes several forms of poor locality caused by fine-grained deduplication
  - additional I/O overhead -> poor backup/restore performance
- several designs: the delta selector, the delta-friendly data layout, always-forward-reference traversing, and delta prewriting

## 3. Weakness (Limitations of the paper)

- hard to follow, especially the third design
- needs a maintenance process to adjust the layout
  - the overhead is high: 0.32x-1.92x the GC I/O overhead

## 4. Some Insights (Future work)

- terminology: the paper refers to delta compression as "fine-grained deduplication"
- all deduplicated chunks are stored in containers in order, and then each container is compressed
  - the compression unit is a container (see the small demo below)
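
A tiny demo (mine, not the paper's) of why the container makes a good compression unit: compressing the whole container lets the compressor match data across chunk boundaries, which per-chunk compression cannot; the exact repeated block below stands in for the partial redundancy that survives deduplication.

```python
import os
import zlib

chunk = os.urandom(1024)                         # 1 KiB of incompressible data
container = b"".join([chunk, os.urandom(1024),   # the chunk repeats inside the
                      chunk, os.urandom(1024)])  # container, 2 KiB apart

per_chunk = sum(len(zlib.compress(container[i:i + 1024], 9))
                for i in range(0, len(container), 1024))
whole = len(zlib.compress(container, 9))         # sees across chunk boundaries
print(f"per-chunk: {per_chunk} B, whole container: {whole} B")
```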