update the paper list
yzr95924 committed Sep 23, 2022
1 parent 6b960f9 commit 41a0e40
Showing 6 changed files with 366 additions and 124 deletions.
332 changes: 208 additions & 124 deletions README.md

Large diffs are not rendered by default.

@@ -0,0 +1,88 @@
---
typora-copy-images-to: ../paper_figure
---
# Migratory Compression: Coarse-grained Data Reordering to Improve Compressibility

| Venue | Category |
| :------------------------: | :------------------: |
| FAST'14 | Post Deduplication, Compression |
[TOC]

## 1. Summary
### Motivation of this paper

- motivation
- compression can find redundancy among strings within a **limited** distance (window size)
- this keeps the match-finding overhead bounded
- gzip: 64 KiB sliding window, 7z: up to 1 GiB
- window sizes are small, so similarity across a large distance will not be identified
- traditional compressors are unable to exploit redundancy across a large range of data (e.g., many GB); see the toy example after the figure below

![image-20220615220333487](../paper_figure/image-20220615220333487.png)
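
A small, hypothetical demonstration (not from the paper) of the window limitation: two identical 16 KiB blocks separated by about 1 MiB of incompressible filler are out of reach of DEFLATE's sliding window, but compress away once reordered to be adjacent.

```python
import os
import zlib

# Hypothetical toy example: two identical 16 KiB blocks separated by ~1 MiB of
# random (incompressible) filler are too far apart for DEFLATE's small sliding
# window, but compress well once they are reordered next to each other.
block = os.urandom(16 * 1024)
filler = os.urandom(1024 * 1024)

far_apart = block + filler + block   # duplicates ~1 MiB apart
adjacent = block + block + filler    # duplicates next to each other

print(len(zlib.compress(far_apart, 9)))  # roughly the input size
print(len(zlib.compress(adjacent, 9)))   # roughly one block smaller
```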

### Migratory Compression (MC)

- main idea
- coarse-grained reorganization to **group similar blocks** to improve compressibility
- include a generic **pre-processing** stage for standard compressors
- reorder chunks to store similar chunks sequentially, increasing compressors' opportunity to detect redundant strings and leading to better compression
- two use cases
- mzip: using MC to compress a single file, integrating MC with a traditional compressor (e.g., gzip)
- archival: data migration from backup storage systems to archive tiers
- ![image-20220615221046364](../paper_figure/image-20220615221046364.png)
- design considerations (a minimal sketch follows this list)
- partition the input into blocks, calculate similarity features
- group blocks by content and **identify duplicate and similar blocks**
- output a *migrate* and a *restore* recipe
- migrate recipe: the block order after rearrangement
- restore recipe: the mapping from the rearranged order back to the original file order

- rearrange the input file
- a large number of I/Os is needed to reorganize the original data
- block-level: random I/Os, fine for memory and SSD
- multi-pass (HDD): convert random I/Os into sequential scans
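
A minimal Python sketch of this pre-processing flow, under simplifying assumptions: fixed-size blocks whose total length divides evenly, and a SHA-1 digest standing in for the paper's similarity features (super-features over rolling fingerprints), so only exact duplicates group together; `BLOCK_SIZE` and the helper names are made up for illustration.

```python
import hashlib
from collections import defaultdict

# A minimal sketch of the MC pre-processing stage, assuming fixed-size blocks,
# a SHA-1 digest as a stand-in "similarity feature" (the paper derives
# super-features from rolling fingerprints so that near-identical blocks also
# group together), and input whose length is a multiple of BLOCK_SIZE.
BLOCK_SIZE = 8 * 1024  # assumed block size, not the paper's parameter

def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def feature(block: bytes) -> bytes:
    # identical blocks collide; a real super-feature also catches similar ones
    return hashlib.sha1(block).digest()[:8]

def build_recipes(blocks):
    groups = defaultdict(list)
    for idx, blk in enumerate(blocks):
        groups[feature(blk)].append(idx)
    # migrate recipe: the new block order, with duplicate/similar blocks adjacent
    migrate = [idx for grp in groups.values() for idx in grp]
    # restore recipe: for each original position, its position after migration
    restore = [0] * len(migrate)
    for new_pos, old_pos in enumerate(migrate):
        restore[old_pos] = new_pos
    return migrate, restore

def rearrange(data: bytes):
    blocks = split_blocks(data)
    migrate, restore = build_recipes(blocks)
    reordered = b"".join(blocks[i] for i in migrate)
    return reordered, restore  # compress `reordered`; keep `restore` to rebuild

def restore_original(reordered: bytes, restore):
    blocks = split_blocks(reordered)
    return b"".join(blocks[restore[i]] for i in range(len(restore)))
```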


### Implementation and Evaluation

- implementation
- use xdelta for delta encoding; the earliest chunk in the file is selected as the base for each group of similar chunks (the base-selection rule is sketched after this list)
- based on DDFS
- an active tier for backups
- a long-term retention tier for archival
- in-memory, SSD, HDD
- evaluation
- datasets
- private backup workloads (6 GiB - 28 GiB)
- compression effectiveness and performance trade-off
- combine with different compression algorithms
- data reorganization throughput
- test with in-memory, HDD, SSD
- delta compression
- compared with DC, the improvement is very small (0.81%)
- sensitivity to different parameters
- chunk size
- chunking algorithm
- compression window
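
A tiny sketch of that base-selection rule, assuming the similarity groups come from the feature-grouping step sketched above; the delta encoding itself (xdelta) is not reproduced here.

```python
def pick_bases(similar_groups):
    """similar_groups: lists of chunk indexes judged similar to each other.
    For each group, the chunk appearing earliest in the file is the base and
    the remaining chunks would be delta-encoded against it (e.g., with xdelta)."""
    plan = []
    for group in similar_groups:
        base = min(group)  # earliest chunk in the original file order
        targets = [i for i in group if i != base]
        plan.append((base, targets))
    return plan

# example: two similarity groups over chunk indexes
print(pick_bases([[7, 2, 9], [4, 1]]))  # [(2, [7, 9]), (1, [4])]
```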

## 2. Strength (Contributions of the paper)

- the idea is very simple and easy to follow
- improve both the compression factor (CF) and compression throughput via **deduplication** and **re-organization**
- very extensive experiments
- try to tune every possible parameter and explain the underlying reasons behind the results

## 3. Weakness (Limitations of the paper)

- limited novelty: the idea is, in a sense, a coarse-grained BWT (Burrows-Wheeler transform) applied over **a large range** of data (tens of GBs or more)
- compared with delta compression, the improvement is very limited

## 4. Some Insights (Future work)

- the ways to improve compressibility
- increasing the look-back window
- reordering data
70 changes: 70 additions & 0 deletions StoragePaperNote/MeGA-ATC'22.md
@@ -0,0 +1,70 @@
---
typora-copy-images-to: ../paper_figure
---
# Building a High-performance Fine-grained Deduplication Framework for Backup Storage with High Deduplication Ratio

| Venue | Category |
| :------------------------: | :------------------: |
| ATC'22 | Deduplication |
[TOC]

## 1. Summary
### Motivation of this paper

- motivation: fine-grained deduplication suffers from poor backup/restore performance
- it introduces delta compression to exploit more redundancy across workloads, so workloads share more data, which decreases locality -> increases I/O overhead
- this paper addresses the issues caused by different forms of **poor locality** in fine-grained deduplication
- problem
- **reading base issue**: reading base chunks from delta encoding (in backup process)
- inefficient I/O when reading base chunks
- **fragmentation issue**: caused by a new kind of reference relationship between delta and base chunks (breaking **spatial locality**)
- delta-base relationships lead to more complex fragmentation problems than deduplication alone
- **repeatedly accessing issue**: containers are accessed repeatedly to gather delta-base pairs (breaking **temporal locality**)
- delta-base dependencies cause poor temporal locality

### MeGA

- selective delta compression
- insight: base chunks are not evenly distributed across containers -> some containers are "base-sparse"
- skip delta compression for chunks whose base chunks are located in "base-sparse containers"
- avoid reading such "inefficient" containers (see the sketch after this list)
- delta-friendly data layout
- changes the order-based data layout into a lifecycle-based data layout
- classifies chunks into categories according to whether they are always referenced by the same set of consecutive backup workloads
- two-level reference: **directly** referenced chunks and their **indirectly** referenced chunks
- to simplify the implementation, only deduplicate redundancies between **adjacent backups** to ensure chunks' lifecycles are always consecutive (similar to MFDedup)
- forward reference and delta prewriting
- when performing a restore, delta-encoded chunks are always accessed **before** their base chunks
- ensure all restore-involved containers need to be read only once
- user space and backup space are **asymmetric**
- user space: SSDs or NVMs
- backup space: HDDs
- prewrite delta chunks of the to-be-restored backup workload in user space
- ![image-20220912232446270](../paper_figure/image-20220912232446270.png)
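
A minimal sketch of the selective-delta idea, assuming a simple "referenced bases per container" count as the sparsity measure; `container_of_base`, `SPARSE_THRESHOLD`, and the function name are made-up illustrations, not MeGA's actual metadata or policy.

```python
from collections import Counter

# A sketch of selective delta compression: skip delta encoding when the base
# chunk lives in a "base-sparse" container; the threshold is made up.
SPARSE_THRESHOLD = 16

def select_delta_pairs(candidates, container_of_base):
    """candidates: list of (chunk_id, base_chunk_id) similar pairs.
    container_of_base: maps base_chunk_id -> container_id (assumed index)."""
    bases_per_container = Counter(container_of_base[b] for _, b in candidates)
    selected, skipped = [], []
    for chunk_id, base_id in candidates:
        if bases_per_container[container_of_base[base_id]] >= SPARSE_THRESHOLD:
            selected.append((chunk_id, base_id))  # container is worth reading
        else:
            skipped.append(chunk_id)  # base-sparse: skip delta, store chunk as-is
    return selected, skipped
```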

### Implementation and Evaluation

- baselines
- Greedy, FGD (fine-grained deduplication with Capping), CLD (chunk-level deduplication with Capping), and MFD (FAST'21)
- traces: WEB, CHM, SYN, and VMS
- backup speed, restore speed, and deduplication ratio
- I/O overhead in maintaining data layout
- maintenance costs vs. GC costs

## 2. Strength (Contributions of the paper)

- analyze several forms of poor locality caused by fine-grained deduplication
- additional I/O overhead -> poor backup/restore performance
- several designs: delta selector, delta-friendly data layout, always-forward-reference traversing, and delta prewriting

## 3. Weakness (Limitations of the paper)

- hard to follow, especially for the third design
- needs a maintenance process to adjust the layout
- the maintenance overhead is high: 0.32-1.92x the GC I/O overhead

## 4. Some Insights (Future work)

- terminology: the paper refers to delta compression as "fine-grained deduplication"
- all deduplicated chunks are stored in containers in order, and then each container will be compressed
- compression unit: a container
Binary file added paper_figure/image-20220615220333487.png
Binary file added paper_figure/image-20220615221046364.png
Binary file added paper_figure/image-20220912232446270.png
