update the paper list
yzr95924 committed Sep 23, 2022
1 parent 6b960f9 commit 41a0e40
Showing 6 changed files with 366 additions and 124 deletions.
332 changes: 208 additions & 124 deletions README.md

Large diffs are not rendered by default.

@@ -0,0 +1,88 @@
---
typora-copy-images-to: ../paper_figure
---
# Migratory Compression: Coarse-grained Data Reordering to Improve Compressibility

| Venue | Category |
| :------------------------: | :------------------: |
| FAST'14 | Post Deduplication, Compression |
[TOC]

## 1. Summary
### Motivation of this paper

- motivation
- compression can find redundancy among strings within a **limited** distance (window size)
- this keeps the match-finding overhead bounded
- gzip: 64 KiB sliding window, 7z: up to 1 GiB
- window sizes are small, so similarity across a large distance will not be identified
- traditional compressors are unable to exploit redundancy across a large range of data (e.g., many GB); see the toy example after the figure below

![image-20220615220333487](../paper_figure/image-20220615220333487.png)
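
A small, hypothetical demonstration (not from the paper) of the window limitation: two identical 16 KiB blocks separated by about 1 MiB of incompressible filler are out of reach of DEFLATE's sliding window, but compress away once reordered to be adjacent.

```python
import os
import zlib

# Hypothetical toy example: two identical 16 KiB blocks separated by ~1 MiB of
# random (incompressible) filler are too far apart for DEFLATE's small sliding
# window, but compress well once they are reordered next to each other.
block = os.urandom(16 * 1024)
filler = os.urandom(1024 * 1024)

far_apart = block + filler + block   # duplicates ~1 MiB apart
adjacent = block + block + filler    # duplicates next to each other

print(len(zlib.compress(far_apart, 9)))  # roughly the input size
print(len(zlib.compress(adjacent, 9)))   # roughly one block smaller
```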

### Migratory Compression (MC)

- main idea
- coarse-grained reorganization to **group similar blocks** to improve compressibility
- include a generic **pre-processing** stage for standard compressors
- reorder chunks to store similar chunks sequentially, increasing compressors' opportunity to detect redundant strings and leading to better compression
- two use cases
- mzip: using MC to compress a single file, integrating MC with a traditional compressor (e.g., gzip)
- archival: data migration from backup storage systems to archive tiers
- ![image-20220615221046364](../paper_figure/image-20220615221046364.png)
- design considerations (a minimal sketch follows this list)
- partition the input into blocks, calculate similarity features
- group blocks by content and **identify duplicate and similar blocks**
- output a *migrate* and a *restore* recipe
- migrate recipe: the block order after rearrangement
- restore recipe: the mapping from the rearranged order back to the original file order

- rearrange the input file
- a large number of I/Os is needed to reorganize the original data
- block-level: random I/Os, fine for memory and SSD
- multi-pass (HDD): convert random I/Os into sequential scans
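
A minimal Python sketch of this pre-processing flow, under simplifying assumptions: fixed-size blocks whose total length divides evenly, and a SHA-1 digest standing in for the paper's similarity features (super-features over rolling fingerprints), so only exact duplicates group together; `BLOCK_SIZE` and the helper names are made up for illustration.

```python
import hashlib
from collections import defaultdict

# A minimal sketch of the MC pre-processing stage, assuming fixed-size blocks,
# a SHA-1 digest as a stand-in "similarity feature" (the paper derives
# super-features from rolling fingerprints so that near-identical blocks also
# group together), and input whose length is a multiple of BLOCK_SIZE.
BLOCK_SIZE = 8 * 1024  # assumed block size, not the paper's parameter

def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def feature(block: bytes) -> bytes:
    # identical blocks collide; a real super-feature also catches similar ones
    return hashlib.sha1(block).digest()[:8]

def build_recipes(blocks):
    groups = defaultdict(list)
    for idx, blk in enumerate(blocks):
        groups[feature(blk)].append(idx)
    # migrate recipe: the new block order, with duplicate/similar blocks adjacent
    migrate = [idx for grp in groups.values() for idx in grp]
    # restore recipe: for each original position, its position after migration
    restore = [0] * len(migrate)
    for new_pos, old_pos in enumerate(migrate):
        restore[old_pos] = new_pos
    return migrate, restore

def rearrange(data: bytes):
    blocks = split_blocks(data)
    migrate, restore = build_recipes(blocks)
    reordered = b"".join(blocks[i] for i in migrate)
    return reordered, restore  # compress `reordered`; keep `restore` to rebuild

def restore_original(reordered: bytes, restore):
    blocks = split_blocks(reordered)
    return b"".join(blocks[restore[i]] for i in range(len(restore)))
```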


### Implementation and Evaluation

- implementation
- use xdelta for delta encoding; the earliest chunk in the file is selected as the base for each group of similar chunks (the base-selection rule is sketched after this list)
- based on DDFS
- an active tier for backups
- a long-term retention tier for archival
- in-memory, SSD, HDD
- evaluation
- datasets
- private backup workloads (6 GiB - 28 GiB)
- compression effectiveness and performance trade-off
- combine with different compression algorithms
- data reorganization throughput
- test with in-memory, HDD, SSD
- delta compression
- compared with DC, the improvement is very small (0.81%)
- sensitivity to different parameters
- chunk size
- chunking algorithm
- compression window
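
A tiny sketch of that base-selection rule, assuming the similarity groups come from the feature-grouping step sketched above; the delta encoding itself (xdelta) is not reproduced here.

```python
def pick_bases(similar_groups):
    """similar_groups: lists of chunk indexes judged similar to each other.
    For each group, the chunk appearing earliest in the file is the base and
    the remaining chunks would be delta-encoded against it (e.g., with xdelta)."""
    plan = []
    for group in similar_groups:
        base = min(group)  # earliest chunk in the original file order
        targets = [i for i in group if i != base]
        plan.append((base, targets))
    return plan

# example: two similarity groups over chunk indexes
print(pick_bases([[7, 2, 9], [4, 1]]))  # [(2, [7, 9]), (1, [4])]
```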

## 2. Strength (Contributions of the paper)

- the idea is very simple and easy to follow
- improve both the compression factor (CF) and compression throughput via **deduplication** and **re-organization**
- very extensive experiments
- try to tune every possible parameter and explain the underlying reasons behind the results

## 3. Weakness (Limitations of the paper)

- limited novelty: the idea is, in a sense, a coarse-grained BWT (Burrows-Wheeler transform) applied over **a large range** of data (tens of GBs or more)
- compared with delta compression, the improvement is very limited

## 4. Some Insights (Future work)

- the ways to improve compressibility
- increasing the look-back window
- reordering data
70 changes: 70 additions & 0 deletions StoragePaperNote/MeGA-ATC'22.md
@@ -0,0 +1,70 @@
---
typora-copy-images-to: ../paper_figure
---
# Building a High-performance Fine-grained Deduplication Framework for Backup Storage with High Deduplication Ratio

| Venue | Category |
| :------------------------: | :------------------: |
| ATC'22 | Deduplication |
[TOC]

## 1. Summary
### Motivation of this paper

- motivation: fine-grained deduplication suffers from poor backup/restore performance
- it introduces delta compression to exploit more redundancy across workloads, so workloads share more data, which decreases locality -> increases I/O overhead
- this paper addresses the issues caused by different forms of **poor locality** in fine-grained deduplication
- problem
- **reading base issue**: reading base chunks from delta encoding (in backup process)
- inefficient I/O when reading base chunks
- **fragmentation issue**: caused by a new kind of reference relationship between delta and base chunks (breaking **spatial locality**)
- delta-base relationships lead to more complex fragmentation problems than deduplication alone
- **repeatedly accessing issue**: containers are accessed repeatedly to gather delta-base pairs (breaking **temporal locality**)
- delta-base dependencies cause poor temporal locality

### MeGA

- selective delta compression
- insight: base chunks are not evenly distributed across containers -> some containers are "base-sparse"
- skip delta compression for chunks whose base chunks are located in "base-sparse containers"
- avoid reading such "inefficient" containers (see the sketch after this list)
- delta-friendly data layout
- changes the order-based data layout into a lifecycle-based data layout
- classifies chunks into categories according to whether they are always referenced by the same set of consecutive backup workloads
- two-level reference: **directly** referenced chunks and their **indirectly** referenced chunks
- to simplify the implementation, only deduplicate redundancies between **adjacent backups** to ensure chunks' lifecycles are always consecutive (similar to MFDedup)
- forward reference and delta prewriting
- when performing a restore, delta-encoded chunks are always accessed **before** their base chunks
- ensure all restore-involved containers need to be read only once
- user space and backup space are **asymmetric**
- user space: SSDs or NVMs
- backup space: HDDs
- prewrite delta chunks of the to-be-restored backup workload in user space
- ![image-20220912232446270](../paper_figure/image-20220912232446270.png)
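
A minimal sketch of the selective-delta idea, assuming a simple "referenced bases per container" count as the sparsity measure; `container_of_base`, `SPARSE_THRESHOLD`, and the function name are made-up illustrations, not MeGA's actual metadata or policy.

```python
from collections import Counter

# A sketch of selective delta compression: skip delta encoding when the base
# chunk lives in a "base-sparse" container; the threshold is made up.
SPARSE_THRESHOLD = 16

def select_delta_pairs(candidates, container_of_base):
    """candidates: list of (chunk_id, base_chunk_id) similar pairs.
    container_of_base: maps base_chunk_id -> container_id (assumed index)."""
    bases_per_container = Counter(container_of_base[b] for _, b in candidates)
    selected, skipped = [], []
    for chunk_id, base_id in candidates:
        if bases_per_container[container_of_base[base_id]] >= SPARSE_THRESHOLD:
            selected.append((chunk_id, base_id))  # container is worth reading
        else:
            skipped.append(chunk_id)  # base-sparse: skip delta, store chunk as-is
    return selected, skipped
```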

### Implementation and Evaluation

- baselines
- Greedy, FGD (fine-grained deduplication with Capping), CLD (chunk-level deduplication with Capping), and MFD (FAST'21)
- traces: WEB, CHM, SYN, and VMS
- backup speed, restore speed, and deduplication ratio
- I/O overhead in maintaining data layout
- maintenance costs vs. GC costs

## 2. Strength (Contributions of the paper)

- analyze several forms of poor locality caused by fine-grained deduplication
- additional I/O overhead -> poor backup/restore performance
- several designs: delta selector, delta-friendly data layout, always-forward-reference traversing, and delta prewriting

## 3. Weakness (Limitations of the paper)

- hard to follow, especially for the third design
- needs a maintenance process to adjust the layout
- the maintenance overhead is high: 0.32-1.92x the GC I/O overhead

## 4. Some Insights (Future work)

- terminology: the paper refers to delta compression as "fine-grained deduplication"
- all deduplicated chunks are stored in containers in order, and then each container will be compressed
- compression unit: a container
Binary file added paper_figure/image-20220615220333487.png
Binary file added paper_figure/image-20220615221046364.png
Binary file added paper_figure/image-20220912232446270.png
