Commit: update

yzr95924 committed Feb 25, 2022
1 parent 7d94eaa commit 76151d3
Showing 169 changed files with 1,890 additions and 41 deletions.
135 changes: 97 additions & 38 deletions README.md

Large diffs are not rendered by default.

107 changes: 107 additions & 0 deletions StoragePaperNote/Deduplication/Cache-Dedup/AustereCache-ATC'20.md
@@ -0,0 +1,107 @@
---
typora-copy-images-to: ../paper_figure
---
Austere Flash Caching with Deduplication and Compression
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| USENIX ATC'20 | I/O deduplication |
[TOC]

## 1. Summary
### Motivation of this paper

- motivation
- deduplication and compression are promising data reduction techniques that save I/O by removing duplicate content
- yet incur **substantial memory overhead for index management**
- deduplication: removes chunk-level duplicates
- compression: removes byte-level duplicates within chunks
- flash caching
- building an SSD-based flash cache to boost the I/O performance of HDD-based primary storage
- storing the frequently accessed data in the flash cache
- conventional flash caching needs an SSD-HDD translation layer that maps **each LBA in an HDD** to **a chunk address (CA) in the flash cache**
- LBA-index
- tracks how each LBA is mapped to the FP of a chunk (**many-to-one**, as multiple LBAs may refer to the same FP)
- FP-index
- tracks how each FP is mapped to the CA and the length of a compressed chunk (**one-to-one**)
- memory amplification (a rough worked example follows this list)
- **conventional flash caching**:
- only needs to index (LBA, CA) pair
- with deduplication and compression
- LBA-index: (LBA, FP) pairs
- FP-index: (FP, CA) pairs
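
A rough back-of-the-envelope illustration (Python, with hypothetical per-field sizes rather than the paper's exact numbers) of why keeping both indexes amplifies in-memory index usage:

```python
LBA_BYTES = 8     # hypothetical field sizes, for illustration only
CA_BYTES = 8
FP_BYTES = 20     # e.g., a SHA-1 fingerprint
LEN_BYTES = 4     # length of a variable-size compressed chunk

conventional = LBA_BYTES + CA_BYTES                   # one (LBA, CA) pair per cached chunk
with_dedup = (LBA_BYTES + FP_BYTES) \
           + (FP_BYTES + CA_BYTES + LEN_BYTES)        # LBA-index entry + FP-index entry

print(conventional, with_dedup, with_dedup / conventional)
# 16 60 3.75  -> several times more index memory per cached chunk
```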

### Austere Cache

- Bucketization
- partitions both the LBA-index and the FP-index into **equal-size buckets** composed of a fixed number of **equal-size slots**
- each slot corresponds to an LBA and an FP
- divide the flash cache space into **a metadata region** and **a data region**
- both regions are also partitioned into buckets with multiple slots (the **same numbers** of buckets and slots as in the FP-index), so each FP-index slot maps **one-to-one** to **the same slot** in the metadata and data regions
- ![image-20211129001332809](../paper_figure/image-20211129001332809.png)
- Reduce the memory usage
- each slot stores only **the prefix of a key**, rather than the full key (see the sketch after this list)
- computes the hashes of both the LBA and the FP
- resolve the hash collisions
- maintain the full LBA and FP information in the metadata region in flash
- any hash collision only leads to a cache miss without data loss
- write path: **input (LBA, FP)**
- update the LBA-index and the FP-index (use **the suffix bits** to identify the bucket)
- **scan all slots in the corresponding bucket** to see if the LBA-hash prefix has already been stored
- if the bucket is full and cannot store more LBAs, it evicts **the oldest LBA using FIFO**
- deduplication path: **input (LBA, FP)**
- use the FP-hash to find the corresponding bucket, and then check the **FP-hash prefix**
- further check the corresponding slot in the metadata region in flash, and verify that the input FP matches the one stored in the slot
- read path: **input (LBA)**
- check the LBA-index for **the FP-hash** using the LBA-hash prefix
- further query the FP-index for the slot that contains the FP-hash prefix, then check the corresponding slot in the metadata region
- Fixed-size compressed data management
- **Avoid tracking the length of the compressed chunk** in the index structures
- slice a compressed chunk into fixed-size subchunks
- **the last subchunk is padded** to the full subchunk size
- subchunk size is 8 KiB
- LBA-index remains unchanged
- **allocate a slot for each subchunk in FP-index (data region, metadata region)**
- find **consecutive slots in the FP-index** for the multiple subchunks of a compressed chunk
- Bucket-based cache replacement
- the cache replacement decisions are based on **only the entries within each bucket**
- incurs limited performance overhead
- a bucket-based LRU policy
- each bucket sorts all slots by the recency of their LBAs
- the slots **at the lower offsets correspond to the more recently accessed LBAs**
- does not incur any extra memory overhead for maintaining the recency information of all slots
- use CM-Sketch to track the reference count of all FP-hashes
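
A minimal sketch (Python, with hypothetical sizes and hash choices) of the ideas above, not the authors' implementation: suffix bits of a key's hash select a bucket, each slot keeps only a short prefix of the hash, each bucket orders its slots by recency and evicts from the tail when full, and compressed chunks are sliced into padded fixed-size subchunks.

```python
import hashlib

NUM_BUCKETS = 1 << 16        # hypothetical sizes; the paper tunes these to the cache size
SLOTS_PER_BUCKET = 32
PREFIX_BITS = 16             # only this prefix is kept in memory per slot
SUBCHUNK_SIZE = 8 * 1024     # 8 KiB subchunks


def key_hash(key: bytes) -> int:
    """Hash a key (LBA or FP); suffix bits pick the bucket, prefix bits go in the slot."""
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "little")


class BucketizedIndex:
    """In-memory index of fixed-size buckets with fixed-size slots, usable as the
    LBA-index; the FP-index has the same bucket/slot layout."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]   # slot = (prefix, value)

    def _locate(self, key: bytes):
        h = key_hash(key)
        bucket_id = h % NUM_BUCKETS                       # suffix bits -> bucket
        prefix = (h >> 48) & ((1 << PREFIX_BITS) - 1)     # prefix bits -> slot key
        return bucket_id, prefix

    def put(self, key: bytes, value) -> None:
        bucket_id, prefix = self._locate(key)
        bucket = self.buckets[bucket_id]
        for i, (p, _) in enumerate(bucket):               # scan all slots in the bucket
            if p == prefix:
                del bucket[i]                             # found: refresh its recency below
                break
        else:
            if len(bucket) == SLOTS_PER_BUCKET:
                bucket.pop()                              # bucket full: evict the least
                                                          # recently accessed slot (tail)
        bucket.insert(0, (prefix, value))                 # lower offsets = more recent

    def get(self, key: bytes):
        bucket_id, prefix = self._locate(key)
        bucket = self.buckets[bucket_id]
        for i, (p, value) in enumerate(bucket):
            if p == prefix:
                bucket.insert(0, bucket.pop(i))           # promote on access (bucket LRU)
                # the caller must still verify the full LBA/FP kept in the flash
                # metadata region; a prefix collision only causes a cache miss
                return value
        return None


def slice_into_subchunks(compressed: bytes) -> list:
    """Slice a compressed chunk into fixed-size subchunks, padding the last one,
    so no compressed length has to be tracked in the index structures."""
    padded = compressed + b"\0" * (-len(compressed) % SUBCHUNK_SIZE)
    return [padded[i:i + SUBCHUNK_SIZE] for i in range(0, len(padded), SUBCHUNK_SIZE)]
```

Because the FP-index buckets and slots correspond one-to-one to those of the metadata and data regions, a slot's position already identifies where the (sub)chunk lives in flash, which is why no CA needs to be kept in memory; the FP-index would evict by reference count (tracked with the CM-Sketch) rather than by recency.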

### Implementation and Evaluation

- Implementation
- issues reads and writes to the underlying storage devices via **pread** and **pwrite** system calls
- XXHash for fast hash computations in the index structures
- SHA-1 from Intel ISA-L
- bucket-level concurrency
- Evaluation
- Trace: FIU, Synthetic
- Baseline: CacheDedup
- Comparative analysis
- Overall memory usage, impact of design techniques on memory saving, read hit ratio, write reduction ratio
- Sensitivity to parameters
- Impact of chunk sizes and subchunk sizes, impact of LBA-index sizes
- Throughput and CPU overhead
- Throughput, CPU overhead, Throughput of multi-threading

## 2. Strength (Contributions of the paper)

- uses the one-to-one slot mapping to avoid storing CAs
- good evaluation
- very comprehensive

## 3. Weakness (Limitations of the paper)

## 4. Some Insights (Future work)

- background of flash deduplication and compression
- extra metadata:
- the mappings of each logical address to the physical address of the non-duplicate chunk **in the flash cache**
- **the cryptographic hashes** of all stored chunks in the flash cache
- **the lengths** of all compressed chunks that are of variable size
- **fixed-size** chunks fit better into flash units
122 changes: 122 additions & 0 deletions StoragePaperNote/Deduplication/Mem-Block-Dedup/CAFTL-FAST'11.md
@@ -0,0 +1,122 @@
---
typora-copy-images-to: ../paper_figure
---
CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| FAST'11 | FTL Deduplication |
[TOC]

## 1. Summary
### Motivation of this paper

- Motivation
- the limited lifespan of SSDs, which are built on flash memories with limited erase/program cycles, is still one of the most critical concerns
- as bit density increases, there is a high probability of correlated device failures in SSD-based RAID, and the endurance and retention of SSDs are not yet proven in the field
- The lifespan of SSDs
- The amount of incoming write traffic
- The size of over-provisioned flash space
- The efficiency of garbage collection and wear-leveling mechanisms
- Data deduplication
- data duplication
- slice the disk space into 4KiB blocks and use the SHA-1 hash function to calculate a 160-bit hash value for each block
- the duplication rate ranges from **7.9% to 85.9%** across the 15 disks
- I/O duplication
- intercept each I/O request and calculate a hash value for each requested block
- **5.8-28.1%** of the writes are duplicated
- Challenges
- limited resources: with limited memory space and computing power
- relatively lower redundancy: **much lower duplication rate** than that of backup streams
- low overhead requirement

### CAFTL (Content-Aware Flash Translation Layer)

- Main goal
- effectively reduce write traffic to flash memory by removing unnecessary duplicate writes and can also substantially extend available free flash memory space
- The main workflow
- intercepts incoming write requests **at the SSD device level**
- use a hash function to generate fingerprints summarizing the content of updated data
- querying a fingerprint store
- designs a set of acceleration methods to speed up fingerprinting
- does not need to change the standard host/device interface (without file system)
- small on-device buffer spaces (e.g., 2MiB) and make performance overhead nearly negligible
- Design overview
- a combination of both in-line and out-of-line deduplication
- does not guarantee that all duplicate writes can be examined and removed immediately
- ![image-20211122232836344](../paper_figure/image-20211122232836344.png)
- Hashing
- adopts a **fixed-size** chunking approach (the operation unit in flash is a page, e.g., 4KiB)
- calculate a **SHA-1** hash value as its fingerprint and store it as the page's metadata in flash
- Fingerprint store
- manage an **in-memory** structure
- only store and search in the **most likely-to-be-duplicated fingerprints** in memory
- {fingerprint, (PBA/VBA, reference)}
- consider fingerprints with a reference counter larger than 255 as highly referenced
- optimization
- range check, hotness-based reorganization, bucket-level binary search
- move hot fingerprints closer to the list head to potentially reduce the number of scanned buckets
- Indirect mapping
- a mapping table to track the physical block address (PBA) to which each LBA is mapped
- N-to-1 mapping
- maintain a primary mapping and a secondary mapping table in memory
- LBA -> VBA -> PBA (see the sketch after this list)
- track the number of referencing logical pages
- must be able to quickly identify all the logical pages mapped to a physical page and update their mapping entries to point to the new location (needed by GC)
- ![image-20211123000856695](../paper_figure/image-20211123000856695.png)
- the mapping tables in flash
- when updating the in-memory tables, an update record is logged into a small in-memory buffer
- the metadata pages in flash
- reserve a dedicated number of flash pages (metadata page: LBA and fingerprint)
- keep a metadata page array for tracking PBAs of the metadata pages
- Acceleration methods
- Sampling for hashing
- select the first four bytes from each page in a write request
- Light-weight pre-hashing
- compute a light-weight CRC-32 first; only compute the SHA-1 fingerprint if the CRC-32 matches an existing one
- Dynamic switches
- high watermark, low watermark to turn the in-line deduplication off and on
- Out-of-line deduplication
- scan the metadata page array to find physical pages not yet fingerprinted
- perform together with the GC process or independently
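
A minimal sketch (Python, with hypothetical structure and field names) of the two-level indirect mapping and the light-weight pre-hashing described above: deduplicated LBAs go through a VBA so GC only has to update a single entry when the shared page moves, and a cheap CRC-32 filters out most unique pages before SHA-1 is computed.

```python
import hashlib
import zlib


class IndirectMapping:
    """Unique LBAs map directly to a PBA; deduplicated LBAs map to a VBA, and a
    secondary table maps the VBA to the shared PBA plus a reference count, so GC
    only updates one VBA entry when the shared physical page is relocated."""

    def __init__(self):
        self.primary = {}     # LBA -> PBA, or ("VBA", vba) for deduplicated pages
        self.secondary = {}   # VBA -> [PBA, reference count]
        self.next_vba = 0

    def map_unique(self, lba: int, pba: int) -> None:
        self.primary[lba] = pba

    def share(self, new_lba: int, existing_lba: int) -> None:
        """Map new_lba to the same physical page as existing_lba, promoting the
        existing direct entry to an indirect (VBA) entry if necessary."""
        entry = self.primary[existing_lba]
        if isinstance(entry, tuple):                  # already VBA-mapped
            vba = entry[1]
            self.secondary[vba][1] += 1
        else:                                         # promote the PBA entry to a VBA
            vba, self.next_vba = self.next_vba, self.next_vba + 1
            self.secondary[vba] = [entry, 2]          # shared PBA, two referencing LBAs
            self.primary[existing_lba] = ("VBA", vba)
        self.primary[new_lba] = ("VBA", vba)

    def resolve(self, lba: int) -> int:
        entry = self.primary[lba]
        if isinstance(entry, tuple):                  # LBA -> VBA -> PBA
            return self.secondary[entry[1]][0]
        return entry                                  # LBA -> PBA


def is_duplicate(page: bytes, crc_index: dict) -> bool:
    """Light-weight pre-hashing: check a cheap CRC-32 first and compute the full
    SHA-1 fingerprint only when the CRC-32 already matches a stored page."""
    crc = zlib.crc32(page)
    if crc not in crc_index:                          # most unique pages stop here
        return False
    return hashlib.sha1(page).digest() in crc_index[crc]   # crc -> set of SHA-1 digests
```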

### Implementation and Evaluation

- Implementation
- SSD simulator
- DiskSim simulation environment
- the indirect mapping, garbage collection and wear-leveling policies, and others
- when a write request is received at the SSD, it is first buffered in the cache, and the SSD reports completion to the host
- Evaluation
- effectiveness of deduplication
- removing duplicate writes, extending flash space
- performance impact
- cache size, hashing speed, fingerprint searching
- acceleration methods
- sampling, light-weight pre-hashing, dynamic switch

## 2. Strength (Contributions of the paper)

- good evaluation

## 3. Weakness (Limitations of the paper)

- the implementation is based on a simulator

## 4. Some Insights (Future work)

- SSD background
- An erase block usually consists of 64-128 pages, each page has a data area (e.g., 4KiB)
- read and write are performed in units of **pages**, and erase clears all the pages in an **erase block**
- Rule:
- No in-place overwrite: the whole block must be erased before writing any page in this block
- No random writes: the pages in an erase block must be written sequentially
- Limited erase/program cycles
- Flash Translation Layer (FTL)
- emulate a hard disk drive by exposing an array of logical block addresses (LBAs) to the host
- **Indirect mapping**: track the dynamic mapping between logical block addresses (LBAs) and physical block addresses (PBAs)
- **Log-like write mechanism**: new data is **appended sequentially in a clean erase block**, like a log (see the sketch after this list)
- **Garbage collection**: periodically to consolidate the valid pages into a new erase block, and clean the old erase block
- **Wear-leveling**: tracks and shuffles hot/cold data to even out writes in flash memory
- **Over-provisioning**: In order to assist garbage collection and wear-leveling
- include a certain amount of over-provisioned spare flash memory space
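
A minimal sketch (Python, simplified geometry) of the FTL rules above: no in-place overwrite, sequential page writes into the active erase block, LBA -> PBA indirection, and block-level garbage collection that relocates valid pages before erasing.

```python
PAGES_PER_BLOCK = 64              # simplified geometry; real devices vary


class SimpleFTL:
    """Log-structured FTL: writes are appended to clean pages of the active erase
    block, the mapping table tracks LBA -> (block, page), and GC reclaims blocks."""

    def __init__(self, num_blocks: int):
        self.mapping = {}                             # LBA -> (block, page)
        self.valid = {}                               # (block, page) -> LBA
        self.free_blocks = list(range(num_blocks))
        self.active_block = self.free_blocks.pop(0)
        self.next_page = 0

    def write(self, lba: int) -> None:
        if self.next_page == PAGES_PER_BLOCK:         # active block full: take a clean one
            self.active_block = self.free_blocks.pop(0)
            self.next_page = 0
        old = self.mapping.get(lba)
        if old is not None:
            del self.valid[old]                       # old copy becomes stale, erased later
        pba = (self.active_block, self.next_page)     # no in-place overwrite,
        self.next_page += 1                           # pages written sequentially
        self.mapping[lba] = pba
        self.valid[pba] = lba

    def garbage_collect(self, block: int) -> None:
        """Relocate the still-valid pages of `block` to the log head, then erase it."""
        assert block != self.active_block
        for page in range(PAGES_PER_BLOCK):
            lba = self.valid.get((block, page))
            if lba is not None:
                self.write(lba)                       # consolidate valid pages
        self.free_blocks.append(block)                # whole block erased and reusable
```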
@@ -0,0 +1,79 @@
---
typora-copy-images-to: ../paper_figure
---
Using Hints to Improve Inline Block-Layer Deduplication
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| FAST'16 | Deduplication |
[TOC]

## 1. Summary
### Motivation of this paper
Important information about data context (e.g. data vs. metadata writes) is lost at the block layer.
> This paper argues that passing such context to the block layer can help improve deduplication performance and reliability.
> The root cause: the **semantic divide** between the block layer and file systems
![1560135727490](../paper_figure/1560135727490.png)

This paper proposes an interface for block-layer deduplication systems that allows upper storage layers to pass hints based on the available context.
> Most existing deduplication solutions are built into file systems because file systems have enough information to deduplicate efficiently without jeopardizing reliability. This information can be leveraged to avoid deduplicating certain blocks (e.g., metadata).
### Block-layer deduplication hints
- Hints
Hinting asks higher layers to provide small amounts of extra information to the deduplication layer. This paper uses hinting to recover context at the block layer.

- Two main advantages in block-layer deduplication
1. allowing any file system and application to benefit from deduplication
2. ease of implementation

- Potential Hints
1. Bypass deduplication (**NODEDUP**)
Main idea: some writes are known a priori to be likely **unique**. Attempting to deduplicate unique writes wastes CPU time on hash computation and I/O bandwidth on maintaining the hash index (see the sketch after this list).
> application case: some applications generate data that will not benefit from deduplication (e.g., random or encrypted data)
> Overhead: hash computation, index size, more RAM space, more lookup bandwidth.
> main issue: unique data and reliability

For **metadata**:
Most file system metadata is unique
> metadata writes are more important to overall system performance than data writes because the former are often synchronous.
> adding deduplication to metadata might increase the latency of these critical metadata writes.
> reliability: file systems deliberately duplicate metadata to protect against corruption, which deduplication would defeat.
2. Prefetch hashes (**PREFETCH**)
When a block-layer deduplication system knows what data is about to be written, it can prefetch the corresponding hashes from the index
> accelerating future data writes by reducing lookup delays.
> inform the deduplication system of I/O operations that are likely to generate further duplicates (e.g., copying a file)
> their hashes can be prefetched and cached to minimize random accesses.
3. Other
Bypass compression, cluster hashes, partitioned hash index, intelligent chunking
> cluster hashes: files that reside in the same directory tend to be accessed together.
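
A minimal sketch (Python, with hypothetical flag names and a hypothetical backing-store interface) of how NODEDUP and PREFETCH hints could steer a block-layer write path; this illustrates the idea, not the paper's actual implementation.

```python
import hashlib

NODEDUP = 0x1     # hypothetical hint flags carried alongside each block write
PREFETCH = 0x2


class HintedDedup:
    """Block-layer deduplication whose write path honors hints from upper layers."""

    def __init__(self, backing_store):
        self.backing = backing_store   # assumed to expose write_block(lba, data) -> pba
        self.index = {}                # fingerprint -> physical block address
        self.hash_cache = {}           # prefetched (hot) subset of the index
        self.lba_map = {}              # logical -> physical block address

    def write(self, lba: int, data: bytes, hints: int = 0) -> None:
        if hints & NODEDUP:
            # unique or reliability-critical data (metadata, random, encrypted):
            # skip hashing and index maintenance entirely
            self.lba_map[lba] = self.backing.write_block(lba, data)
            return
        fp = hashlib.sha256(data).digest()            # fingerprint choice is illustrative
        pba = self.hash_cache.get(fp)
        if pba is None:
            pba = self.index.get(fp)                  # slower full-index lookup
        if pba is None:                               # not a duplicate: store and index it
            pba = self.backing.write_block(lba, data)
            self.index[fp] = pba
        self.lba_map[lba] = pba                       # duplicate: only remap the LBA

    def prefetch(self, expected_fps) -> None:
        """PREFETCH hint (e.g., on a file copy): warm the in-memory hash cache so
        the upcoming duplicate writes avoid slow index lookups."""
        for fp in expected_fps:
            if fp in self.index:
                self.hash_cache[fp] = self.index[fp]
```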

### Implementation and Evaluation
- Implementation
Modify the write path and read path.
![1560174931404](../paper_figure/1560174931404.png)
![1560175081156](../paper_figure/1560175081156.png)

- Evaluation
1. NODEDUP: observe the elapsed time in four file systems
> no-hint vs. hint-on
2. PREFETCH: observe the elapsed time in four file systems
> no-hint vs. hint-on
3. Throughput: using Filebench

## 2. Strength (Contributions of the paper)
1. This paper states that if a block-level deduplication system can know when it is unwise to deduplicate a write, it can optimize its performance and reliability.
2. This method can be useful when writing unique data (to avoid wasting resources) or when duplicate chunks must be stored for reliability.

## 3. Weakness (Limitations of the paper)
1. In my opinion, the idea in this paper is simply to deliver context information to the block layer so that block-layer deduplication can do a better job.

## 4. Future Works
1. This work mentions that its initial experimental results are promising, and that more hints could be added to provide more information to the block layer
> provide richer context to the block layer.
