update

Zuoru Yang committed Sep 23, 2020
1 parent 1640a20 commit 66ce89e

Showing 17 changed files with 226 additions and 226 deletions.

13 changes: 8 additions & 5 deletions README.md
@@ -66,7 +66,7 @@ In this repo, it records some paper related to storage system, including **Data
### Secure Deduplication
1. *Convergent Dispersal: Toward Storage-Efficient Security in a Cloud-of-Clouds*----HotStorage'14 ([link](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/hotstorage14.pdf)) [summary](https://yzr95924.github.io/paper_summary/CAONT-RS-HotStorage'14.html)
2. *CDStore: Toward Reliable, Secure, and Cost-Efficient Cloud Storage via Convergent Dispersal*----USENIX ATC'15 ([link](https://www.usenix.org/system/files/conference/atc15/atc15-paper-li-mingqiang.pdf)) [summary](https://yzr95924.github.io/paper_summary/CDStore-ATC'15.html)
3. *Information Leakage in Encrypted Deduplication via Frequency Analysis*----DSN'17
3. *Information Leakage in Encrypted Deduplication via Frequency Analysis*----DSN'17 ([link](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/dsn17.pdf))
4. *DupLESS: Server-Aided Encryption for Deduplicated Storage*----USENIX Security'13 ([link](https://eprint.iacr.org/2013/429.pdf)) [summary](https://yzr95924.github.io/paper_summary/DupLESS-Security'13.html)
5. *Side Channels in Cloud Services, the Case of Deduplication in Cloud Storage*----S&P'10 ([link](http://www.pinkas.net/PAPERS/hps.pdf)) [summary](https://yzr95924.github.io/paper_summary/SideChannel-S&P'10.html)
6. *Side Channels in Deduplication: Trade-offs between Leakage and Efficiency*----AsiaCCS'17 ([link](https://dl.acm.org/doi/abs/10.1145/3052973.3053019)) [summary](https://yzr95924.github.io/paper_summary/SideChannelTradeOffs-AsiaCCS'17.html)
@@ -141,6 +141,7 @@ In this repo, it records some paper related to storage system, including **Data
7. *MUCH: Multi-threaded Content-Based File Chunking*----TC'15
8. *Multi-Level Comparison of Data Deduplication in a Backup Scenario*----SYSTOR'09
9. *A Framework for Analyzing and Improving Content-Based Chunking Algorithms*----HP Technical Report'05
10. *FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication*----USENIX ATC'16 ([link](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)) [summary](https://yzr95924.github.io/paper_summary/FastCDC-ATC'16.html)

### Cache Deduplication

@@ -149,11 +150,12 @@ In this repo, it records some paper related to storage system, including **Data
3. *Nitro: A Capacity-Optimized SSD Cache for Primary Storage*----USENIX ATC'14 ([link](https://www.usenix.org/system/files/conference/atc14/atc14-paper-li_cheng_nitro.pdf))

### Benchmark

1. *SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-gracia-tinedo.pdf))

### Garbage Collection

1. *Memory Efficient Sanitization of a Deduplicated Storage System*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final100_0.pdf))
1. *Memory Efficient Sanitization of a Deduplicated Storage System*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final100_0.pdf)) [summary](https://yzr95924.github.io/paper_summary/MemorySanitization-FAST'13.html)
2. *Accelerating Restore and Garbage Collection in Deduplication-based Backup System via Exploiting Historical Information*----USENIX ATC'14 ([link](https://pdfs.semanticscholar.org/9b8d/a007a6801c9f96784dc7bc839794cb0db3ad.pdf)) [summary](https://yzr95924.github.io/paper_summary/AcceleratingRestore-ATC'14.html)
3. *The Logic of Physical Garbage Collection in Deduplicating Storage*----FAST'17 ([link](https://www.usenix.org/system/files/conference/fast17/fast17-douglis.pdf))
4. *Concurrent Deletion in a Distributed Content-addressable Storage System with Global Deduplication*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final91.pdf))
@@ -223,11 +225,12 @@ In this repo, it records some paper related to storage system, including **Data
3. *Latency Reduction and Load Balancing in Coded Storage Systems*----SoCC'17

## C. Security

### Survey
1. *A Survey on Systems Security Metrics*----ACM Computing Surveys'16

### Secret Sharing
1. *How to Best Share a Big Secret*----SYSTOR'18
1. *How to Best Share a Big Secret*----SYSTOR'18 ([link](http://www.systor.org/2018/pdf/systor18-24.pdf)) [summary](https://yzr95924.github.io/paper_summary/ShareBigSecret-SYSTOR'18.html)
2. *AONT-RS: Blending Security and Performance in Dispersed Storage Systems*----FAST'11
3. *Secure Deletion for a Versioning File System*----FAST'05
4. *Splinter: Practical Private Queries on Public Data*----NSDI'17
@@ -236,7 +239,7 @@ In this repo, it records some paper related to storage system, including **Data
### Data Encryption

1. *Differentially Private Access Patterns for Searchable Symmetric Encryption*----INFOCOM'18 [summary](https://yzr95924.github.io/paper_summary/DifferentialPrivacy-INFOCOM'18.html)
2. *Frequency-Hiding Order-Preserving Encryption*----CCS'15
2. *Frequency-Hiding Order-Preserving Encryption*----CCS'15 ([link](https://dl.acm.org/doi/10.1145/2810103.2813629))
3. *RAPPOR: Randomized Aggregable Privacy-Preserving Ordinal Response*----CCS'14
4. *Privacy at Scale: Local Differential Privacy in Practice*----SIGMOD'18
5. *Frequency-smoothing Encryption: Preventing Snapshot Attacks on Deterministically Encrypted Data*----IACR'17 [summary](https://yzr95924.github.io/paper_summary/FrequencySmoothing-ICAR'17.html)
@@ -252,7 +255,7 @@ In this repo, it records some paper related to storage system, including **Data
15. *Oblivious RAM as a Substrate for Cloud Storage - The Leakage Challenge Ahead*----CCSW'16 ([link](https://dl.acm.org/citation.cfm?id=2996430)) [summary](https://yzr95924.github.io/paper_summary/ORAM-CCSW'16.html)
16. *Oblivious RAM: A Dissection and Experimental Evaluation*---VLDB'16 ([link](http://www.vldb.org/pvldb/vol9/p1113-chang.pdf))
17. *Splinter: Practical Private Queries on Public Data*----NSDI'17 ([link](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-wang-frank.pdf))
18. *Quantifying Information Leakage of Deterministic Encryption*----CCSW'19 ([link]( http://users.cs.fiu.edu/~mjura011/documents/2019_CCSW_Quantifying_Information_Leakage_of_Deterministic_Encryption ))
18. *Quantifying Information Leakage of Deterministic Encryption*----CCSW'19 ([link](http://users.cs.fiu.edu/~mjura011/documents/2019_CCSW_Quantifying_Information_Leakage_of_Deterministic_Encryption)) [summary](https://yzr95924.github.io/paper_summary/QuantifyingInformationLeakage-CCSW'19.html)

### Secure Deletion

2 changes: 2 additions & 0 deletions StoragePaperNote/ChunkingAnalysisFramework-HP-TR'05.md
@@ -14,9 +14,11 @@ This paper proposes a framework to analyze the content-based chunking algorithms
> focus on **stateless chunking algorithms**, which do not consider the history of the sequence or the state of a server where other versions of the sequence might be stored.
**Chunking stability**: if a small modification is made to the data, turning it into a new version, and the chunking algorithm is applied to the new version of the data

> most of the chunks created for the new version are identical to the chunks created for the older version of the data.
### Two Thresholds, Two Divisors Algorithm (TTTD)

- Analysis on Basic Sliding Window Algorithm (BSW)
Basic workflow of the BSW:
there is a pre-determined integer $D$; a fixed-width sliding window is moved across the file, and at every position in the file.
103 changes: 103 additions & 0 deletions StoragePaperNote/Deduplication/Chunking/FastCDC-ATC'16.md
@@ -0,0 +1,103 @@
---
typora-copy-images-to: ../paper_figure
---
FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| ATC'16 | chunking |
[TOC]

## 1. Summary

### Motivation of this paper
- Motivation
- Existing CDC-based chunking introduces heavy CPU overhead
- declare the chunk cut-points by computing and judging the rolling hashes of the data stream **byte-by-byte**.
- Two parts: hashing and judging the cut-points
- By using the Gear hash function, the bottleneck shifts to the hash judgment stage.

### FastCDC

- Three key designs
- Simplified but enhanced hash judgment
- padding several zero bits into the mask
- Sub-minimum chunk cut-point skipping
- enlarge the minimum chunk size to maximize the chunking speed
- Normalized chunking
- normalizes the chunk size distribution to a small specified region
- increase the deduplication ratio
- reduce the number of small-sized chunks (can be combined with the cut-point skipping technique above to maximize the CDC speed without sacrificing the deduplication ratio)

- Gear hash function
- an array of 256 random 64-bit integers to map the values of the byte contents in the sliding window.
- using only three operations (i.e., +, <<, and an array lookup)
- enabling it to move quickly through the data content for the purpose of CDC (a minimal sketch follows the figures below)

![image-20200918215222896](../paper_figure/image-20200918215222896.png)
![image-20200918215434370](../paper_figure/image-20200918215434370.png)
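
A minimal sketch of the Gear rolling hash described above; the table name `GEAR_TABLE` and the splitmix64-based table initialization are illustrative choices, not the paper's code:

```c
#include <stdint.h>

/* 256 random 64-bit integers, generated once; any good PRNG will do.
 * splitmix64 is used here purely as an illustrative generator. */
uint64_t GEAR_TABLE[256];

void gear_table_init(uint64_t seed)
{
    for (int b = 0; b < 256; b++) {
        uint64_t z = (seed += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        GEAR_TABLE[b] = z ^ (z >> 31);
    }
}

/* One rolling step: a shift, an add, and a table lookup. The left shift
 * gradually ages out old bytes, so the effective sliding window is bounded
 * by the 64-bit word width. */
static inline uint64_t gear_roll(uint64_t fp, uint8_t byte)
{
    return (fp << 1) + GEAR_TABLE[byte];
}
```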

- Optimizing hash judgement
- Gear-based CDC employs the same conventional hash judgment used in the Rabin-based CDC
- A certain number of the lowest bits of the fingerprint are used to declare the chunk cut-point.
- FastCDC enlarges the sliding window size by padding a number of zero bits into the mask value
- change the hash judgment statement
- involve more bytes in the final hash judgment
- minimizing the probability of chunking position collision
- simplifying the hash judgment to accelerate CDC
- in Rabin: fp mod D == r
- in FastCDC: `fp & Mask == 0`, i.e., `!(fp & Mask)` (see the snippet after this list)
- avoid the unnecessary comparison operation
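
As a small illustration of the contrast above (the divisor, remainder, and mask value are illustrative, not the paper's parameters):

```c
#include <stdint.h>

/* Rabin-style judgment: a modulo plus a comparison at every byte position. */
static int rabin_is_cutpoint(uint64_t fp, uint64_t d, uint64_t r)
{
    return fp % d == r;
}

/* FastCDC-style judgment: a single AND against a mask whose '1' bits are
 * spread out by padded zero bits, so more fingerprint bits take part in the
 * decision without enlarging the physical sliding window. */
#define CDC_MASK 0x0000d93003530000ULL   /* illustrative mask value */
static int fastcdc_is_cutpoint(uint64_t fp)
{
    return !(fp & CDC_MASK);
}
```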

- Cut-point skipping
- avoid the operations for hash calculation and judgment in the skipped region.
- may reduce the deduplication ratio.
- the cumulative distribution of chunk size in Rabin-based CDC (without the maximum and minimum chunk size requirements) follows **an exponential distribution**.

- Normalized chunking
- solves the problem of the decreased deduplication ratio faced by the cut-point skipping approach.
- After normalized chunking, there are almost no chunks of size smaller than the minimum chunk size
- By changing the number of '1' bits in FastCDC, the chunk-size distribution will be approximately normalized to a specific region (always larger than the minimum chunk size, instead of following the exponential distribution)
- define two masks
- more effective mask bits: increase chunk size
- fewer effective mask bits: reduce chunk size
![image-20200918224123859](../paper_figure/image-20200918224123859.png)

- The whole algorithm

![image-20200919014910600](../paper_figure/image-20200919014910600.png)
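
A simplified sketch of the whole algorithm shown above, combining the optimized hash judgment, sub-minimum cut-point skipping, and normalized chunking with two masks. It reuses `GEAR_TABLE`/`gear_table_init` from the Gear sketch earlier, and all size and mask constants are illustrative rather than the paper's exact parameters:

```c
#include <stdint.h>
#include <stddef.h>

/* GEAR_TABLE is filled once by gear_table_init() (see the earlier sketch). */
extern uint64_t GEAR_TABLE[256];

enum {
    MIN_SIZE    = 2 * 1024,    /* sub-minimum region: skipped entirely   */
    NORMAL_SIZE = 8 * 1024,    /* target chunk size after normalization  */
    MAX_SIZE    = 64 * 1024
};

/* MASK_S has more effective '1' bits (harder to match -> suppresses small
 * chunks); MASK_L has fewer (easier to match -> suppresses large chunks). */
#define MASK_S 0x0003590703530000ULL
#define MASK_L 0x0000d90003530000ULL

/* Returns the length of the next chunk starting at data[0]. */
static size_t fastcdc_next_chunk(const uint8_t *data, size_t len)
{
    uint64_t fp = 0;
    size_t normal = NORMAL_SIZE;
    size_t i = MIN_SIZE;    /* cut-point skipping: no hashing below MIN_SIZE */

    if (len <= MIN_SIZE)
        return len;
    if (len >= MAX_SIZE)
        len = MAX_SIZE;
    else if (len <= NORMAL_SIZE)
        normal = len;

    /* Below the normal size: use the stricter mask. */
    for (; i < normal; i++) {
        fp = (fp << 1) + GEAR_TABLE[data[i]];
        if (!(fp & MASK_S))
            return i + 1;
    }
    /* Above the normal size: use the looser mask. */
    for (; i < len; i++) {
        fp = (fp << 1) + GEAR_TABLE[data[i]];
        if (!(fp & MASK_L))
            return i + 1;
    }
    return len;    /* no cut-point found: fall back to the maximum size */
}
```

A caller would invoke `gear_table_init` once, then call `fastcdc_next_chunk` repeatedly on the remaining buffer, advancing by the returned length each time.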

### Implementation and Evaluation

- Evaluation standard
- deduplication ratio
- chunking speed
- the average generated chunk size

- Compared with
- FastCDC
- Gear-based
- AE-based
- Rabin-based

- Evaluation of optimizing hash judgement
- Evaluation of cut-point skipping
- Evaluation of normalized chunking
- Comprehensive evaluation of FastCDC

## 2. Strength (Contributions of the paper)
1. propose a new chunking algorithm with three new designs
> enhanced hash judgment
> sub-minimum chunk cut-point skipping
> normalized chunking

## 3. Weakness (Limitations of the paper)

1. the part about cut-point skipping is not clearly explained

## 4. Some Insights (Future work)

1. The research directions in chunking algorithms
> algorithmic-oriented CDC optimizations
> hardware-oriented CDC optimizations
52 changes: 9 additions & 43 deletions StoragePaperNote/Deduplication/GC/MemorySanitization-FAST'13.md
Expand Up @@ -5,7 +5,7 @@ Memory Efficient Sanitization of a Deduplicated Storage System
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| FAST'13 | Deduplication Sanitization (EMC) |
| FAST'13 | Deduplication Sanitization |
[TOC]

## 1. Summary
@@ -22,13 +22,6 @@ Securely erasing sensitive data from a storage system (sanitization could requir
> 3. the whole sanitization process is efficient
> 4. the storage system is usable while sanitization runs
- Key idea:
multiple sanitization techniques that trade-off I/O and memory requirements
> the tracking of references is the main problem to solve to efficiently sanitize a deduplicated storage system.
Instead of needing a **dynamic** data structure that can handle insertions, it can optimize with a **static** data structure for our reference set.
> perfect hashes
### Sanitization
- Threat model
1. casual attack
@@ -43,44 +36,25 @@ Instead of needing a **dynamic** data structure that can handle insertions, it c
> require specific disk format
- Managing chunk references
1. copy to a clean system
copy the live data to a new storage system, and then destroy the original storage system.
2. reference index
maintaining correct reference counts is challenging (**live reference counts are not preferred**)
> 1. Partitioned reference index
> 2. Bloom filter: approximate a full index using a Bloom filter (check whether a key exists in the Bloom filter)
> > enumerate the live files and insert the fingerprints into the Bloom filter (may consider dead chunks as live chunks)
3. Bit vector
allocate a bit vector that is indexed by container number and offset within a container.
> the bottleneck has moved to the construction of the bit-vector. Check the meta region of the container.
4. Perfect hash vector
**key requirement:** need a **compact** representation of a set of fingerprints that provides an **exact** answer for whether a given fingerprint exists in the set or not.
> suppose it is a static version of the membership problem where the key space is known **beforehand**.
1. Bloom filter
2. Bit vector (see the sketch after this list)
3. Perfect hash vector
**key requirement:** need a compact representation of a set of fingerprints that provides an exact answer for whether a given fingerprint exists in the set or not.
> suppose it is a static version of the membership problem where the key space is known beforehand.
> no dynamic insertion or deletion of keys.
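
A minimal sketch of the bit-vector option, assuming each chunk is addressed by a (container number, offset-within-container) pair and a hypothetical fixed number of chunk slots per container; error handling is omitted:

```c
#include <stdint.h>
#include <stdlib.h>

#define SLOTS_PER_CONTAINER 1024   /* hypothetical: max chunk slots per container */

typedef struct {
    uint8_t *bits;             /* one bit per (container, offset) slot */
    size_t   num_containers;
} live_bitvec_t;

static live_bitvec_t *bitvec_create(size_t num_containers)
{
    live_bitvec_t *v = malloc(sizeof(*v));
    v->num_containers = num_containers;
    v->bits = calloc((num_containers * SLOTS_PER_CONTAINER + 7) / 8, 1);
    return v;
}

/* Mark a chunk as live while enumerating the live files. */
static void bitvec_mark_live(live_bitvec_t *v, size_t container, size_t offset)
{
    size_t idx = container * SLOTS_PER_CONTAINER + offset;
    v->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

/* During copy-forward, chunks whose bit is still 0 are dead and can be
 * sanitized. */
static int bitvec_is_live(const live_bitvec_t *v, size_t container, size_t offset)
{
    size_t idx = container * SLOTS_PER_CONTAINER + offset;
    return (v->bits[idx / 8] >> (idx % 8)) & 1u;
}
```
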
Lookup cost and generation cost:
> lookup cost: the number of random memory accesses needed for each lookup.
> generation cost: a linear function on the number of fingerprints.
The amount of memory will depend on the capacity of the system since the total number of fingerprints also depends on that.

Using two levels of hash functions to construct a perfect hash function

These two points support leveraging a perfect hash vector.
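
A minimal FKS-style sketch of this two-level idea, assuming fingerprints are abbreviated to 64-bit integers and the key set is static and duplicate-free. The paper's actual construction is more compact than the quadratic bucket tables used here, and all names (`phf_build`, `phvec_mark_live`, ...) are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Seeded 64-bit mixer (splitmix64 finalizer) used for both hash levels. */
static uint64_t mix64(uint64_t x, uint64_t seed)
{
    x += seed + 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

typedef struct {
    size_t    n_buckets;
    uint64_t *seed;     /* per-bucket second-level seed                */
    size_t   *offset;   /* start of each bucket's slot range           */
    size_t   *bsize;    /* slots reserved for each bucket (size^2)     */
    size_t    n_slots;
} phf_t;

/* Map a fingerprint to its unique slot (valid only for keys in the build set). */
static size_t phf_slot(const phf_t *p, uint64_t fp)
{
    size_t b = mix64(fp, 0) % p->n_buckets;
    return p->offset[b] + mix64(fp, p->seed[b]) % p->bsize[b];
}

/* Build over a static set of distinct fingerprints (no insert/delete later). */
static phf_t *phf_build(const uint64_t *keys, size_t n)
{
    phf_t *p = calloc(1, sizeof(*p));
    p->n_buckets = n ? n : 1;
    p->seed   = calloc(p->n_buckets, sizeof(*p->seed));
    p->offset = calloc(p->n_buckets, sizeof(*p->offset));
    p->bsize  = calloc(p->n_buckets, sizeof(*p->bsize));

    /* First level: count bucket sizes; a bucket with b keys gets b*b slots,
     * which makes finding a collision-free second-level seed easy. */
    size_t *count = calloc(p->n_buckets, sizeof(*count));
    for (size_t i = 0; i < n; i++)
        count[mix64(keys[i], 0) % p->n_buckets]++;
    for (size_t b = 0; b < p->n_buckets; b++) {
        p->offset[b] = p->n_slots;
        p->bsize[b]  = count[b] ? count[b] * count[b] : 1;
        p->n_slots  += p->bsize[b];
    }

    /* Second level: per bucket, try seeds until its keys land on distinct
     * slots. (A real builder would group keys per bucket first; this sketch
     * rescans all keys for brevity.) */
    uint8_t *used = malloc(p->n_slots);
    for (size_t b = 0; b < p->n_buckets; b++) {
        for (uint64_t s = 1; ; s++) {
            memset(used + p->offset[b], 0, p->bsize[b]);
            int ok = 1;
            for (size_t i = 0; i < n && ok; i++) {
                if (mix64(keys[i], 0) % p->n_buckets != b) continue;
                size_t slot = p->offset[b] + mix64(keys[i], s) % p->bsize[b];
                if (used[slot]) ok = 0; else used[slot] = 1;
            }
            if (ok) { p->seed[b] = s; break; }
        }
    }
    free(used);
    free(count);
    return p;
}

/* Perfect hash vector: one liveness bit per slot, indexed by phf_slot(). */
typedef struct { const phf_t *phf; uint8_t *bits; } phvec_t;

static void phvec_mark_live(phvec_t *v, uint64_t fp)
{
    size_t s = phf_slot(v->phf, fp);
    v->bits[s / 8] |= (uint8_t)(1u << (s % 8));
}

static int phvec_is_live(const phvec_t *v, uint64_t fp)
{
    size_t s = phf_slot(v->phf, fp);
    return (v->bits[s / 8] >> (s % 8)) & 1u;
}
```

Because every queried fingerprint comes from the on-disk index that was used to build the structure, the membership answer is exact, unlike a Bloom filter.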

- Sanitization process
For read-only file system: (read-only restriction: the key space is static)
For read-only file system:
> 1. Merge phase: set the consistency point, flush the in-memory fingerprint index buffer and merge it with the on-disk index.
> 2. Traverse the on-disk index for all fingerprints and build the perfect hash function for all fingerprints found
> 3. Traverse all files and mark all fingerprints found as live in perfect hash vector
> 4. select containers with at least one dead chunk, and copy all live chunks from the selected containers into new containers (copy forward), and delete the selected containers.
![image-20200306224943511](../paper_figure/image-20200306224943511.png)

### Implementation and Evaluation

- Evaluation: (using synthetic backup data set)
- Evaluation:
- without deduplication: exclude the deduplication impact on the sanitization process (as the baseline)
- with deduplication: deleted space vs sanitization time
- impact on ingests: the performance when both sanitization and data ingestion run concurrently.
@@ -91,17 +65,9 @@ For read-only file system: (read-only restriction: the key space is static)


## 3. Weakness (Limitations of the paper)
1. In the enumeration phase, it needs to traverse all the files and mark their fingerprints as alive in the $PH_{vec}$ structure. This time depends on the **logical size** of the system.

2. The copy and zero phases are the most time-consuming ones, but they scale linearly with the amount of data that has been deleted.

## 4. Some Insights (Future work)
1. In this paper, it mentions **crypto sanitization**, which encrypts each file with a different key and throws away the keys of the affected files. Is it feasible to adapt this scheme to a deduplication system?
> key management becomes a new complexity
2. Here, it also uses a perfect hash to represent membership and shows that it is memory efficient. How can this technique be adapted to our problem?





28 changes: 0 additions & 28 deletions StoragePaperNote/DropboxClient-ICC'14.md

This file was deleted.
