update

Zuoru Yang committed Sep 23, 2020
1 parent 1640a20 commit 66ce89e

Showing 17 changed files with 226 additions and 226 deletions.

13 changes: 8 additions & 5 deletions README.md
@@ -66,7 +66,7 @@ In this repo, it records some paper related to storage system, including **Data
### Secure Deduplication
1. *Convergent Dispersal: Toward Storage-Efficient Security in a Cloud-of-Clouds*----HotStorage'14 ([link](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/hotstorage14.pdf)) [summary](https://yzr95924.github.io/paper_summary/CAONT-RS-HotStorage'14.html)
2. *CDStore: Toward Reliable, Secure, and Cost-Efficient Cloud Storage via Convergent Dispersal*----USENIX ATC'15 ([link](https://www.usenix.org/system/files/conference/atc15/atc15-paper-li-mingqiang.pdf)) [summary](https://yzr95924.github.io/paper_summary/CDStore-ATC'15.html)
3. *Information Leakage in Encrypted Deduplication via Frequency Analysis*----DSN'17
3. *Information Leakage in Encrypted Deduplication via Frequency Analysis*----DSN'17 ([link](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/dsn17.pdf))
4. *DupLESS: Server-Aided Encryption for Deduplicated Storage*----USENIX Security'13 ([link](https://eprint.iacr.org/2013/429.pdf)) [summary](https://yzr95924.github.io/paper_summary/DupLESS-Security'13.html)
5. *Side Channels in Cloud Services, the Case of Deduplication in Cloud Storage*----S&P'10 ([link](http://www.pinkas.net/PAPERS/hps.pdf)) [summary](https://yzr95924.github.io/paper_summary/SideChannel-S&P'10.html)
6. *Side Channels in Deduplication: Trade-offs between Leakage and Efficiency*----AsiaCCS'17 ([link](https://dl.acm.org/doi/abs/10.1145/3052973.3053019)) [summary](https://yzr95924.github.io/paper_summary/SideChannelTradeOffs-AsiaCCS'17.html)
@@ -141,6 +141,7 @@ In this repo, it records some paper related to storage system, including **Data
7. *MUCH: Multi-threaded Content-Based File Chunking*----TC'15
8. *Multi-Level Comparison of Data Deduplication in a Backup Scenario*----SYSTOR'09
9. *A Framework for Analyzing and Improving Content-Based Chunking Algorithms*----HP Technical Report'05
10. *FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication*----USENIX ATC'16 ([link](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)) [summary](https://yzr95924.github.io/paper_summary/FastCDC-ATC'16.html)

### Cache Deduplication

@@ -149,11 +150,12 @@ In this repo, it records some paper related to storage system, including **Data
3. *Nitro: A Capacity-Optimized SSD Cache for Primary Storage*----USENIX ATC'14 ([link](https://www.usenix.org/system/files/conference/atc14/atc14-paper-li_cheng_nitro.pdf))

### Benchmark

1. *SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-gracia-tinedo.pdf))

### Garbage Collection

1. *Memory Efficient Sanitization of a Deduplicated Storage System*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final100_0.pdf))
1. *Memory Efficient Sanitization of a Deduplicated Storage System*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final100_0.pdf)) [summary](https://yzr95924.github.io/paper_summary/MemorySanitization-FAST'13.html)
2. *Accelerating Restore and Garbage Collection in Deduplication-based Backup System via Exploiting Historical Information*----USENIX ATC'14 ([link](https://pdfs.semanticscholar.org/9b8d/a007a6801c9f96784dc7bc839794cb0db3ad.pdf)) [summary](https://yzr95924.github.io/paper_summary/AcceleratingRestore-ATC'14.html)
3. *The Logic of Physical Garbage Collection in Deduplicating Storage*----FAST'17 ([link](https://www.usenix.org/system/files/conference/fast17/fast17-douglis.pdf))
4. *Concurrent Deletion in a Distributed Content-addressable Storage System with Global Deduplication*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final91.pdf))
@@ -223,11 +225,12 @@ In this repo, it records some paper related to storage system, including **Data
3. *Latency Reduction and Load Balancing in Coded Storage Systems*----SoCC'17

## C. Security

### Survey
1. *A Survey on Systems Security Metrics*----ACM Computing Surveys'16

### Secret Sharing
1. *How to Best Share a Big Secret*----SYSTOR'18
1. *How to Best Share a Big Secret*----SYSTOR'18 ([link](http://www.systor.org/2018/pdf/systor18-24.pdf)) [summary](https://yzr95924.github.io/paper_summary/ShareBigSecret-SYSTOR'18.html)
2. *AONT-RS: Blending Security and Performance in Dispersed Storage Systems*----FAST'11
3. *Secure Deletion for a Versioning File System*----FAST'05
4. *Splinter: Practical Private Queries on Public Data*----NSDI'17
@@ -236,7 +239,7 @@ In this repo, it records some paper related to storage system, including **Data
### Data Encryption

1. *Differentially Private Access Patterns for Searchable Symmetric Encryption*----INFOCOM'18 [summary](https://yzr95924.github.io/paper_summary/DifferentialPrivacy-INFOCOM'18.html)
2. *Frequency-Hiding Order-Preserving Encryption*----CCS'15
2. *Frequency-Hiding Order-Preserving Encryption*----CCS'15 ([link](https://dl.acm.org/doi/10.1145/2810103.2813629))
3. *RAPPOR: Randomized Aggregable Privacy-Preserving Ordinal Response*----CCS'14
4. *Privacy at Scale: Local Differential Privacy in Practice*----SIGMOD'18
5. *Frequency-smoothing Encryption: Preventing Snapshot Attacks on Deterministically Encrypted Data*----IACR'17 [summary](https://yzr95924.github.io/paper_summary/FrequencySmoothing-ICAR'17.html)
@@ -252,7 +255,7 @@ In this repo, it records some paper related to storage system, including **Data
15. *Oblivious RAM as a Substrate for Cloud Storage - The Leakage Challenge Ahead*----CCSW'16 ([link](https://dl.acm.org/citation.cfm?id=2996430)) [summary](https://yzr95924.github.io/paper_summary/ORAM-CCSW'16.html)
16. *Oblivious RAM: A Dissection and Experimental Evaluation*---VLDB'16 ([link](http://www.vldb.org/pvldb/vol9/p1113-chang.pdf))
17. *Splinter: Practical Private Queries on Public Data*----NSDI'17 ([link](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-wang-frank.pdf))
18. *Quantifying Information Leakage of Deterministic Encryption*----CCSW'19 ([link]( http://users.cs.fiu.edu/~mjura011/documents/2019_CCSW_Quantifying_Information_Leakage_of_Deterministic_Encryption ))
18. *Quantifying Information Leakage of Deterministic Encryption*----CCSW'19 ([link](http://users.cs.fiu.edu/~mjura011/documents/2019_CCSW_Quantifying_Information_Leakage_of_Deterministic_Encryption)) [summary](https://yzr95924.github.io/paper_summary/QuantifyingInformationLeakage-CCSW'19.html)

### Secure Deletion

2 changes: 2 additions & 0 deletions StoragePaperNote/ChunkingAnalysisFramework-HP-TR'05.md
@@ -14,9 +14,11 @@ This paper proposes a framework to analyze the content-based chunking algorithms
> focus on **stateless chunking algorithms**, which do not consider the history of the sequence or the state of a server where other versions of the sequence might be stored.
**Chunking stability**: if a small modification is made to the data, turning it into a new version, and the chunking algorithm is applied to the new version of the data

> most of the chunks created for the new version are identical to the chunks created for the older version of the data.
### Two Thresholds, Two Divisors Algorithm (TTTD)

- Analysis on Basic Sliding Window Algorithm (BSW)
Basic workflow of the BSW:
there is a pre-determined integer $D$; a fixed-width sliding window is moved across the file, and at every position in the file.
103 changes: 103 additions & 0 deletions StoragePaperNote/Deduplication/Chunking/FastCDC-ATC'16.md
@@ -0,0 +1,103 @@
---
typora-copy-images-to: ../paper_figure
---
FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| ATC'16 | chunking |
[TOC]

## 1. Summary

### Motivation of this paper
- Motivation
- Existing CDC-based chunking introduces heavy CPU overhead
- declare the chunk cut-points by computing and judging the rolling hashes of the data stream **byte-by-byte**.
- Two parts: hashing and judging the cut-points
- By using the Gear hash function, the bottleneck shifts to the hash judgment stage.

### FastCDC

- Three key designs
- Simplified but enhanced hash judgment
- padding several zero bits into the mask
- Sub-minimum chunk cut-point skipping
- enlarge the minimum chunk size to maximize the chunking speed
- Normalized chunking
- normalizes the chunk size distribution to a small specified region
- increase the deduplication ratio
- reduce the number of small-sized chunks (can be combined with the cut-point skipping technique above to maximize the CDC speed without sacrificing the deduplication ratio)

- Gear hash function
- an array of 256 random 64-bit integers to map the values of the byte contents in the sliding window.
- using only three operations (i.e., +, <<, and an array lookup)
- enabling it to move quickly through the data content for the purpose of CDC (a minimal sketch follows the figures below)

![image-20200918215222896](../paper_figure/image-20200918215222896.png)
![image-20200918215434370](../paper_figure/image-20200918215434370.png)
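
A minimal sketch of the Gear rolling hash described above; the table name `GEAR_TABLE` and the splitmix64-based table initialization are illustrative choices, not the paper's code:

```c
#include <stdint.h>

/* 256 random 64-bit integers, generated once; any good PRNG will do.
 * splitmix64 is used here purely as an illustrative generator. */
uint64_t GEAR_TABLE[256];

void gear_table_init(uint64_t seed)
{
    for (int b = 0; b < 256; b++) {
        uint64_t z = (seed += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        GEAR_TABLE[b] = z ^ (z >> 31);
    }
}

/* One rolling step: a shift, an add, and a table lookup. The left shift
 * gradually ages out old bytes, so the effective sliding window is bounded
 * by the 64-bit word width. */
static inline uint64_t gear_roll(uint64_t fp, uint8_t byte)
{
    return (fp << 1) + GEAR_TABLE[byte];
}
```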

- Optimizing hash judgement
- Gear-based CDC employs the same conventional hash judgment used in the Rabin-based CDC
- A certain number of the lowest bits of the fingerprint are used to declare the chunk cut-point.
- FastCDC enlarges the sliding window size by padding a number of zero bits into the mask value
- change the hash judgment statement
- involve more bytes in the final hash judgment
- minimizing the probability of chunking position collision
- simplifying the hash judgment to accelerate CDC
- in Rabin: fp mod D == r
- in FastCDC: `fp & Mask == 0`, i.e., `!(fp & Mask)` (see the snippet after this list)
- avoid the unnecessary comparison operation
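
As a small illustration of the contrast above (the divisor, remainder, and mask value are illustrative, not the paper's parameters):

```c
#include <stdint.h>

/* Rabin-style judgment: a modulo plus a comparison at every byte position. */
static int rabin_is_cutpoint(uint64_t fp, uint64_t d, uint64_t r)
{
    return fp % d == r;
}

/* FastCDC-style judgment: a single AND against a mask whose '1' bits are
 * spread out by padded zero bits, so more fingerprint bits take part in the
 * decision without enlarging the physical sliding window. */
#define CDC_MASK 0x0000d93003530000ULL   /* illustrative mask value */
static int fastcdc_is_cutpoint(uint64_t fp)
{
    return !(fp & CDC_MASK);
}
```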

- Cut-point skipping
- avoid the operations for hash calculation and judgment in the skipped region.
- may reduce the deduplication ratio.
- the cumulative distribution of chunk size in Rabin-based CDC (without the maximum and minimum chunk size requirements) follows **an exponential distribution**.

- Normalized chunking
- solves the problem of the decreased deduplication ratio faced by the cut-point skipping approach.
- After normalized chunking, there are almost no chunks of size smaller than the minimum chunk size
- By changing the number of '1' bits in FastCDC, the chunk-size distribution will be approximately normalized to a specific region (always larger than the minimum chunk size, instead of following the exponential distribution)
- define two masks
- more effective mask bits: increase chunk size
- fewer effective mask bits: reduce chunk size
![image-20200918224123859](../paper_figure/image-20200918224123859.png)

- The whole algorithm

![image-20200919014910600](../paper_figure/image-20200919014910600.png)
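
A simplified sketch of the whole algorithm shown above, combining the optimized hash judgment, sub-minimum cut-point skipping, and normalized chunking with two masks. It reuses `GEAR_TABLE`/`gear_table_init` from the Gear sketch earlier, and all size and mask constants are illustrative rather than the paper's exact parameters:

```c
#include <stdint.h>
#include <stddef.h>

/* GEAR_TABLE is filled once by gear_table_init() (see the earlier sketch). */
extern uint64_t GEAR_TABLE[256];

enum {
    MIN_SIZE    = 2 * 1024,    /* sub-minimum region: skipped entirely   */
    NORMAL_SIZE = 8 * 1024,    /* target chunk size after normalization  */
    MAX_SIZE    = 64 * 1024
};

/* MASK_S has more effective '1' bits (harder to match -> suppresses small
 * chunks); MASK_L has fewer (easier to match -> suppresses large chunks). */
#define MASK_S 0x0003590703530000ULL
#define MASK_L 0x0000d90003530000ULL

/* Returns the length of the next chunk starting at data[0]. */
static size_t fastcdc_next_chunk(const uint8_t *data, size_t len)
{
    uint64_t fp = 0;
    size_t normal = NORMAL_SIZE;
    size_t i = MIN_SIZE;    /* cut-point skipping: no hashing below MIN_SIZE */

    if (len <= MIN_SIZE)
        return len;
    if (len >= MAX_SIZE)
        len = MAX_SIZE;
    else if (len <= NORMAL_SIZE)
        normal = len;

    /* Below the normal size: use the stricter mask. */
    for (; i < normal; i++) {
        fp = (fp << 1) + GEAR_TABLE[data[i]];
        if (!(fp & MASK_S))
            return i + 1;
    }
    /* Above the normal size: use the looser mask. */
    for (; i < len; i++) {
        fp = (fp << 1) + GEAR_TABLE[data[i]];
        if (!(fp & MASK_L))
            return i + 1;
    }
    return len;    /* no cut-point found: fall back to the maximum size */
}
```

A caller would invoke `gear_table_init` once, then call `fastcdc_next_chunk` repeatedly on the remaining buffer, advancing by the returned length each time.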

### Implementation and Evaluation

- Evaluation standard
- deduplication ratio
- chunking speed
- the average generated chunk size

- Compared with
- FastCDC
- Gear-based
- AE-based
- Rabin-based

- Evaluation of optimizing hash judgement
- Evaluation of cut-point skipping
- Evaluation of normalized chunking
- Comprehensive evaluation of FastCDC

## 2. Strength (Contributions of the paper)
1. propose a new chunking algorithm with three new designs
> enhanced hash judgment
> sub-minimum chunk cut-point skipping
> normalized chunking

## 3. Weakness (Limitations of the paper)

1. the part about cut-point skipping is not clearly explained

## 4. Some Insights (Future work)

1. The research directions in chunking algorithms
> algorithmic-oriented CDC optimizations
> hardware-oriented CDC optimizations
52 changes: 9 additions & 43 deletions StoragePaperNote/Deduplication/GC/MemorySanitization-FAST'13.md
Expand Up @@ -5,7 +5,7 @@ Memory Efficient Sanitization of a Deduplicated Storage System
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| FAST'13 | Deduplication Sanitization (EMC) |
| FAST'13 | Deduplication Sanitization |
[TOC]

## 1. Summary
@@ -22,13 +22,6 @@ Securely erasing sensitive data from a storage system (sanitization could requir
> 3. the whole sanitization process is efficient
> 4. the storage system is usable while sanitization runs
- Key idea:
multiple sanitization techniques that trade-off I/O and memory requirements
> the tracking of references is the main problem to solve to efficiently sanitize a deduplicated storage system.
Instead of needing a **dynamic** data structure that can handle insertions, it can optimize with a **static** data structure for our reference set.
> perfect hashes
### Sanitization
- Threat model
1. casual attack
@@ -43,44 +36,25 @@ Instead of needing a **dynamic** data structure that can handle insertions, it c
> require specific disk format
- Managing chunk references
1. copy to a clean system
copy the live data to a new storage system, and then destroy the original storage system.
2. reference index
maintaining correct reference counts is challenging (**live reference counts are not preferred**)
> 1. Partitioned reference index
> 2. Bloom filter: approximate a full index using a Bloom filter (check whether a key exists in the Bloom filter)
> > enumerate the live files and insert the fingerprints into the Bloom filter (may consider dead chunks as live chunks)
3. Bit vector
allocate a bit vector that is indexed by container number and offset within a container.
> the bottleneck has moved to the construction of the bit-vector. Check the meta region of the container.
4. Perfect hash vector
**key requirement:** need a **compact** representation of a set of fingerprints that provides an **exact** answer for whether a given fingerprint exists in the set or not.
> suppose it is a static version of the membership problem where the key space is known **beforehand**.
1. Bloom filter
2. Bit vector (see the sketch after this list)
3. Perfect hash vector
**key requirement:** need a compact representation of a set of fingerprints that provides an exact answer for whether a given fingerprint exists in the set or not.
> suppose it is a static version of the membership problem where the key space is known beforehand.
> no dynamic insertion or deletion of keys.
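
A minimal sketch of the bit-vector option, assuming each chunk is addressed by a (container number, offset-within-container) pair and a hypothetical fixed number of chunk slots per container; error handling is omitted:

```c
#include <stdint.h>
#include <stdlib.h>

#define SLOTS_PER_CONTAINER 1024   /* hypothetical: max chunk slots per container */

typedef struct {
    uint8_t *bits;             /* one bit per (container, offset) slot */
    size_t   num_containers;
} live_bitvec_t;

static live_bitvec_t *bitvec_create(size_t num_containers)
{
    live_bitvec_t *v = malloc(sizeof(*v));
    v->num_containers = num_containers;
    v->bits = calloc((num_containers * SLOTS_PER_CONTAINER + 7) / 8, 1);
    return v;
}

/* Mark a chunk as live while enumerating the live files. */
static void bitvec_mark_live(live_bitvec_t *v, size_t container, size_t offset)
{
    size_t idx = container * SLOTS_PER_CONTAINER + offset;
    v->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

/* During copy-forward, chunks whose bit is still 0 are dead and can be
 * sanitized. */
static int bitvec_is_live(const live_bitvec_t *v, size_t container, size_t offset)
{
    size_t idx = container * SLOTS_PER_CONTAINER + offset;
    return (v->bits[idx / 8] >> (idx % 8)) & 1u;
}
```
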
Lookup cost and generation cost:
> lookup cost: the number of random memory accesses needed for each lookup.
> generation cost: a linear function on the number of fingerprints.
The amount of memory will depend on the capacity of the system since the total number of fingerprints also depends on that.

Using two levels of hash functions to construct a perfect hash function

These two points support leveraging a perfect hash vector.
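
A minimal FKS-style sketch of this two-level idea, assuming fingerprints are abbreviated to 64-bit integers and the key set is static and duplicate-free. The paper's actual construction is more compact than the quadratic bucket tables used here, and all names (`phf_build`, `phvec_mark_live`, ...) are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Seeded 64-bit mixer (splitmix64 finalizer) used for both hash levels. */
static uint64_t mix64(uint64_t x, uint64_t seed)
{
    x += seed + 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

typedef struct {
    size_t    n_buckets;
    uint64_t *seed;     /* per-bucket second-level seed                */
    size_t   *offset;   /* start of each bucket's slot range           */
    size_t   *bsize;    /* slots reserved for each bucket (size^2)     */
    size_t    n_slots;
} phf_t;

/* Map a fingerprint to its unique slot (valid only for keys in the build set). */
static size_t phf_slot(const phf_t *p, uint64_t fp)
{
    size_t b = mix64(fp, 0) % p->n_buckets;
    return p->offset[b] + mix64(fp, p->seed[b]) % p->bsize[b];
}

/* Build over a static set of distinct fingerprints (no insert/delete later). */
static phf_t *phf_build(const uint64_t *keys, size_t n)
{
    phf_t *p = calloc(1, sizeof(*p));
    p->n_buckets = n ? n : 1;
    p->seed   = calloc(p->n_buckets, sizeof(*p->seed));
    p->offset = calloc(p->n_buckets, sizeof(*p->offset));
    p->bsize  = calloc(p->n_buckets, sizeof(*p->bsize));

    /* First level: count bucket sizes; a bucket with b keys gets b*b slots,
     * which makes finding a collision-free second-level seed easy. */
    size_t *count = calloc(p->n_buckets, sizeof(*count));
    for (size_t i = 0; i < n; i++)
        count[mix64(keys[i], 0) % p->n_buckets]++;
    for (size_t b = 0; b < p->n_buckets; b++) {
        p->offset[b] = p->n_slots;
        p->bsize[b]  = count[b] ? count[b] * count[b] : 1;
        p->n_slots  += p->bsize[b];
    }

    /* Second level: per bucket, try seeds until its keys land on distinct
     * slots. (A real builder would group keys per bucket first; this sketch
     * rescans all keys for brevity.) */
    uint8_t *used = malloc(p->n_slots);
    for (size_t b = 0; b < p->n_buckets; b++) {
        for (uint64_t s = 1; ; s++) {
            memset(used + p->offset[b], 0, p->bsize[b]);
            int ok = 1;
            for (size_t i = 0; i < n && ok; i++) {
                if (mix64(keys[i], 0) % p->n_buckets != b) continue;
                size_t slot = p->offset[b] + mix64(keys[i], s) % p->bsize[b];
                if (used[slot]) ok = 0; else used[slot] = 1;
            }
            if (ok) { p->seed[b] = s; break; }
        }
    }
    free(used);
    free(count);
    return p;
}

/* Perfect hash vector: one liveness bit per slot, indexed by phf_slot(). */
typedef struct { const phf_t *phf; uint8_t *bits; } phvec_t;

static void phvec_mark_live(phvec_t *v, uint64_t fp)
{
    size_t s = phf_slot(v->phf, fp);
    v->bits[s / 8] |= (uint8_t)(1u << (s % 8));
}

static int phvec_is_live(const phvec_t *v, uint64_t fp)
{
    size_t s = phf_slot(v->phf, fp);
    return (v->bits[s / 8] >> (s % 8)) & 1u;
}
```

Because every queried fingerprint comes from the on-disk index that was used to build the structure, the membership answer is exact, unlike a Bloom filter.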

- Sanitization process
For read-only file system: (read-only restriction: the key space is static)
For read-only file system:
> 1. Merge phase: set the consistency point, flush the in-memory fingerprint index buffer and merge it with the on-disk index.
> 2. Traverse the on-disk index for all fingerprints and build the perfect hash function for all fingerprints found
> 3. Traverse all files and mark all fingerprints found as live in perfect hash vector
> 4. select containers with at least one dead chunk, and copy all live chunks from the selected containers into new containers (copy forward), and delete the selected containers.
![image-20200306224943511](../paper_figure/image-20200306224943511.png)

### Implementation and Evaluation

- Evaluation: (using synthetic backup data set)
- Evaluation:
- without deduplication: exclude the deduplication impact on the sanitization process (as the baseline)
- with deduplication: deleted space vs sanitization time
- impact on ingests: the performance when both sanitization and data ingestion run concurrently.
@@ -91,17 +65,9 @@ For read-only file system: (read-only restriction: the key space is static)


## 3. Weakness (Limitations of the paper)
1. In the enumeration phase, it needs to traverse all the files and mark their fingerprints as alive in the $PH_{vec}$ structure. This time depends on the **logical size** of the system.

2. The copy and zero phases are the most time-consuming ones, but they scale linearly with the amount of data that has been deleted.

## 4. Some Insights (Future work)
1. In this paper, it mentions **crypto sanitization**, which encrypts each file with a different key and throws away the keys of the affected files. Is it feasible to adapt this scheme to a deduplication system?
> key management becomes a new complexity
2. Here, it also uses a perfect hash to represent membership and shows that it is memory efficient. How can this technique be adapted to our problem?





28 changes: 0 additions & 28 deletions StoragePaperNote/DropboxClient-ICC'14.md

This file was deleted.
