---
typora-copy-images-to: ../paper_figure
---

Improving the Restore Performance via Physical-Locality Middleware for Backup Systems
------------------------------------------

|     Venue     |       Category        |
| :-----------: | :-------------------: |
| Middleware'20 | Deduplication Restore |

[TOC]

## 1. Summary

### Motivation of this paper

- Motivation
    - Deduplication suffers from low restore performance, since the chunks are heavily `fragmented`
        - the restore process has to read a large number of containers to obtain all chunks
        - this incurs many expensive I/Os to persistent storage
- Limitations of existing work
    - caching schemes
        - fail to alleviate the fragmentation problem, since the chunks are scattered across more and more containers as the backup data grow
    - rewriting schemes
        - the deduplication ratio decreases due to the existence of duplicate chunks
- Open-source
    - https://github.com/iotlpf/HiDeStore

### HiDeStore

- Main insight
    - identify the chunks that are more likely to be shared by subsequent backup versions
        - i.e., hot chunks
    - store those hot chunks together to enhance the physical locality of the new backup versions
    - the other chunks (i.e., those with a low probability of being shared by new backup versions) are stored in containers as in traditional deduplication schemes
        - i.e., cold chunks
- Trace analysis
    - traces: Linux kernel, gcc, fslhomes, and macos
    - a heuristic experiment is conducted on `Destor`
    - version tag
        - indicates the most recent backup version containing the chunk
        - chunks that are not contained in new backup versions keep their old version tags
    - Finding
        - chunks not appearing in the current backup version have a **low probability** of appearing in subsequent backup versions
        - reason: a new backup version is generated by upgrading the old one (a sketch of the version-tag bookkeeping follows this list)
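
A minimal sketch of the version-tag bookkeeping used in the trace analysis, assuming tags are kept per fingerprint in a plain dictionary (an illustration, not Destor's actual data structure):

```python
def update_version_tags(tags, version_id, fingerprints):
    """Record the most recent backup version that contains each chunk.

    tags: dict mapping fingerprint -> last version ID that contained it.
    Fingerprints absent from this version simply keep their old tags,
    which is how likely-cold chunks are identified.
    """
    for fp in fingerprints:
        tags[fp] = version_id
    return tags

def stale_fingerprints(tags, current_version):
    """Chunks still tagged with an old version after processing the
    current one: per the paper's finding, these are unlikely to appear
    in subsequent backup versions."""
    return {fp for fp, v in tags.items() if v < current_version}
```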

- Key idea
    - a **reverse** online deduplication process
        - the fragmented chunks are generated in the old backup versions rather than the new ones
        - only search the chunks with a high probability of being deduplicated against incoming chunks
            - i.e., the chunks in the previous backup version
    - classify the hot and cold chunks and store them separately
        - groups the chunks of the new backup version closely together

- Fingerprint cache with double hash
    - **only search the fingerprint cache**, without further searching the full index table on disk
    - the fingerprint cache mainly contains the chunks with a high duplicate probability



- T1: contains the metadata of the chunks in the previous version
- T2: contains the chunks of the current version (see the sketch after this list)
    - after deduplicating the current version, the chunks not appearing in it remain in T1
        - i.e., the cold chunks
    - the cold chunks are then moved from active containers to archival containers, and the recipes are updated
- index overhead
    - the sizes of T1 and T2 are bounded and rarely full
    - the total size of the hash tables is limited to the `size of one (or two) backup versions`
    - achieves almost the same deduplication ratio as exact deduplication
    - `still has extra storage overhead`
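
A minimal sketch of the reverse deduplication with the two fingerprint tables, assuming in-memory dicts and a toy `write_to_active_container` helper (HiDeStore's real tables and container I/O are more involved):

```python
active_container = []  # toy stand-in for the real active containers

def write_to_active_container(data):
    """Hypothetical helper: append a chunk and return its location."""
    active_container.append(data)
    return ("active", len(active_container) - 1)

def dedup_version(chunks, t1):
    """Deduplicate one backup version against its previous one.

    chunks: iterable of (fingerprint, data) for the current version.
    t1:     dict fp -> location for the previous version (table T1).
    Returns (recipe, t2, cold): t2 becomes T1 for the next version,
    and the entries left behind in t1 are the cold chunks to archive.
    """
    t2, recipe = {}, []
    for fp, data in chunks:
        if fp in t1:               # hot: shared with the previous version
            loc = t1.pop(fp)       # move its entry from T1 to T2
        elif fp in t2:             # duplicate within the current version
            loc = t2[fp]
        else:                      # new chunk: write to an active container
            loc = write_to_active_container(data)
        t2[fp] = loc
        recipe.append((fp, loc))
    return recipe, t2, t1          # leftovers in t1 are the cold chunks
```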

- Chunk filter to separate chunks
    - separately stores the hot and cold chunks by changing the **writing paths** of the chunks
    - Container structure (see the sketch after this list)
        - metadata section: container ID, data size, `hash table` (key: fingerprint, value: a pointer to the corresponding chunk)
        - data section: the real data of the contained chunks
    - the hot chunks are temporarily stored in **active** containers during the deduplication phase
    - needs to compact the active containers to enhance physical locality: sparse containers are directly merged into the same container **without considering the order**
        - since the cold chunks have already been moved to the archival containers
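
A minimal in-memory sketch of the container layout described above (field names are illustrative; the on-disk format is not detailed in this note):

```python
class Container:
    """Toy model of a container: metadata section + data section."""

    def __init__(self, cid):
        # metadata section
        self.cid = cid            # container ID
        self.data_size = 0        # total bytes in the data section
        self.index = {}           # hash table: fingerprint -> (offset, length)
        # data section
        self.data = bytearray()   # raw bytes of the contained chunks

    def add(self, fp, chunk):
        """Append a chunk and record its location in the metadata index."""
        offset = len(self.data)
        self.data += chunk
        self.index[fp] = (offset, len(chunk))
        self.data_size += len(chunk)
        return offset

    def read(self, fp):
        """Look up a chunk's bytes by its fingerprint."""
        offset, length = self.index[fp]
        return bytes(self.data[offset:offset + length])
```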

- Update recipes
    - all recipes form a **chain**
        - restoring an old backup needs to read **multiple recipes** to find the concrete location of each chunk
            - incurs `high latency`
    - the recipes need to be updated when the cold chunks are moved
        - since the locations of those chunks are modified


- Restore phase
    - the recipe contains three types of container ID (CID), resolved as sketched below
        - positive CID: the chunk is in an archival container
        - negative CID: refers to a backup version; the chunk's location is found via that version's recipe
        - 0: indicates the active container
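
A minimal sketch of resolving the three CID types during restore, reusing the toy `Container` above and assuming recipes are dicts from fingerprint to CID (the recursion mirrors the recipe chain; names are illustrative):

```python
def resolve_chunk(fp, cid, archival, active, recipes):
    """Follow a recipe entry to the chunk's actual bytes.

    archival: dict container_id -> Container (archival containers)
    active:   the current active Container
    recipes:  dict version_id -> {fingerprint: cid} recipe maps
    """
    if cid > 0:                    # positive CID: archival container
        return archival[cid].read(fp)
    if cid == 0:                   # zero: chunk is still in an active container
        return active.read(fp)
    # negative CID: the chunk moved forward with backup version -cid;
    # follow that version's recipe (the recipes form a chain)
    next_cid = recipes[-cid][fp]
    return resolve_chunk(fp, next_cid, archival, active, recipes)
```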

- Garbage collection
    - HiDeStore already stores the chunks belonging to different backup versions **separately**
    - so deleting a backup does not require expensive chunk detection and migration

### Implementation and Evaluation

- Evaluation
    - Restore baselines: capping, ALACC, and FBW
    - Deduplication performance baselines: DDFS, Sparse Index, and Silo
    - Traces: kernel, gcc, fslhomes, and macos
    - Deduplication ratio
        - HiDeStore achieves almost the same deduplication ratio as DDFS
    - Deduplication throughput
        - Destor counts the number of **requests to the full index table**
            - `not absolute throughput` (it simulates the requests to disks)
        - HiDeStore only needs to search the fingerprint cache, without frequently accessing the full index table on disk
            - uses prefetching to load the index into the fingerprint cache in advance
    - Space consumption of the index table
        - no extra space is needed to store the indexes, since HiDeStore deduplicates one backup version against its previous one
        - the fingerprint indexes of all chunks in the previous backup version are already stored in the recipe
    - Restore performance
        - metric: the mean data size restored per container
    - HiDeStore overhead
        - updating recipes
        - moving the chunks from active containers to archival containers

- Deletion
    - HiDeStore does not need any extra effort for GC

## 2. Strength (Contributions of the paper)

- proposes the idea of separately storing the hot chunks and cold chunks for each backup

## 3. Weakness (Limitations of the paper)

- this work only considers the single-client backup case
    - this assumption seems weak in my view; how would the scheme handle backups from multiple clients?

## 4. Some Insights (Future work)

- recipe background
    - the data structure of the recipe is a chunk **list**, and each item contains:
        - a chunk fingerprint (20 bytes)
        - the ID of the container (4 bytes)
        - the offset in the container (4 bytes)
    - the restore process assembles the data stream in memory in a chunk-by-chunk manner (a packing sketch follows this list)
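
A minimal sketch of the 28-byte recipe entry using the field sizes above, assuming a big-endian layout and a signed container ID so that negative CIDs can reference backup versions (the actual on-disk encoding is not given in this note):

```python
import struct

# 20-byte fingerprint + 4-byte container ID + 4-byte offset = 28 bytes
RECIPE_ENTRY = struct.Struct(">20sii")  # signed CID: negative values name versions

def pack_entry(fp: bytes, cid: int, offset: int) -> bytes:
    assert len(fp) == 20                 # e.g., a SHA-1 digest
    return RECIPE_ENTRY.pack(fp, cid, offset)

def unpack_entry(buf: bytes):
    return RECIPE_ENTRY.unpack(buf)      # -> (fp, cid, offset)
```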

- Workload dependence
    - one can simply `trace the chunk distribution` among versions and decide whether to use the proposed scheme
    - the overhead is low, since it only needs to trace the metadata of the chunks