Commit: update
yzr95924 committed Jun 5, 2021
1 parent 26d9a25 commit 7d94eaa
Showing 13 changed files with 166 additions and 211 deletions.
33 changes: 16 additions & 17 deletions README.md
@@ -3,17 +3,6 @@
This repo records papers related to storage systems, including **Data Deduplication** (aka, dedup), **Erasure Coding** (aka, EC), general **Distributed Storage Systems** (aka, DSS), and other related topics (e.g., network security), updated from time to time~
[TOC]

## A. Data Deduplication

### Summary
@@ -22,6 +11,7 @@
3. *A Survey of Secure Data Deduplication Schemes for Cloud Storage Systems*----ACM Computing Surveys'17 ([link](https://dl.acm.org/citation.cfm?id=3017428))
4. *A Survey of Classification of Storage Deduplication Systems*----ACM Computing Surveys'14 ([link](https://dl.acm.org/citation.cfm?id=2611778))
5. *Understanding Data Deduplication Ratios*----SNIA'08 ([link](https://www.snia.org/sites/default/files/Understanding_Data_Deduplication_Ratios-20080718.pdf))
6. *Backup to the Future: How Workload and Hardware Changes Continually Redefine Data Domain File Systems*----IEEE Computer'17 ([link](https://ieeexplore.ieee.org/abstract/document/7971884))

### Workload Analysis
1. *Characteristics of Backup Workloads in Production Systems*----FAST'12 ([link](http://www.usenix.net/legacy/events/fast12/tech/full_papers/Wallace2-9-12.pdf)) [summary](https://yzr95924.github.io/paper_summary/BackupWorkloads-FAST'12.html)
@@ -65,7 +55,7 @@
6. *Sliding Look-Back Window Assisted Data Chunk Rewriting for Improving Deduplication Restore Performance*----FAST'19 ([link](https://www.usenix.org/system/files/fast19-cao.pdf)) [summary](https://yzr95924.github.io/paper_summary/LookBackWindow-FAST'19.html)
7. *Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final124.pdf)) [summary](https://yzr95924.github.io/paper_summary/ImproveRestore-FAST'13.html)
8. *Chunk Fragmentation Level: An Effective Indicator for Read Performance Degradation in Deduplication Storage*----HPCC'11
9. *Improving the Restore Performance via Physical Locality Middleware for Backup Systems*----Middleware'20 ([link](https://dl.acm.org/doi/pdf/10.1145/3423211.3425691)) [summary](https://yzr95924.github.io/paper_summary/HiDeStore-Middleware'20.html)
10. *Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage*----ToS'14 ([link](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/tos14revdedup.pdf))

### Secure Deduplication
@@ -241,9 +231,7 @@
### Secret Sharing
1. *How to Best Share a Big Secret*----SYSTOR'18 ([link](http://www.systor.org/2018/pdf/systor18-24.pdf)) [summary](https://yzr95924.github.io/paper_summary/ShareBigSecret-SYSTOR'18.html)
2. *AONT-RS: Blending Security and Performance in Dispersed Storage Systems*----FAST'11
3. *Splinter: Practical Private Queries on Public Data*----NSDI'17

### Data Encryption

@@ -288,6 +276,7 @@
9. *Regaining Lost Seconds: Efficient Page Preloading for SGX Enclaves*----Middleware'20 ([link](https://dl.acm.org/doi/pdf/10.1145/3423211.3425673))
10. *Everything You Should Know About Intel SGX Performance on Virtualized Systems*----SIGMETRICS'19 ([link](https://dl.acm.org/doi/pdf/10.1145/3322205.3311076)) [summary](https://yzr95924.github.io/paper_summary/SGXPerformance-SIGMETRICS'19.html)
11. *A Comparison Study of Intel SGX and AMD Memory Encryption Technology*----HASP'18 ([link](https://dl.acm.org/doi/abs/10.1145/3214292.3214301))
12. *SGXoMeter: Open and Modular Benchmarking for Intel SGX*----EuroSec'21 ([link](https://www.ibr.cs.tu-bs.de/users/mahhouk/papers/eurosec2021.pdf))

### SGX Storage

@@ -296,12 +285,12 @@
3. *EnclaveDB: A Secure Database using SGX*----S&P'18 ([link](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8418608))
4. *Isolating Operating System Components with Intel SGX*----SysTEX'16 ([link](https://faui1-files.cs.fau.de/filepool/projects/sgx-kernel/sgx-kernel.pdf))
5. *SPEICHER: Securing LSM-based Key-Value Stores using Shielded Execution*----FAST'19 ([link](https://www.usenix.org/system/files/fast19-bailleu.pdf)) [summary](https://yzr95924.github.io/paper_summary/SPEICHER-FAST'19.html)
6. *ShieldStore: Shielded In-memory Key-Value Storage with SGX*----EuroSys'19 ([link](http://calab.kaist.ac.kr:8080/~jhuh/papers/kim_eurosys19_shieldst.pdf)) [summary](https://yzr95924.github.io/paper_summary/ShieldStore-EuroSys'19.html)
7. *SeGShare: Secure Group File Sharing in the Cloud using Enclaves*----DSN'20 ([link](http://www.fkerschbaum.org/dsn20.pdf)) [summary](https://yzr95924.github.io/paper_summary/SeGShare-DSN'20.html)
8. *DISKSHIELD: A Data Tamper-Resistant Storage for Intel SGX*----AsiaCCS'20 ([link](https://dl.acm.org/doi/pdf/10.1145/3320269.3384717))
9. *SPEED: Accelerating Enclave Applications via Secure Deduplication*----ICDCS'19 ([link](https://conferences.computer.org/icdcs/2019/pdfs/ICDCS2019-49XpIlu3rRtYi2T0qVYnNX/5DGHpUvuZKbyIr6VRJc0zW/5PfoKBVnBKUPCcy8ruoayx.pdf)) [summary](https://yzr95924.github.io/paper_summary/SPEED-ICDCS'19.html)
9. *Secure In-memory Key-Value Storage with SGX*----SoCC'18
10. *EnclaveCache: A Secure and Scalable Key-value Cache in Multi-tenant Clouds using Intel SGX*----Middleware'19 ([link](https://dl.acm.org/doi/pdf/10.1145/3361525.3361533)) [summary](https://yzr95924.github.io/paper_summary/EnclaveCache-Middleware'19.html)

### Network Security

@@ -334,9 +323,19 @@
### Hash
1. *Compare-by-Hash: A Reasoned Analysis*----USENIX ATC'06 ([link](https://www.usenix.org/legacy/event/usenix06/tech/full_papers/black/black.pdf)) [summary](https://yzr95924.github.io/paper_summary/CompareByHash-ATC'06.html)
2. *An Analysis of Compare-by-Hash*----HotOS'03 ([link](http://www.cs.utah.edu/~shanth/stuff/research/dup_elim/hash_cmp.pdf))
3. *On-the-Fly Verification of Rateless Erasure Codes for Efficient Content Distribution*----S&P'04 ([link](https://pdos.csail.mit.edu/papers/otfvec/paper.pdf))

### Streaming Process
1. *A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring*----IPDPS'10

### Continuous Data Protection & Versioning

1. *Design and Implementation of Verifiable Audit Trails for a Versioning File System*----FAST'07 ([link](https://static.usenix.org/event/fast07/tech/full_papers/peterson/peterson.pdf))
2. *Architectures for controller based CDP*----FAST'07 ([link](https://static.usenix.org/events/fast07/tech/full_papers/laden/laden.pdf))
3. *File Versioning for Block-Level Continuous Data Protection*----ICDCS'07
4. *Cloud object storage based Continuous Data Protection (cCDP)*----NAS'15
5. *Secure Deletion for a Versioning File System*----FAST'05 ([link](https://static.usenix.org/events/fast05/tech/full_papers/peterson/peterson.pdf))
6. *Secure File System Versioning at the Block Level*----EuroSys'07 ([link](https://dl.acm.org/doi/pdf/10.1145/1272996.1273018))



File renamed without changes.
@@ -0,0 +1,150 @@
---
typora-copy-images-to: ../paper_figure
---
Improving the Restore Performance via Physical-Locality Middleware for Backup Systems
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| Middleware'20 | Deduplication Restore |
[TOC]

## 1. Summary

### Motivation of this paper
- Motivation
  - deduplication suffers from low restore performance, since the chunks become heavily `fragmented`
    - the restore has to read a large number of containers to obtain all the chunks
    - incurring many expensive I/Os to persistent storage
- Limitations of existing work
  - caching schemes
    - fail to alleviate the fragmentation problem, since the chunks are scattered into more containers as the backup data grow
  - rewriting schemes
    - the deduplication ratio decreases due to the existence of duplicate (rewritten) chunks
- Open-source
  - https://github.com/iotlpf/HiDeStore

### HiDeStore

- Main insight
  - identify the chunks that are more likely to be shared by subsequent backup versions
    - i.e., the hot chunks
  - store those hot chunks together to enhance the physical locality of the new backup versions
  - the other chunks (which have a low probability of being shared by the new backup versions) are stored in containers as in traditional deduplication schemes
    - i.e., the cold chunks

- Trace analysis
  - Linux kernel, gcc, fslhomes, and macos traces
  - a heuristic experiment is conducted on `Destor`
  - version tag
    - indicates the most recent backup version that contains the chunk
    - chunks that are not contained in the new backup versions always keep their old version tags
  - Finding
    - the chunks that do not appear in the current backup version have a **low probability** of appearing in subsequent backup versions
    - reason: a new backup version is generated by updating the old one

- Key idea
  - a **reverse** online deduplication process
    - the fragmented chunks end up in the old backup versions rather than the new ones
  - only search the chunks that have a high probability of being deduplicated against the incoming chunks
    - i.e., the chunks in the previous backup version
  - classify the hot and cold chunks and store them separately (see the cache sketch below)
    - group the chunks of the new backup version closely together

- Fingerprint cache with double hash
  - **only search the fingerprint cache**, without further searching the full index table on disk
  - the fingerprint cache mainly contains the chunks with a high duplicate probability

![image-20210604111024132](../paper_figure/image-20210604111024132.png)

- T1: contains the metadata of the chunks in the previous version
- T2: used to hold the chunks of the current version
  - after deduplicating the current version, the chunks that do not appear in it are left in T1
    - i.e., the cold chunks
  - move the cold chunks from active containers to archival containers after deduplicating the current version, and update the recipes
- index overhead
  - the sizes of T1 and T2 are bounded and hardly ever full
    - the total size of the hash tables is limited to the `size of one (or two) backup versions`
  - achieves almost the same deduplication ratio as exact deduplication
    - `still has extra storage overhead`
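
To make the T1/T2 flow concrete, here is a minimal Python sketch of the double-hash fingerprint cache described above. It assumes an in-memory dict per table; the class and method names are hypothetical, and details such as eviction and persistence are not covered in this note.

```python
class DoubleHashCache:
    """Sketch of the double-hash fingerprint cache: T1 holds the chunk
    metadata of the previous backup version, T2 collects the current one."""

    def __init__(self, prev_version_index):
        self.t1 = dict(prev_version_index)  # fingerprint -> (container_id, offset)
        self.t2 = {}                        # filled while deduplicating

    def dedup(self, fingerprint, new_location):
        """Return the existing location for a duplicate chunk,
        or None if the chunk is unique and stored at new_location."""
        if fingerprint in self.t1:
            # hot chunk: shared with the previous version, promote it to T2
            self.t2[fingerprint] = self.t1.pop(fingerprint)
            return self.t2[fingerprint]
        if fingerprint in self.t2:          # duplicate within this version
            return self.t2[fingerprint]
        self.t2[fingerprint] = new_location # unique chunk, newly stored
        return None

    def finish_version(self):
        """Chunks left in T1 never appeared in the current version: they are
        cold and get migrated to archival containers; T2 becomes the next T1."""
        cold_chunks = self.t1
        self.t1, self.t2 = self.t2, {}
        return cold_chunks
```

Since only one (or two) versions' fingerprints are kept, the cache stays bounded by the size of a backup version, which matches the index-overhead claim above.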

- Chunk filter to separate chunks

  - separately stores the hot and cold chunks by changing the **write paths** of the chunks
  - container structure (a sketch follows below)
    - metadata section: container ID, data size, `hash table` (key: fingerprint; value: a pointer to the corresponding chunk)
    - data section: the real data of the contained chunks
  - the hot chunks are temporarily stored in **active** containers during the deduplication phase
    - need to compact the active containers to enhance the physical locality: directly merge sparse containers into the same container **without considering the order**
      - feasible since the cold chunks have already been moved to the archival containers
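
A hypothetical sketch of this container layout and the order-insensitive compaction; the Python representation and all names here are my own, since the note only fixes the metadata/data split.

```python
from dataclasses import dataclass, field

@dataclass
class Container:
    """Metadata section (container ID, data size, per-container hash table)
    plus a data section holding the raw chunks, per the description above."""
    container_id: int
    index: dict = field(default_factory=dict)   # fingerprint -> (offset, length)
    data: bytearray = field(default_factory=bytearray)

    def add_chunk(self, fingerprint, chunk):
        self.index[fingerprint] = (len(self.data), len(chunk))
        self.data += chunk

    def read_chunk(self, fingerprint):
        offset, length = self.index[fingerprint]
        return bytes(self.data[offset:offset + length])

def compact(sparse_containers, new_id):
    """Merge sparse active containers into one container, in no particular
    order, assuming the cold chunks were already moved to archival storage."""
    merged = Container(new_id)
    for container in sparse_containers:
        for fp in container.index:
            merged.add_chunk(fp, container.read_chunk(fp))
    return merged
```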

- Update recipes

  - all the recipes form a **chain** (see the restore sketch below)
    - need to read **multiple recipes** to find the concrete location of each chunk when restoring old backups
      - incurring `high latency`
  - need to update the recipes when moving the cold chunks
    - since the locations of those chunks change

![image-20210604120837514](../paper_figure/image-20210604120837514.png)

- Restore phase

  - the recipe contains three types of container IDs (CIDs)
    - positive CID: the chunk resides in an archival container
    - negative CID: identifies the backup version whose recipe records the chunk's location
    - 0: indicates an active container
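
This CID convention could be resolved as in the following sketch; the `recipes` map and the reader callbacks (`read_archival`, `read_active`) are hypothetical stand-ins for the on-disk recipe chain and container I/O.

```python
def resolve_chunk(fingerprint, cid, offset, recipes, read_archival, read_active):
    """Locate one chunk by following the CID convention above.
    `recipes` maps a backup-version number to that version's recipe,
    i.e., a dict of fingerprint -> (cid, offset)."""
    if cid > 0:        # positive CID: the chunk sits in an archival container
        return read_archival(cid, offset)
    if cid == 0:       # 0: the chunk is still in an active container
        return read_active(fingerprint)
    # negative CID: the location is recorded in an older backup version's
    # recipe, so follow the recipe chain one step down
    older_cid, older_offset = recipes[-cid][fingerprint]
    return resolve_chunk(fingerprint, older_cid, older_offset,
                         recipes, read_archival, read_active)
```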

- Garbage collection

  - HiDeStore stores the chunks belonging to different backup versions **separately**
    - without the need for expensive chunk detection

### Implementation and Evaluation

- Evaluation
  - restore baselines: capping, ALACC, and FBW
  - deduplication performance baselines: DDFS, Sparse Index, and SiLo
  - traces: kernel, gcc, fslhomes, and macos
  - Deduplication ratio
    - HiDeStore achieves almost the same deduplication ratio as DDFS
  - Deduplication throughput
    - Destor measures the number of **requests to the full index table**
      - `not absolute throughput` (it simulates the requests to disks)
    - HiDeStore only needs to search the fingerprint cache, without frequently accessing the full index table on disk
      - uses prefetching to load the index into the fingerprint cache in advance
  - Space consumption of the index table
    - HiDeStore does not need extra space to store the indexes, since it deduplicates one backup version only against its previous one
      - the fingerprint indexes of all chunks in the previous backup version are already stored in the recipe
  - Restore performance
    - metric: the mean data size restored per container read
  - HiDeStore overheads
    - updating the recipes
    - moving the chunks from active containers to archival containers
- Deletion
  - HiDeStore does not need any extra effort for GC

## 2. Strength (Contributions of the paper)

- proposes the idea of separately storing the hot chunks and the cold chunks of each backup

## 3. Weakness (Limitations of the paper)

- this work only considers the case of a single client's backups
  - the assumption is weak in my view; how about multiple clients' backups?

## 4. Some Insights (Future work)

- Recipe background
  - the data structure of the recipe is a chunk **list** (see the packing sketch below), and each item contains:
    - a chunk fingerprint (20 bytes)
    - the ID of the container (4 bytes)
    - the offset in the container (4 bytes)
  - the restore assembles the data stream in memory in a chunk-by-chunk manner
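
Each item is therefore 28 bytes. Below is a hypothetical packing with Python's `struct` module; the big-endian layout and the signed container ID are my assumptions (a signed CID leaves room for the negative CIDs used during restore).

```python
import struct

# 20-byte fingerprint + 4-byte container ID + 4-byte offset = 28 bytes
RECIPE_ITEM = struct.Struct(">20sii")   # signed ints, so the CID can be negative

def pack_item(fingerprint, container_id, offset):
    assert len(fingerprint) == 20       # e.g., a SHA-1 digest
    return RECIPE_ITEM.pack(fingerprint, container_id, offset)

def iter_recipe(recipe_blob):
    """Yield (fingerprint, cid, offset) items in stream order, so the
    restore can assemble the data stream chunk by chunk."""
    yield from RECIPE_ITEM.iter_unpack(recipe_blob)
```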

- Workload dependence
  - one can simply `trace the chunk distribution` among versions and determine whether to use the proposed scheme
    - the overhead is low, since it only needs to trace the metadata of the chunks
32 changes: 0 additions & 32 deletions StoragePaperNote/EnclaveDB-S&P'18.md

This file was deleted.

34 changes: 0 additions & 34 deletions StoragePaperNote/LongTermAnalysis-MSST'16.md

This file was deleted.

43 changes: 0 additions & 43 deletions StoragePaperNote/PRO-ORAM-RAID'19.md

This file was deleted.

