index c68bea1..957673b 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-# Storage System Paper List
+# Zuoru's Storage System Reading List
-In this repo, it records some paper related to storage system, including **Data Deduplication** (aka, dedup), **Erasure Coding** (aka, EC), general **Distributed Storage System** (aka, DSS) and other related topics (i.e., Network Security.....), updating from time to time~
+A reading list on storage systems, including data deduplication, erasure coding, general storage, and other related topics (e.g., security), updated from time to time.
## A. Data Deduplication
@@ -27,6 +27,8 @@ In this repo, it records some paper related to storage system, including **Data
11. *Inside Dropbox: Understanding Personal Cloud Storage Services*----IMC'12
11. *Identifying Trends in Enterprise Data Protection Systems*----USENIX ATC'15 ([link](https://www.usenix.org/system/files/conference/atc15/atc15-paper-amvrosladis.pdf))
11. *Deduplication Analyses of Multimedia System Images*----HotStorage'18 ([link](https://www.usenix.org/system/files/conference/hotedge18/hotedge18-papers-suess.pdf))
+14. *Improving Docker Registry Design based on Production Workload Analysis*----FAST'18 ([link](https://www.usenix.org/system/files/conference/fast18/fast18-anwar.pdf))
+14. *Insights for Data Reduction in Primary Storage: a Practical Analysis*----SYSTOR'12 ([link](https://dl.acm.org/doi/pdf/10.1145/2367589.2367606))
### Deduplication System Design
@@ -44,15 +46,14 @@ In this repo, it records some paper related to storage system, including **Data
12. *SmartDedup: Optimizing Deduplication for Resource-constrained Devices*----USENIX ATC'19 ([link](https://www.usenix.org/system/files/atc19-yang-qirui.pdf))
13. Can't We All Get Along? Redesigning Protection Storage for Modern Workloads----USENIX ATC'18 ([link](https://www.usenix.org/system/files/conference/atc18/atc18-allu.pdf)) [summary](https://yzr95924.github.io/paper_summary/Redesigning-ATC'18.html)
14. *Deduplication in SSDs: Model and quantitative analysis*----MSST'12 ([link](https://ieeexplore.ieee.org/document/6232379))
-16. *iDedup: Latency-aware, Inline Data Deduplication for Primary Storage*----FAST'12 ([link]( https://www.usenix.org/legacy/event/fast12/tech/full_papers/Srinivasan.pdf )) [summary](https://yzr95924.github.io/paper_summary/iDedup-FAST'12.html)
-17. *DupHunter: Flexible High-Performance Deduplication for Docker Registries*----USENIX ATC'20 ([link](https://www.usenix.org/system/files/atc20-zhao.pdf))
-18. *Design Tradeoffs for Data Deduplication Performance in Backup Workloads*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-fu.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupDesignTradeoff-FAST'15.html)
-19. *The Dilemma between Deduplication and Locality: Can Both be Achieved?*---FAST'21 ([link](https://www.usenix.org/system/files/fast21-zou.pdf)) [summary](https://yzr95924.github.io/paper_summary/MFDedup-FAST'21.html)
+15. *iDedup: Latency-aware, Inline Data Deduplication for Primary Storage*----FAST'12 ([link]( https://www.usenix.org/legacy/event/fast12/tech/full_papers/Srinivasan.pdf )) [summary](https://yzr95924.github.io/paper_summary/iDedup-FAST'12.html)
+16. *DupHunter: Flexible High-Performance Deduplication for Docker Registries*----USENIX ATC'20 ([link](https://www.usenix.org/system/files/atc20-zhao.pdf))
+17. *Design Tradeoffs for Data Deduplication Performance in Backup Workloads*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-fu.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupDesignTradeoff-FAST'15.html)
+18. *The Dilemma between Deduplication and Locality: Can Both be Achieved?*----FAST'21 ([link](https://www.usenix.org/system/files/fast21-zou.pdf)) [summary](https://yzr95924.github.io/paper_summary/MFDedup-FAST'21.html)
19. *SLIMSTORE: A Cloud-based Deduplication System for Multi-version Backups*----ICDE'21 ([link](http://www.cs.utah.edu/~lifeifei/papers/slimstore-icde21.pdf))
20. *Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling*----ToS'21 ([link](https://dl.acm.org/doi/full/10.1145/3459626))
-20. *Sorted Deduplication: How to Process Thousands of Backup Streams*----MSST'16 ([link](https://storageconference.us/2016/Papers/SortedDeduplication.pdf))
-20. *Deriving and Comparing Deduplication Techniques Using a Model-Based Classification*----EuroSys'15 ([link](https://dl.acm.org/doi/pdf/10.1145/2741948.2741952))
-20. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf))
+21. *Sorted Deduplication: How to Process Thousands of Backup Streams*----MSST'16 ([link](https://storageconference.us/2016/Papers/SortedDeduplication.pdf))
+22. *Deriving and Comparing Deduplication Techniques Using a Model-Based Classification*----EuroSys'15 ([link](https://dl.acm.org/doi/pdf/10.1145/2741948.2741952))
### Restore Performances
@@ -60,7 +61,7 @@ In this repo, it records some paper related to storage system, including **Data
2. *ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching*----FAST'18 ([link](https://www.usenix.org/system/files/conference/fast18/fast18-cao.pdf)) [summary](https://yzr95924.github.io/paper_summary/ALACC-FAST'18.html)
3. *Reducing Impact of Data Fragmentation Caused by In-line Deduplication*----SYSTOR'12 ([link](http://9livesdata.com/wp-content/uploads/2017/04/AsPresentedOnSYSTOR-1.pdf))
4. *Reducing Fragmentation Impact with Forward Knowledge in Backup Systems with Deduplication*----SYSTOR'15 ([link](https://dl.acm.org/doi/10.1145/2757667.2757678))
-5. *Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets*----MASCOTS'12
+5. *Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets*----MASCOTS'12 ([link](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6298180))
6. *Sliding Look-Back Window Assisted Data Chunk Rewriting for Improving Deduplication Restore Performance*----FAST'19 ([link](https://www.usenix.org/system/files/fast19-cao.pdf)) [summary](https://yzr95924.github.io/paper_summary/LookBackWindow-FAST'19.html)
7. *Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication*---FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final124.pdf)) [summary](https://yzr95924.github.io/paper_summary/ImproveRestore-FAST'13.html)
8. *Chunk Fragmentation Level: An Effective Indicator for Read Performance Degradation in Deduplication Storage*----HPCC'11
@@ -99,7 +100,7 @@ In this repo, it records some paper related to storage system, including **Data
29. *S2Dedup: SGX-enabled Secure Deduplication*----SYSTOR'21 ([link](https://dl.acm.org/doi/pdf/10.1145/3456727.3463773)) [summary](https://yzr95924.github.io/paper_summary/S2Dedup-SYSTOR'21.html)
30. *Secure Deduplication of General Computations*----USENIX ATC'15 ([link](https://www.usenix.org/system/files/conference/atc15/atc15-paper-tang.pdf))
31. *When Delta Sync Meets Message-Locked Encryption: a Feature-based Delta Sync Scheme for Encrypted Cloud Storage*----ICDCS'21 ([link](https://shenzr.github.io/publications/featuresync-icdcs21.pdf)) [summary](https://yzr95924.github.io/paper_summary/FeatureSync-ICDCS'21.html)
-31. *DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-bacs.pdf)) [summary](https://yzr95924.github.io/paper_summary/DeepSketch-FAST'22.html)
+31. *DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-bacs.pdf)) [summary](https://yzr95924.github.io/paper_summary/DUPEFS-FAST'22.html)
### Metadata Management
@@ -141,7 +142,10 @@ In this repo, it records some paper related to storage system, including **Data
13. Ddelta: A Deduplication-inspired Fast Delta Compression Approach----Performance'14 ([link](https://www.sciencedirect.com/science/article/pii/S0166531614000790))
14. *Odess: Speeding up Resemblance Detection for Redundancy Elimination by Fast Content-Defined Sampling*----ICDE'14 ([link](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9458911))
15. *Exploring the Potential of Fast Delta Encoding: Marching to a Higher Compression Ratio*----CLUSTER'20 ([link](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9229609)) [summary](https://yzr95924.github.io/paper_summary/Gdelta-CLUSTER'20.html)
-15. *DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-park.pdf))
+15. *DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-park.pdf)) [summary](https://yzr95924.github.io/paper_summary/DeepSketch-FAST'22.html)
+17. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
+17. *To Zip or not to Zip: Effective Resource Usage for Real-Time Compression*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final38.pdf)) [summary](https://yzr95924.github.io/paper_summary/CompressionEst-FAST'13.html)
+17. *Adaptively Compressing IoT Data on the Resource-constrained Edge*----HotEdge'20 ([link](https://www.usenix.org/system/files/hotedge20_paper_lu.pdf))
### Memory && Block-Layer Deduplication
@@ -151,6 +155,8 @@ In this repo, it records some paper related to storage system, including **Data
4. *OrderMergeDedup: Efficient, Failure-Consistent Deduplication on Flash*----FAST'16 ([link](https://www.usenix.org/system/files/conference/fast16/fast16-papers-chen-zhuan.pdf))
5. *CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives*----FAST'11 ([link](https://www.usenix.org/legacy/event/fast11/tech/full_papers/Chen.pdf)) [summary](https://yzr95924.github.io/paper_summary/CAFTL-FAST'11.html)
5. *Remap-SSD: Safely and Efficiently Exploiting SSD Address Remapping to Eliminate Duplicate Writes*----FAST'21 ([link](https://www.usenix.org/system/files/fast21-zhou.pdf))
+7. *Memory Deduplication for Serverless Computing with Medes*----EuroSys'22 ([link](https://dl.acm.org/doi/pdf/10.1145/3492321.3524272))
+8. *On the Effectiveness of Same-Domain Memory Deduplication*----EuroSec'22 ([link](https://download.vusec.net/papers/dedupestreturns_eurosec22.pdf))
### Data Chunking
1. *SS-CDC: A Two-stage Parallel Content-Defined Chunking for Deduplicating Backup Storage*----SYSTOR'19 ([link]( http://ranger.uta.edu/~sjiang/pubs/papers/ni19-ss-cdc.pdf )) [summary](https://yzr95924.github.io/paper_summary/SSCDC-SYSTOR'19.html)
@@ -171,10 +177,6 @@ In this repo, it records some paper related to storage system, including **Data
3. *Nitro: A Capacity-Optimized SSD Cache for Primary Storage*----USENIX ATC'14 ([link](https://www.usenix.org/system/files/conference/atc14/atc14-paper-li_cheng_nitro.pdf))
4. *Austere Flash Caching with Deduplication and Compression*----USENIX ATC'20 ([link](https://www.usenix.org/system/files/atc20-wang-qiuping.pdf))
-### Benchmark
-1. *SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-gracia-tinedo.pdf))
### Garbage Collection
1. *Memory Efficient Sanitization of a Deduplicated Storage System*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final100_0.pdf)) [summary](https://yzr95924.github.io/paper_summary/MemorySanitization-FAST'13.html)
@@ -194,8 +196,9 @@ In this repo, it records some paper related to storage system, including **Data
4. *Tradeoffs in Scalable Data Routing for Deduplication Clusters*----FAST'11 ([link](https://www.usenix.org/legacy/events/fast11/tech/full_papers/Dong.pdf)) [summary]( https://yzr95924.github.io/paper_summary/TradeoffDataRouting-FAST'11.html )
5. *Cluster and Single-Node Analysis of Long-Term Deduplication Patterns*----ToS'18 ([link](https://dl.acm.org/doi/pdf/10.1145/3183890)) [summary](https://yzr95924.github.io/paper_summary/ClusterSingle-ToS'18.html)
6. *Decentralized Deduplication in SAN Cluster File Systems*----USENIX ATC'09 ([link](https://static.usenix.org/events/usenix09/tech/full_papers/clements/clements.pdf))
-6. *GoSeed: Generating an Optimal Seeding Plan for Deduplicated Storage*----FAST'20 ([link](https://www.usenix.org/system/files/fast20-nachman.pdf))
-6. *The what, The from, and The to: The Migration Games in Deduplicated Systems*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-kisous.pdf))
+7. *HYDRAstor: A Scalable Secondary Storage*----FAST'09 ([link](http://9livesdata.com/wp-content/uploads/2017/04/HYDRAstor-A-Scalable-Secondary-Storage-1.pdf))
+8. *GoSeed: Generating an Optimal Seeding Plan for Deduplicated Storage*----FAST'20 ([link](https://www.usenix.org/system/files/fast20-nachman.pdf))
+9. *The what, The from, and The to: The Migration Games in Deduplicated Systems*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-kisous.pdf)) [summary](https://yzr95924.github.io/paper_summary/MigrationGame-FAST'22.html)
## B. Erasure Coding
@@ -279,6 +282,8 @@ In this repo, it records some paper related to storage system, including **Data
17. *Splinter: Practical Private Queries on Public Data*----NSDI'17 ([link](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-wang-frank.pdf))
18. *Quantifying Information Leakage of Deterministic Encryption*----CCSW'19 ([link]( http://users.cs.fiu.edu/~mjura011/documents/2019_CCSW_Quantifying_Information_Leakage_of_Deterministic_Encryption )) [summary](https://yzr95924.github.io/paper_summary/QuantifyingInformationLeakage-CCSW'19.html)
18. *Pancake: Frequency Smoothing for Encrypted Data Stores*----USENIX Security'20 ([link](https://www.usenix.org/system/files/sec20-grubbs.pdf))
+19. *Hiding the Lengths of Encrypted Message via Gaussian Padding*----CCS'21 ([link](https://dl.acm.org/doi/pdf/10.1145/3460120.3484590))
+20. *On Fingerprinting Attacks and Length-Hiding Encryption*----CT-RSA'22 ([link]())
### Secure Deletion
@@ -340,8 +345,9 @@ In this repo, it records some paper related to storage system, including **Data
11. *The Google File System*----SOSP'03 ([link](https://dl.acm.org/doi/pdf/10.1145/945445.945450))
12. *Bigtable: A Distributed Storage System for Structured Data*----OSDI'06 ([link](https://dl.acm.org/doi/pdf/10.1145/1365815.1365816))
13. *Duplicacy: A New Generation of Cloud Backup Tool Based on Lock-Free Deduplication*----ToCC'20 ([link](https://github.com/gilbertchen/duplicacy/blob/master/duplicacy_paper.pdf)) [summary](https://yzr95924.github.io/paper_summary/Duplicacy-ToCC'20.html)
+13. *RACS: A Case for Cloud Storage Diversity*----SoCC'10 ([link](http://pubs.0xff.co/papers/racs-socc.pdf))
-### New PAXOS
+### Consensus
1. *In Search of an Understandable Consensus Algorithm*----USENIX ATC'14 ([link](https://raft.github.io/raft.pdf))
@@ -350,6 +356,7 @@ In this repo, it records some paper related to storage system, including **Data
1. *TinyLFU: A Highly Efficient Cache Admission Policy*----ACM ToS'17 ([link](https://arxiv.org/pdf/1512.00727.pdf))
2. *It’s Time to Revisit LRU vs. FIFO*----HotStorage'20 ([link](https://www.usenix.org/system/files/hotstorage20_paper_eytan.pdf)) [summary](https://yzr95924.github.io/paper_summary/Cache-HotStorage'20.html) [trace](http://iotta.snia.org/traces/key-value)
3. *Unifying the Data Center Caching Layer — Feasible? Profitable?*----HotStorage'21 ([link](https://dl.acm.org/doi/pdf/10.1145/3465332.3470884))
+4. *Learning Cache Replacement with Cacheus*----FAST'21 ([link](https://www.usenix.org/system/files/fast21-rodriguez.pdf))
### Hash
@@ -357,6 +364,7 @@ In this repo, it records some paper related to storage system, including **Data
2. *An Analysis of Compare-by-Hash*----HotOS'03 ([link](http://www.cs.utah.edu/~shanth/stuff/research/dup_elim/hash_cmp.pdf))
3. *On-the-Fly Verification of Rateless Erasure Codes for Efficient Content Distribution*----S&P'04 ([link](https://pdos.csail.mit.edu/papers/otfvec/paper.pdf))
4. *Algorithmic Improvements for Fast Concurrent Cuckoo Hashing*----EuroSys'14 ([link](https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf))
+4. *Don’t Thrash: How to Cache your Hash on Flash*----HotStorage'11 ([link](https://www.usenix.org/legacy/events/hotstorage11/tech/final_files/Bender.pdf))
### Lock-free storage
1. *A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring*----IPDPS'10 ([link](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/ipdps10.pdf))
@@ -383,6 +391,9 @@ In this repo, it records some paper related to storage system, including **Data
1. *From blocks to rocks: a natural extension of zoned namespaces*----HotStorage'21 ([link](https://dl.acm.org/doi/pdf/10.1145/3465332.3470870))
1. *Don’t Be a Blockhead: Zoned Namespaces Make Work on Conventional SSDs Obsolete*----HotOS'21 ([link](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s07-stavrinos.pdf)) [summary](https://yzr95924.github.io/paper_summary/BlockHead-HotOS'21.html)
1. Zone Append: A New Way of Writing to Zoned Storage----Vault'20 ([link](https://www.usenix.org/system/files/vault20_slides_bjorling.pdf))
+1. *What Systems Researchers Need to Know about NAND Flash*----HotStorage'13 ([link](https://www.usenix.org/system/files/conference/hotstorage13/hotstorage13-desnoyers.pdf))
+1. *Caveat-Scriptor: Write Anywhere Shingled Disks*----HotStorage'15 ([link](https://www.usenix.org/system/files/conference/hotstorage15/hotstorage15-kadekodi.pdf))
+1. *Improving the Reliability of Next Generation SSDs using WOM-v Codes*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-jaffer.pdf))
### File system
@@ -393,8 +404,22 @@ In this repo, it records some paper related to storage system, including **Data
5. *EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices*----USENIX ATC'19 ([link](https://www.usenix.org/system/files/atc19-gao.pdf))
5. *F2FS: A New File System for Flash Storage*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-lee.pdf))
5. *How to Copy Files*----FAST'20 ([link](https://www.usenix.org/system/files/fast20-zhan.pdf))
+5. *BetrFS: A Compleat File System for Commodity SSDs*----EuroSys'22 ([link](https://dl.acm.org/doi/pdf/10.1145/3492321.3519571))
+5. *The Full Path to Full-Path Indexing*----FAST'18 ([link](https://www.usenix.org/system/files/conference/fast18/fast18-zhan.pdf))
+5. *BetrFS: A Right-Optimized Write-Optimized File System*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-jannen_william.pdf))
+11. *Filesystem Aging: It's more Usage than Fullness*----HotStorage'19 ([link](https://www.cs.unc.edu/~porter/pubs/hotstorage19-paper-conway.pdf))
+12. *File Systems Fated for Senescence? Nonsense, Says Science!*----FAST'17 ([link](https://www.usenix.org/system/files/conference/fast17/fast17-conway.pdf))
### Persistent Memories
1. *SLM-DB: Single-Level Key-Value Store with Persistent Memory*----FAST'19 ([link](https://www.usenix.org/system/files/fast19-kaiyrakhmet.pdf)) [summary](https://yzr95924.github.io/paper_summary/SLMDB-FAST'19.html)
2. *Redesigning LSMs for Nonvolatile Memory with NoveLSM*----USENIX ATC'18 ([link](https://www.usenix.org/system/files/conference/atc18/atc18-kannan.pdf)) [summary](https://yzr95924.github.io/paper_summary/NoveLSM-ATC'18.html)
+### Data Structure
+1. *An Introduction to Be-trees and Write-Optimization*----USENIX Login'15 ([link](https://www.usenix.org/system/files/login/articles/login_oct15_05_bender.pdf)) [code](https://github.com/oscarlab/Be-Tree)
+1. *Building Workload-Independent Storage with VT-Trees*----FAST'13 ([link](https://www.usenix.org/system/files/conference/fast13/fast13-final165_0.pdf))
+### Benchmark
+1. *SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks*----FAST'15 ([link](https://www.usenix.org/system/files/conference/fast15/fast15-paper-gracia-tinedo.pdf))
\ No newline at end of file
diff --git a/StoragePaperNote/Deduplication/Distributed-Dedup/MigrationGame-FAST'22.md b/StoragePaperNote/Deduplication/Distributed-Dedup/MigrationGame-FAST'22.md
new file mode 100644
index 0000000..aeabae6
--- /dev/null
+++ b/StoragePaperNote/Deduplication/Distributed-Dedup/MigrationGame-FAST'22.md
@@ -0,0 +1,108 @@
+typora-copy-images-to: ../paper_figure
+The what, The from, and The to: The Migration Games in Deduplicated Systems
+| Venue | Category |
+| :------------------------: | :------------------: |
+| FAST'22 | Distributed Deduplication |
+## 1. Summary
+### Motivation of this paper
+- motivation
+ - the high-level management aspects of large-scale systems (e.g. capacity planning, caching and cost of service) still need to be adapted to deduplication storage
+ - data migration: files are remapped between separate **deduplication domains**, or **volumes**
+ - volumes: a single server within a large-scale system, or an independent set of servers dedicated to a customer or dataset
+ - employ a separate fingerprint index in each physical server
+ - optimize several possibly conflicting objectives
+ - the physical size of the stored data (after migration)
+ - the load balancing between the system's volumes
+ - the network bandwidth generated by the migration
+- the main goal
+ - formulate the general migration problem for deduplicated systems as an optimization problem
+ - minimize the system's size
+ - ensuring that the storage load is evenly distributed between the system's volumes (**load balancing** consideration)
+ - the network traffic required for the migration does not exceed its allocation (**traffic** consideration)
+### Migration Games
+- problem statement
+ - minimizing migration traffic
+ - the amount of data that is transferred between volumes during migration
+ - load balancing
+ - trade-off between minimizing the total physical data size and maximizing load balancing
+ - extreme case: map all files to a single volume
+ - evenly distribute the capacity load between volumes
+ - use fairness metric: the ratio between the size of the smallest volume in the system and that of the largest volume (perfect: 1)
+ - traffic constraint, load balancing constraint
+ - traffic constraint: the maximum traffic allowed during migration
+ - load balancing constraint: a margin of the average volume size
+- Greedy (extend SketchVolume)
+ - iterates over all the files in each volume, and calculates the space-saving ratio from remapping a single file to each of the other volumes
+ - each phase is allocated an even portion of the traffic allocated for migration
+ - load-balancing step
+ - remap files from large volumes to small ones, until the volume sizes are within the margin defined for this phase
+ - capacity-reduction step
+ - use **remaining traffic** to reduce the system's size
+- ILP (extend GoSeed)
+ - all variables are boolean
+ - objective: maximize the sum of sizes of all blocks that are deleted minus all blocks that are copied
+ - acceleration methods
+ - fingerprint sampling: k leading zeroes, reducing the number of blocks in the problem
+ - solver timeout: halts the ILP solver's execution after a pre-determined runtime
+- Clustering
+ - main idea: files are similar if they share a large portion of their blocks
+ - create clusters of similar files and assign each cluster to a volume
+ - remapping those files that were assigned to a volume different from their original location
+ - hierarchical clustering
+ - in each iteration, merge the most similar pair of clusters into a new cluster
+ - file similarity
+ - use Jaccard index for shared blocks
+ - traffic and load-balancing consideration
+ - determine the maximal cluster size by estimating the system's size after migration
+ - sensitivity to sample
+ - rather than merging the pair of clusters with the smallest distance, we merge a **random** pair from the set of pairs with the smallest distances
+ - constructing the final migration plan
+ - for the same given system and migration constraints, execute the clustering process with different parameters, and use the plan with the best deletion as the final result
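
The clustering idea and the fairness metric noted above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the `max_blocks` cap (a stand-in for the maximal cluster size), the `slack` parameter (which controls how many near-best pairs are eligible for the random pick), and the choice to compare clusters by the Jaccard index of their merged block sets are all assumptions made for the example.

```python
import random
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two block-fingerprint sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fairness(volume_sizes: list) -> float:
    """Smallest volume size divided by the largest (1.0 = perfectly balanced)."""
    return min(volume_sizes) / max(volume_sizes)


def cluster_files(files: dict, num_volumes: int, max_blocks: int,
                  seed: int = 0, slack: float = 0.0) -> list:
    """Agglomerative clustering of files by shared blocks.

    files:      hypothetical mapping of file name -> set of block fingerprints
    max_blocks: assumed cap on the unique blocks a cluster may hold
    slack:      pairs within `slack` of the best similarity are eligible,
                and one of them is merged at random
    """
    rng = random.Random(seed)
    # Each cluster is a pair: (set of file names, set of unique blocks).
    clusters = [({name}, set(blocks)) for name, blocks in files.items()]
    while len(clusters) > num_volumes:
        candidates = []
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i][1] | clusters[j][1]) <= max_blocks:
                sim = jaccard(clusters[i][1], clusters[j][1])
                candidates.append((sim, i, j))
        if not candidates:
            break  # no merge fits under the size cap
        best = max(c[0] for c in candidates)
        near_best = [c for c in candidates if c[0] >= best - slack]
        _, i, j = rng.choice(near_best)  # random pick among the most similar pairs
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Each resulting cluster would then be assigned to a volume, and only the files whose cluster lands on a volume other than their original one need to be remapped. Running the function several times with different `seed` and `slack` values mirrors the note about executing the clustering with different parameters and keeping the best plan.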
+### Implementation and Evaluation
+- trace:
+ - MS, FSL, Linux (all of them are public)
+- evaluation
+ - basic comparison between algorithms
+ - the deletion percentage of the initial system's physical size
+ - balance score
+ - the total runtime
+ - sensitivity to problem parameters
+ - effect of sampling degree
+ - effect of load balancing and traffic constraints
+ - effect of randomization on Cluster
+ - effect of the number of volumes
+## 2. Strength (Contributions of the paper)
+- formulate a general migration problem with three approaches
+ - a greedy algorithm, an ILP-based approach, and hierarchical clustering
+## 3. Weakness (Limitations of the paper)
+- does not provide a system to apply its algorithm
+ - how to collect metadata for solving the optimization problem?
+- hard to follow as the data migration problem is not common yet
+ - only happens in very large-scale storage systems
+## 4. Some Insights (Future work)
+- related work
+ - SketchVolume-FAST'19
+ - a greedy algorithm
+ - GoSeed-FAST'20
+ - files are remapped into an initially **empty** target volume
+ - Rangoli-SYSTOR'13
+ - a greedy algorithm for space reclamation
+ - a set of files is deleted to reclaim some of the system's capacity
+- data migration in distributed deduplication systems
+ - if a subsystem becomes full while another subsystem has available capacity, migration is quicker and cheaper than adding capacity to the full subsystem
\ No newline at end of file
diff --git a/StoragePaperNote/Deduplication/Post-Dedup/CompressionEst-FAST'13.md b/StoragePaperNote/Deduplication/Post-Dedup/CompressionEst-FAST'13.md
new file mode 100644
index 0000000..765c3d2
--- /dev/null
+++ b/StoragePaperNote/Deduplication/Post-Dedup/CompressionEst-FAST'13.md
@@ -0,0 +1,94 @@
+typora-copy-images-to: ../paper_figure
+To Zip or Not to Zip: Effective Resource Usage for Real-Time Compression
+| Venue | Category |
+| :------------------------: | :------------------: |
+| FAST'13 | Compression |
+## 1. Summary
+### Motivation of this paper
+- motivation
+ - adding compression on the data path consumes **scarce CPU** and **memory** resources on the storage system
+ - real-time compression for block and file primary storage systems
+ - it is advisable to avoid compressing what we refer to as "incompressible" data
+ - standard LZ type compression algorithms incur higher performance overheads **when the data does not compress well**
+- main problem
+ - identifying **incompressible data** in an efficient manner, allowing systems to effectively utilize their limited resources
+ - a macro-scale compression estimation for the whole data set (**offline**)
+ - a micro-scale compressibility test for individual write operations (**online**)
+### Compression Estimation/Test
+- the macro-scale solution
+ - for an entire volume or file system of a storage system
+ - estimate the overall compression ratio with **accuracy guarantee**
+ - the general framework
+ - choose `m` random locations
+ - compute an average of the compression ratio of these locations
+ - location, contribution
+ - real life implementations of compression algorithms are subject to **locality limits** (can use a chunk to define the locality)
+ - don’t want to hold long back pointers
+ - memory management, need to flush their buffers
+ - define the contribution of a byte as **the compression ratio of its locality**
+- the micro-scale solution
+ - for a single write: 8KB, 16KB, 32KB, 128KB
+ - recommend to zip or not to zip (has to be much faster than actual compression)
+ - do not want to read the entire chunk, impossible to get guarantees
+ - the heuristics method
+ - collect **a set of basic indicators** about the chunk
+ - from random samples from the chunk rather than the whole chunk
+ - core-set size: the character set that makes up most of the data
+ - byte-entropy
+ - symbol-pairs distribution indicator (from random distribution)
+ - sample: at most 2KB of data per write buffer
+ - 16 consecutive bytes from up to 128 randomly chosen locations
+ - define several thresholds to test the indicators
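
Both tests can be sketched with the Python standard library. This is a rough illustration under stated assumptions rather than the paper's code: `estimate_ratio` and `sample_indicators` are hypothetical names, the 8 KB locality and the 90% coverage cutoff for the core set are assumed values, and the notes do not give the thresholds used to make the final decision, so none are encoded here.

```python
import math
import random
import zlib
from collections import Counter


def estimate_ratio(data: bytes, m: int = 64, locality: int = 8192, seed: int = 0) -> float:
    """Macro-scale sketch: average compressed/original ratio over m random locations."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(m):
        pos = rng.randrange(len(data))
        start = (pos // locality) * locality  # chunk ("locality") containing the byte
        chunk = data[start:start + locality]
        # The byte's contribution is the compression ratio of its locality.
        ratios.append(len(zlib.compress(chunk)) / len(chunk))
    return sum(ratios) / m


def sample_indicators(buf: bytes, pieces: int = 128, piece_len: int = 16, seed: int = 0):
    """Micro-scale sketch: byte entropy and core-set size from at most 2 KB of samples."""
    rng = random.Random(seed)
    last = max(len(buf) - piece_len, 0)
    # 128 pieces of 16 consecutive bytes each, taken at random offsets.
    sample = b"".join(
        buf[p:p + piece_len] for p in (rng.randint(0, last) for _ in range(pieces))
    )
    counts = Counter(sample)
    total = len(sample)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    # Core set: number of most frequent byte values needed to cover 90% of the
    # sample (the 90% cutoff is an assumption for this example).
    covered, core_set = 0, 0
    for _, c in counts.most_common():
        covered += c
        core_set += 1
        if covered >= 0.9 * total:
            break
    return entropy, core_set
```

A caller would compare the returned entropy and core-set size against tuned thresholds and skip compression when the sample looks close to random. Sampling a random byte and then compressing the chunk that contains it weights each locality by its size, which matches the per-byte contribution defined above.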
+### Implementation and Evaluation
+- implementation
+ - the macro-scale solution: written in C, multi-threaded
+- evaluation
+ - compression ratios vs. the number of samples
+ - running time vs. compression trade-off
+ - compared with the prefix method and the full compression
+## 2. Strength (Contributions of the paper)
+- the macro-scale test provides a quick and accurate estimate for which data sets to compress
+- the micro-scale test heuristics have proved critical in reducing resource consumption while maximizing compression for volumes containing a mix of compressible and incompressible data
+## 3. Weakness (Limitations of the paper)
+- does not generalize to other compression algorithms (e.g., LZ4, ZSTD)
+- how to define the thresholds that decide when to disable compression is not clear
+- evaluation is limited, no end-to-end system performance evaluation
+## 4. Some Insights (Future work)
+- a bit about compression techniques
+ - this paper focuses on **Zlib**, a popular compression engine (used by zip) that combines:
+ - **LZ compression**: pointers instead of repetitions
+ - **Huffman encoding**: use shorter encodings for frequent characters
+- existing solutions for estimating compression ratios
+ - by file extension
+ - not always accurate, not always available
+ - look at the actual data
+ - scan and compress everything
+ - look at a prefix (of a file or a chunk) and extrapolate to the rest
+ - no guarantees on the outcome
+ - good for compressible data - zero overhead
+- put all together
+ - when most data is compressible
+ - use prefix estimation
+ - when a significant percentage is incompressible
+ - use the heuristics method
+ - when most data is incompressible
+ - turn compression off
diff --git a/StoragePaperNote/Deduplication/Post-Dedup/DUPEFS-FAST'22.md b/StoragePaperNote/Deduplication/Post-Dedup/DUPEFS-FAST'22.md
new file mode 100644
index 0000000..11b6d33
--- /dev/null
+++ b/StoragePaperNote/Deduplication/Post-Dedup/DUPEFS-FAST'22.md
@@ -0,0 +1,25 @@
+typora-copy-images-to: ../paper_figure
+DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels
+| Venue | Category |
+| :------------------------: | :------------------: |
+| FAST'22 | secure deduplication |
+## 1. Summary
+### Motivation of this paper
+### Method Name
+### Implementation and Evaluation
+## 2. Strength (Contributions of the paper)
+## 3. Weakness (Limitations of the paper)
+## 4. Some Insights (Future work)
diff --git a/StoragePaperNote/Deduplication/Post-Dedup/DedupSearch-FAST'22.md b/StoragePaperNote/Deduplication/Post-Dedup/DedupSearch-FAST'22.md
new file mode 100644
index 0000000..47c743e
--- /dev/null
+++ b/StoragePaperNote/Deduplication/Post-Dedup/DedupSearch-FAST'22.md
@@ -0,0 +1,108 @@
+typora-copy-images-to: ../paper_figure
+DedupSearch: Two-Phase Deduplication Aware Keyword Search
+| Venue | Category |
+| :------------------------: | :------------------: |
+| FAST'22 | Post-Deduplication functionality |
+## 1. Summary
+### Motivation of this paper
+- motivation
+ - deduplicated storage creates multiple logical pointers, from different files and even different users, to each physical chunk
+ - this many-to-one relationship complicates many functionalities (e.g., caching, capacity planning, and support for QoS)
+ - present an opportunity to rethink those functionalities to be **deduplication-aware** and **more efficient**
+ - this paper aims to address the keyword search issue in deduplicated storage
+- the main goal
+ - focus on **offline search** of large, deduplicated storage systems for legal or analytics purposes
+- why other approaches cannot work
+ - their index size is proportional to **the logical size of the data** and consumes a large fraction of storage capacity
+ - not useful for binary strings or more complex keyword patterns (assume a delimiter set such as whitespace)
+ - their data structures must be continually updated as new data is received
+### DedupSearch
+- naive approaches
+ - opening each file and scanning its content for the specified keywords (**inefficient due to fragmentation and resulting random accesses**)
+ - a given chunk may be read repeatedly from storage due to deduplication
+- main idea
+ - begin with a **physical phase** that performs a **physical scan** of the storage system and scans each chunk of data for the keywords
+ - reading the data sequentially with large I/Os as well as reading each chunk of data only once
+ - record the **exact match** of the keyword, if it is found, as well as the prefixes or suffixes of the keyword (**partial matches**) found at chunk boundaries
+ - then, with a **logical phase** that performs a logical scan of the file system by traversing the chunk pointers that make up the files
+ - instead of reading the actual data chunks
+- challenges
+ - most deduplication systems do not maintain "back pointers" from chunks to the files that contain them (addressed by the logical phase)
+ - cannot associate keyword matches in a chunk with the corresponding file
+ - keywords might be split **between adjacent chunks** in a file (addressed by recording the partial matches)
+ - record the prefixes of the keyword that appear at the end of a chunk and suffixes that appear at the beginning of a chunk
+- string-matching algorithm
+ - use the Aho-Corasick string-match algorithm
+ - a trie-based algorithm for matching multiple strings in a single scan of the input
+ - construct a trie for the **reverse** dictionary to identify suffixes at the beginning of a chunk
+- match result database
+ - exact matches
+ - chunk-result record
+ - location-list record: only if the chunk contains more than one exact match
+ - long location-list record
+ - tiny substrings
+ - keywords that begin or end with frequent letters in the alphabet might result in the allocation of numerous chunk-result records
+ - tiny-result record
+ - only if the chunk contains neither an exact match nor a partial match
+ - database organization
+ - in-memory database: chunk-result index, location-list index
+ - disk-based hash table: the tiny-result index
+- generation of full search results
+ - for each file in the system, the **file recipe** is read, and the fingerprints of its chunks are used to lookup result records in the database
+ - collecting exact matches and combining partial matches for each fingerprint
+ - the logical phase can be parallelized to some extent
+ - separate backups or files can be processed in parallel
+### Implementation and Evaluation
+- implementation
+ - based on Destor: three restore threads
+ - use Destor to ingest all the data
+- evaluation
+ - traces
+ - Wikipedia backups, Linux kernel versions, and Web server VM backups
+ - Linux versions ordered by version, major version, minor version, and patch
+ - Wikipedia backups: archived twice a month since 2017, each snapshot is 1GiB and consists of a single archive file
+ - experiments
+ - DedupSearch performance
+ - effect of deduplication ratio, chunk size, dictionary size, and keywords in the dictionary
+ - DedupSearch data structures
+ - index sizes, database accesses
+ - DedupSearch overheads
+ - physical phase, logical phase
+## 2. Strength (Contributions of the paper)
+- very strong experiments
+- address the string search issue from the deduplication aspect (a new direction)
+ - no previous work targets this issue
+## 3. Weakness (Limitations of the paper)
+- the scenario is limited
+ - is more appropriate when queries are **infrequent** and moderate latency is acceptable such as in legal discovery
+- the main idea is very similar to DeduplicationGC-FAST'17, GoSeed-FAST'20
+ - process the **post-deduplication data** **sequentially** along with an analysis phase **on the file recipes**
+- lack the support of wildcards
+ - since its prefix/suffix approach incurs high overhead, it would be more challenging to support wildcards
+ - attempting to match the chunk content starting at all possible offsets within the keyword
+## 4. Some Insights (Future work)
+- the concept from **near-storage processing**
+ - the storage system supports certain computations to **reduce I/O traffic and memory usage**
+- the restore process assumed by the paper
+ - parse the file recipe
+ - looking up the chunk locations in the fingerprint index
+ - reading their containers
\ No newline at end of file
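
The two-phase idea in the DedupSearch notes above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it handles a single keyword with naive matching instead of the Aho-Corasick tries, keeps the result database as a plain in-memory dict, and assumes a keyword spans at most two adjacent chunks (every chunk holds at least `len(keyword) - 1` bytes). `ChunkResult`, `physical_phase`, and `logical_phase` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class ChunkResult(NamedTuple):
    exact: int               # full keyword matches inside the chunk
    prefix_lens: List[int]   # lengths p where the chunk ends with keyword[:p]
    suffix_lens: List[int]   # lengths s where the chunk starts with keyword[-s:]


def count_occurrences(data: bytes, keyword: bytes) -> int:
    """Count occurrences of keyword in data, overlapping ones included."""
    count, pos = 0, data.find(keyword)
    while pos != -1:
        count += 1
        pos = data.find(keyword, pos + 1)
    return count


def physical_phase(chunks: Dict[str, bytes],
                   keyword: bytes) -> Dict[str, ChunkResult]:
    """Scan every unique chunk exactly once, keyed by fingerprint."""
    k = len(keyword)
    results = {}
    for fp, data in chunks.items():
        results[fp] = ChunkResult(
            exact=count_occurrences(data, keyword),
            prefix_lens=[p for p in range(1, k) if data.endswith(keyword[:p])],
            suffix_lens=[s for s in range(1, k) if data.startswith(keyword[-s:])],
        )
    return results


def logical_phase(recipe: List[str], results: Dict[str, ChunkResult],
                  keyword: bytes) -> int:
    """Walk a file recipe and combine per-chunk results without reading data."""
    k = len(keyword)
    total = sum(results[fp].exact for fp in recipe)
    for left, right in zip(recipe, recipe[1:]):
        suffixes = set(results[right].suffix_lens)
        # A prefix of length p at the end of `left` completes a match when
        # `right` starts with the remaining k - p bytes of the keyword.
        total += sum(1 for p in results[left].prefix_lens if k - p in suffixes)
    return total


if __name__ == "__main__":
    store = {"a": b"xxhel", "b": b"lo hello", "c": b"zzz"}
    db = physical_phase(store, b"hello")
    print(logical_phase(["a", "b"], db, b"hello"))
```

Chunk `b` is scanned once even if many recipes reference it, which is the saving the physical phase is after; the logical phase touches only fingerprints and result records.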
diff --git a/StoragePaperNote/Deduplication/Secure-Dedup/DUPEFS-FAST'22.md b/StoragePaperNote/Deduplication/Secure-Dedup/DUPEFS-FAST'22.md
new file mode 100644
index 0000000..2a8771b
--- /dev/null
+++ b/StoragePaperNote/Deduplication/Secure-Dedup/DUPEFS-FAST'22.md
@@ -0,0 +1,96 @@
+typora-copy-images-to: ../paper_figure
+DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels
+| Venue | Category |
+| :------------------------: | :------------------: |
+| FAST'22 | Secure Deduplication |
+## 1. Summary
+### Motivation of this paper
+- motivation
+ - the implementation in today's advanced filesystems such as ZFS and Btrfs yields **timing side channels** that can reveal whether a chunk of data has been deduplicated
+ - explore the security risks existing in filesystem deduplication
+- main goal
+ - use carefully-crafted read/write operations that show exploitation is not only feasible, but that the signal can be amplified to mount **byte-granular attacks over the network**
+ - the main difference from previous secure deduplication work (memory deduplication):
+ - filesystem operations tend to be **asynchronous** for efficiency
+ - the granularity of filesystem deduplication is coarse (often as large as 128 KiB)
+- threat model
+ - an attacker who has direct or indirect (possibly remote) access to the same filesystem as a victim, and the filesystem performs inline deduplication
+ - local: using low-level system calls such as write(), read(), sync(), fsync()
+ - remote: interacts with the filesystem through a program that is not under the attacker's control
+ - e.g., a server program
+- challenges
+ - **performance**: the I/O operations are mostly asynchronous to hide the latency
+ - filesystem caching of data complicates the construction of a timing attack
+ - **reliability**: even if data is deduplicated, the metadata still needs to be written to disk, which interferes with the timing channel
+ - **capacity**: modern filesystems perform deduplication only across many blocks that are either temporally or spatially close to each other, clustered together in a deduplication record
+ - increase the entropy of any target secret deduplication record
+- data fingerprinting
+ - relies on the general timed read/write primitive to **reveal the presence of existing known but inaccessible** data
+- data exfiltration
+ - allow two colluding parties with direct/indirect access to the same system to communicate over a stealthy covert channel
+- data leak
+ - alignment probing
+ - stretch controlled data to fill the deduplication record minus one or more bytes of secret data
+ - 
+ - secret spraying
+ - generate a stronger signal over LAN/WAN
+ - spray candidate secret values over many deduplication records and issue many writes for the corresponding guesses
+- attack primitives
+ - 
+- mitigation
+ - using pseudo-same-behavior
+ - write path
+ - even for duplicated data, it still overwrites existing on-disk data
+ - slow down deduplicated write path
+ - read path
+ - introduce time jitter on the read path
+ - enforce pseudo-same-behavior for disk access patterns
+### Implementation and Evaluation
+- evaluation
+ - on FreeBSD for ZFS, and Linux for Btrfs
+ - attack effectiveness
+ - success rate
+ - attack time
+ - I/O
+ - data fingerprinting, data exfiltration, data leak
+## 2. Strength (Contributions of the paper)
+- analyze filesystem deduplication side channels and differentiate them from previous work (asynchronous disk accesses and large deduplication granularities)
+ - the attacker can mount byte-level data leak attacks across the network
+- propose some lightweight mitigations for such attacks
+## 3. Weakness (Limitations of the paper)
+- the remote attack is based on the browser implementation and this is not very general
+- the mitigation approach is practical but cannot completely eradicate the signal
+## 4. Some Insights (Future work)
+- SHA-256 vs. faster hashing
+ - it can also rely on faster hash functions that are not collision-resistant (such as **fletcher4**)
+ - Since hashing may incur collisions, some implementations include an additional step to verify that the data inside the matching deduplication records is identical
+- Deduplication granularity in filesystem deduplication
+ - filesystems perform deduplication at a granularity that is **a multiple of the data block size**
+ - a sufficient number of data blocks must be written to the filesystem to reach the deduplication record size
+- the timed write primitive
+ - the **timing difference** of handling unique data and duplicate data
+ - processing duplicate data is cheaper (only the metadata is updated)
+ - allows an attacker to learn whether certain data is present on the filesystem during a write operation
+- the timed read primitive
+ - duplicated data from different files end up in distinct physical memory pages
+ - as the page cache (in Linux) operates at the file level
+ - if a block of a file becomes deduplicated, its physical location on the disk **differs from its surrounding blocks**
\ No newline at end of file
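
The timed write primitive described in the DUPEFS notes above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyDedupStore` is a hypothetical in-memory stand-in, not ZFS or Btrfs behaviour, and it reports abstract cost units (`DATA_COST`, `METADATA_COST`, both made-up values) instead of measuring real time, so the duplicate-versus-unique difference is deterministic.

```python
import hashlib


class ToyDedupStore:
    """Toy inline-deduplicating store that reports a cost per write."""

    DATA_COST = 10      # hypothetical cost of writing a record's data
    METADATA_COST = 1   # hypothetical cost of updating metadata only

    def __init__(self) -> None:
        self._refcount = {}

    def write(self, record: bytes) -> int:
        """Store one deduplication record and return the cost of the write."""
        digest = hashlib.sha256(record).digest()
        if digest in self._refcount:
            # Duplicate: only the reference count (metadata) changes.
            self._refcount[digest] += 1
            return self.METADATA_COST
        # Unique: the data itself is written, plus the metadata.
        self._refcount[digest] = 1
        return self.DATA_COST + self.METADATA_COST


def is_present(store: ToyDedupStore, candidate: bytes) -> bool:
    """Guess whether `candidate` was already stored, from the write cost alone.

    The probe itself stores the candidate, so repeating it always says True.
    """
    return store.write(candidate) == ToyDedupStore.METADATA_COST


if __name__ == "__main__":
    store = ToyDedupStore()
    store.write(b"victim record")
    print(is_present(store, b"victim record"))
    print(is_present(store, b"some other record"))
```

In a real filesystem the cost gap is a noisy latency difference hidden behind asynchronous I/O and caching, which is why the notes list performance and reliability as challenges.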
diff --git a/StoragePaperNote/template.md b/StoragePaperNote/template.md
index 2765d67..a0745df 100644
--- a/StoragePaperNote/template.md
+++ b/StoragePaperNote/template.md
@@ -1,8 +1,8 @@
typora-copy-images-to: ../paper_figure
-Redesigning LSMs for Nonvolatile Memory with NoveLSM
+# To Zip or Not to Zip: Effective Resource Usage for Real-Time Compression
| Venue | Category |
| :------------------------: | :------------------: |
| ATC'18 | LSM+PM |
@@ -20,4 +20,3 @@ Redesigning LSMs for Nonvolatile Memory with NoveLSM
## 3. Weakness (Limitations of the paper)
## 4. Some Insights (Future work)
diff --git a/paper_figure/image-20220316134336877.png b/paper_figure/image-20220316134336877.png
new file mode 100644
index 0000000..fc46ef1
Binary files /dev/null and b/paper_figure/image-20220316134336877.png differ
diff --git a/paper_figure/image-20220316134407739.png b/paper_figure/image-20220316134407739.png
new file mode 100644
index 0000000..d3ab1ab
Binary files /dev/null and b/paper_figure/image-20220316134407739.png differ
diff --git a/paper_figure/image-20220526171832531.png b/paper_figure/image-20220526171832531.png
new file mode 100644
index 0000000..7dfb4c3
Binary files /dev/null and b/paper_figure/image-20220526171832531.png differ