diff --git a/README.md b/README.md
index 9dab2e6..bea343c 100644
--- a/README.md
+++ b/README.md
@@ -56,8 +56,6 @@ A reading list related to storage systems, including data deduplication, erasure
 22. *The Dilemma between Deduplication and Locality: Can Both be Achieved?*---FAST'21 ([link](https://www.usenix.org/system/files/fast21-zou.pdf)) [summary](https://yzr95924.github.io/paper_summary/MFDedup-FAST'21.html)
 23. *SLIMSTORE: A Cloud-based Deduplication System for Multi-version Backups*----ICDE'21 ([link](http://www.cs.utah.edu/~lifeifei/papers/slimstore-icde21.pdf))
 24. *Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling*----ACM TOS'21 ([link](https://dl.acm.org/doi/full/10.1145/3459626))
-25. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
-26. *Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index*----FAST'24 ([link](https://www.usenix.org/system/files/fast24-levi.pdf))
 
 ### Restore Performances
 
@@ -135,7 +133,7 @@ A reading list related to storage systems, including data deduplication, erasure
 1. *Data Domain Cloud Tier: Backup here, Backup there, Deduplicated Everywhere!*----USENIX ATC'19 ([link](https://www.usenix.org/system/files/atc19-duggal.pdf)) [summary]( https://yzr95924.github.io/paper_summary/CloudTier-ATC'19.html )
 2. *InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication*----FAST'23 ([link](https://www.usenix.org/system/files/fast23-kotlarska.pdf)) [summary](https://yzr95924.github.io/paper_summary/InftyDedup-FAST'23.html)
 
-### Post-Deduplication: Data Compression and Delta Compression
+### Post-Deduplication: Data Compression, Delta Compression, and Application
 1. *Redundancy Elimination Within Large Collections of Files*----USENIX ATC'04 ([link](https://www.usenix.org/legacy/publications/library/proceedings/usenix04/tech/general/full_papers/kulkarni/kulkarni.pdf))
 2. *The Design of a Similarity Based Deduplication System*----SYSTOR'09 ([link](https://dl.acm.org/doi/pdf/10.1145/1534530.1534539))
 3. *Delta Compressed and Deduplicated Storage Using Stream-Informed Locality*----HotStorage'12 ([link](https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final38_0.pdf)) [summary](https://yzr95924.github.io/paper_summary/deltaStore-HotStorage'12.html)
@@ -158,6 +156,8 @@ A reading list related to storage systems, including data deduplication, erasure
 20. *Donag: Generating Eficient Patches and Difs for Compressed Archives*----ACM TOS'22 ([link](https://dl.acm.org/doi/pdf/10.1145/3507919))
 21. *LoopDelta: Embedding Locality-aware Opportunistic Delta Compression in Inline Deduplication for Highly Efficient Data Reduction*----USENIX ATC'23 ([link](https://www.usenix.org/system/files/atc23-zhang-yucheng.pdf))
 22. *Palantir: Hierarchical Similarity Detection for Post-Deduplication Delta Compression*----ASPLOS'24 ([link](https://qiangsu97.github.io/files/asplos24spring-final6.pdf))
+23. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
+24. *Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index*----FAST'24 ([link](https://www.usenix.org/system/files/fast24-levi.pdf)) [summary](https://yzr95924.github.io/paper_summary/IDEA-FAST'24.html)
 
 ### Memory && Block-Layer Deduplication
 
@@ -517,7 +517,7 @@ A reading list related to storage systems, including data deduplication, erasure
 
 ### HPC Storage
 1. *GPFS: A Shared-Disk File System for Large Computing Clusters*----FAST'02 ([link](https://www.usenix.org/legacy/publications/library/proceedings/fast02/full_papers/schmuck/schmuck.pdf))
-2. *Efficient Object Storage Journaling in a Distributed Parallel File System*----FAST'10 ([link](Efficient Object Storage Journaling in a Distributed Parallel File System))
+2. *Efficient Object Storage Journaling in a Distributed Parallel File System*----FAST'10 ([link](https://www.usenix.org/legacy/events/fast10/tech/full_papers/oral.pdf))
 3. *Taking back control of HPC file systems with Robinhood Policy Engine*----arxiv'15 ([link](https://arxiv.org/abs/1505.01448))
 4. *Lustre Lockahead: Early Experience and Performance using Optimized Locking*----CUG'17 ([link](https://cug.org/proceedings/cug2017_proceedings/includes/files/pap141s2-file1.pdf))
 5. *LPCC: Hierarchical Persistent Client Caching for Lustre*----SC'19 ([link](https://dl.acm.org/doi/pdf/10.1145/3295500.3356139)) [slides](https://sc19.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap112s5.pdf)
diff --git a/paper_figure/image-20240319012231775.png b/paper_figure/image-20240319012231775.png
new file mode 100644
index 0000000..f856de4
Binary files /dev/null and b/paper_figure/image-20240319012231775.png differ
diff --git a/paper_figure/image-20240321001530743.png b/paper_figure/image-20240321001530743.png
new file mode 100644
index 0000000..a3589ba
Binary files /dev/null and b/paper_figure/image-20240321001530743.png differ
diff --git a/paper_figure/image-20240321002025742.png b/paper_figure/image-20240321002025742.png
new file mode 100644
index 0000000..486ede9
Binary files /dev/null and b/paper_figure/image-20240321002025742.png differ
diff --git a/paper_figure/image-20240321204347685.png b/paper_figure/image-20240321204347685.png
new file mode 100644
index 0000000..c552e48
Binary files /dev/null and b/paper_figure/image-20240321204347685.png differ
diff --git a/paper_figure/image-20240321210826877.png b/paper_figure/image-20240321210826877.png
new file mode 100644
index 0000000..a21ca1c
Binary files /dev/null and b/paper_figure/image-20240321210826877.png differ
diff --git a/storage_paper_note/IDEA-FAST'24.md b/storage_paper_note/IDEA-FAST'24.md
new file mode 100644
index 0000000..532673a
--- /dev/null
+++ b/storage_paper_note/IDEA-FAST'24.md
@@ -0,0 +1,195 @@
+---
+typora-copy-images-to: ../paper_figure
+---
+# Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index
+
+| Venue | Category |
+| :------------------------: | :------------------: |
+| FAST'24 | Deduplicated System Design, Post-Deduplication Management |
+[TOC]
+
+## 1. Summary
+### Motivation of this paper
+
+- motivation
+ - indexing deduplicated data might result in extreme inefficiencies
+ - index size
+ - proportional to the logical data size, **regardless of its deduplication ratio**
+ - each term must point to all the files containing it, **even if the files' content is almost identical**
+ - index creation overhead
+ - random and redundant accesses to the physical chunks
+ - **term indexing** is not supported by any deduplicating storage system
+ - focus on **textual data**
+ - VMware vSphere and Commvault only support file indexing
+ - identifies individual files within a backup based on metadata
+ - Dell-EMC Data Protection Search
+ - support full content indexing
+ - warn: processing the full content of a large number of files can be **time consuming**
+ - recommend performing targeted indexing on **specific backups and file types**
+- challenge
+ - two separate trends
+ - the growing need to process **cold data** (e.g., old backups)
+ - e.g., full-system scans, keyword searches --> deduplication-aware search
+ - the growing application of deduplication on primary storage of hot and warm data
+ - e.g., perform single-term searches for files within a deduplicated personal workstation
+ - indexing software on file-system level --> **unaware** of the underlying deduplication at the storage system
+ - index size
+ - increase --> increase the latency of lookups
+ - index time
+ - scan all files in the system --> random IOs, high read amplification
+ - split terms
+ - chunking process will likely split the incoming data into chunks (at **arbitrary position**)
+ - splitting words between adjacent chunks
+
+### IDEA
+
+- ![image-20240321002025742](./../paper_figure/image-20240321002025742.png)
+
+- key idea
+ - map terms to the unique physical chunks they appear in
+ - instead of the logical documents (disproportionately high)
+ - replace term-to-file mapping with
+ - term-to-chunk map
+ - chunk-to-file map (file ID)
+ - only need to modify chunking process in deduplication system
+ - **white-space aware** --> enforce chunk boundaries only between words
+- white-space aligned chunking
+ - content-defined chunking
+ - **continue scanning** the following characters until a white-space character is encountered
+ - fixed-size chunking
+ - **backward scanning** this chunk until a white-space character is encountered
+ - resulting chunks are always smaller than the fixed size --> can be stored in a single block
+ - can trim the block in memory to chunk boundary
+ - non-textual content
+ - applies only to the chunking of **textual content**
+ - identify textual content by the file extension of the incoming data
+ - .c, .h, and .htm
+ - add a Boolean field to the metadata of each chunk in the file recipe and container
+ - only process chunks marked as textual
+- term-to-chunk mapping
+ - number of documents in the index --> number of physical chunks
+ - might be higher than the number of logical files
+ - chunks are **read sequentially**, each chunk is processed only once
+ - processing chunks is easily parallelizable
+
+ - lookup
+ - return the fingerprints of the chunks this term appears in
+
+- chunk-to-file mapping
+ - two complementing maps
+ - chunk-to-file map
+ - chunk fingerprint --> file IDs
+ - file-to-path map
+ - file IDs --> file's full pathname
+ - created from the metadata in the file recipe
+
+- keyword/term lookup
+ - step-1: yield the fingerprints of all the relevant chunks
+ - step-2: a series of lookups in the chunk-to-file map
+ - retrieves the IDs of all files containing these chunks
+ - step-3: a lookup of each file ID in the file-to-path map
+ - returns the final list of file names
+- ranking results
+ - extend IDEA to support document ranking with the TF-IDF metric
+
+### Implementation and Evaluation
+
+- implementation
+ - LucenePlusPlus + Destor
+ - use Lucene term-to-doc map
+ - ![image-20240321204347685](./../paper_figure/image-20240321204347685.png)
+ - scan all file recipes from Destor
+ - create the list of files containing each chunk using a key-value store
+ - use an SSD for the data structures which are external to Lucene
+- experimental setup
+ - trace
+ - ![image-20240321210826877](./../paper_figure/image-20240321210826877.png)
+
+ - hardware
+ - maps of all index alternatives were stored on a separate HDD
+ - chunk-to-file and file-to-path maps of IDEA were stored on an SSD
+
+- evaluation
+ - baseline
+ - traditional deduplication-oblivious indexing (Naive)
+
+ - indexing time
+ - the reduction is proportional to the **deduplication ratio**
+ - recipe-processing time is negligible compared to the chunk-processing time
+
+ - indexing time of IDEA is shorter than that of Naive by 49% to 76%
+
+ - index size
+ - Naive must record more files for all the terms included in them
+ - IDEA additional information is recorded per chunk, not per term
+
+ - lookup times
+ - IDEA is faster than Naive by up to 82%
+ - smaller size of its term-to-doc map
+ - incur shorter lookup latency
+
+ - IDEA overhead
+ - IDEA has no advantage when compared to deduplication-oblivious indexing
+ - additional layer of indirection incurs **non-negligible overheads**, which are masked where the deduplication ratio is sufficiently high
+
+
+## 2. Strength (Contributions of the paper)
+
+- first design of a deduplication-aware term index
+- implementation of IDEA on Lucene
+ - open-source single-node inverted index used by Elasticsearch
+- extensive evaluation
+
+## 3. Weakness (Limitations of the paper)
+
+- trace is not very large
+- files containing compressed text (.pdf, .docx)
+ - their textual content can only be processed after the file is opened by a suitable application or converted by a dedicated tool
+ - individual chunks cannot be processed during offline index creation
+
+## 4. Some Insights (Future work)
+
+- deduplication scenarios
+ - backup and archival systems
+ - log-structured manner: chunk --> containers
+ - content-defined chunking
+ - primary (non-backup) storage system and appliances
+ - support direct access to individual chunks
+ - fixed-sized chunking
+ - align the deduplicated chunks with the storage interface
+- deduplication data management
+ - implicit sharing of content between files transforms logically-sequential data accesses to random IOs in the underlying physical media and complicates the following:
+ - GC
+ - load balancing between volumes
+ - caching
+ - charge-back
+- term indexing: **term-to-file** indexing (map)
+ - ![image-20240321001530743](./../paper_figure/image-20240321001530743.png)
+ - return the files containing **a keyword** or **term**
+ - search engines, data analytics
+ - searched data might be deduplicated
+ - e.g. Elasticsearch
+ - built on top of the single-node Apache Lucene
+ - based on a hierarchy of skip-lists
+ - other variations
+ - Amazon OpenSearch, IBM Watson
+ - keyword: any searchable strings (natural language words)
+ - query
+ - the list of files containing this keyword
+ - optional: byte offsets in which the term appears
+ - index creation
+ - collect the documents
+ - identify the terms within each document
+ - normalize the terms
+ - create the list of documents, and optionally offsets, containing each term
+ - result ranking
+ - using a **scoring formula** on each result
+ - TF-IDF
+ - ![image-20240319012231775](./../paper_figure/image-20240319012231775.png)
+- deduplication basics
+ - file recipe
+ - a list of chunks' fingerprints, their sizes
+ - restore: locate the chunk by searching in the fingerprint map or cache of its entries
+ - pack the **compressed data** into containers
+- standard storage functionality
+ - can be made more efficient by taking advantage of deduplicated state
diff --git a/storage_paper_note/deduplication/secure_dedup/DUPEFS-FAST'22.md b/storage_paper_note/deduplication/secure_dedup/DUPEFS-FAST'22.md
index 2a8771b..b932811 100644
--- a/storage_paper_note/deduplication/secure_dedup/DUPEFS-FAST'22.md
+++ b/storage_paper_note/deduplication/secure_dedup/DUPEFS-FAST'22.md
@@ -3,8 +3,8 @@ typora-copy-images-to: ../paper_figure
 ---
 DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels
 ------------------------------------------
-| Venue | Category |
-| :------------------------: | :------------------: |
+| Venue | Category |
+| :-----: | :------------------: |
 | FAST'22 | Secure Deduplication |
 [TOC]
diff --git a/storage_paper_note/template.md b/storage_paper_note/template.md
index 1a2857f..93e806b 100644
--- a/storage_paper_note/template.md
+++ b/storage_paper_note/template.md
@@ -1,11 +1,11 @@
 ---
 typora-copy-images-to: ../paper_figure
 ---
-# Light-Dedup: A Light-weight Inline Deduplication Framework for Non-Volatile Memory File Systems
+# Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index
 
 | Venue | Category |
 | :------------------------: | :------------------: |
-| ATC'18 | LSM+PM |
+| FAST'24 | Deduplicated System Design, post-deduplication application |
 [TOC]
 
 ## 1. Summary