update: IDEA-FAST'24

yzr95924 · Mar 21, 2024 · 5466b0e
1 parent d77c7b8
Showing 9 changed files with 203 additions and 8 deletions.
diff --git a/README.md b/README.md
@@ -56,8 +56,6 @@ A reading list related to storage systems, including data deduplication, erasure
 22. *The Dilemma between Deduplication and Locality: Can Both be Achieved?*---FAST'21 ([link](https://www.usenix.org/system/files/fast21-zou.pdf)) [summary](https://yzr95924.github.io/paper_summary/MFDedup-FAST'21.html)
 23. *SLIMSTORE: A Cloud-based Deduplication System for Multi-version Backups*----ICDE'21 ([link](http://www.cs.utah.edu/~lifeifei/papers/slimstore-icde21.pdf))
 24. *Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling*----ACM TOS'21 ([link](https://dl.acm.org/doi/full/10.1145/3459626))
-25. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
-26. *Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index*----FAST'24 ([link](https://www.usenix.org/system/files/fast24-levi.pdf))
 
 ### Restore Performances
 
@@ -135,7 +133,7 @@ A reading list related to storage systems, including data deduplication, erasure
 1. *Data Domain Cloud Tier: Backup here, Backup there, Deduplicated Everywhere!*----USENIX ATC'19 ([link](https://www.usenix.org/system/files/atc19-duggal.pdf)) [summary]( https://yzr95924.github.io/paper_summary/CloudTier-ATC'19.html )
 2. *InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication*----FAST'23 ([link](https://www.usenix.org/system/files/fast23-kotlarska.pdf)) [summary](https://yzr95924.github.io/paper_summary/InftyDedup-FAST'23.html)
 
-### Post-Deduplication: Data Compression and Delta Compression
+### Post-Deduplication: Data Compression, Delta Compression, and Application
 1. *Redundancy Elimination Within Large Collections of Files*----USENIX ATC'04 ([link](https://www.usenix.org/legacy/publications/library/proceedings/usenix04/tech/general/full_papers/kulkarni/kulkarni.pdf))
 2. *The Design of a Similarity Based Deduplication System*----SYSTOR'09 ([link](https://dl.acm.org/doi/pdf/10.1145/1534530.1534539))
 3. *Delta Compressed and Deduplicated Storage Using Stream-Informed Locality*----HotStorage'12 ([link](https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final38_0.pdf)) [summary](https://yzr95924.github.io/paper_summary/deltaStore-HotStorage'12.html)
@@ -158,6 +156,8 @@ A reading list related to storage systems, including data deduplication, erasure
 20. *Donag: Generating Efficient Patches and Diffs for Compressed Archives*----ACM TOS'22 ([link](https://dl.acm.org/doi/pdf/10.1145/3507919))
 21. *LoopDelta: Embedding Locality-aware Opportunistic Delta Compression in Inline Deduplication for Highly Efficient Data Reduction*----USENIX ATC'23 ([link](https://www.usenix.org/system/files/atc23-zhang-yucheng.pdf))
 22. *Palantir: Hierarchical Similarity Detection for Post-Deduplication Delta Compression*----ASPLOS'24 ([link](https://qiangsu97.github.io/files/asplos24spring-final6.pdf))
+23. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
+24. *Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index*----FAST'24 ([link](https://www.usenix.org/system/files/fast24-levi.pdf)) [summary](https://yzr95924.github.io/paper_summary/IDEA-FAST'24.html)
 
 ### Memory && Block-Layer Deduplication
 
@@ -517,7 +517,7 @@ A reading list related to storage systems, including data deduplication, erasure
 ### HPC Storage
 
 1. *GPFS: A Shared-Disk File System for Large Computing Clusters*----FAST'02 ([link](https://www.usenix.org/legacy/publications/library/proceedings/fast02/full_papers/schmuck/schmuck.pdf))
-2. *Efficient Object Storage Journaling in a Distributed Parallel File System*----FAST'10 ([link](Efficient Object Storage Journaling in a Distributed Parallel File System))
+2. *Efficient Object Storage Journaling in a Distributed Parallel File System*----FAST'10 ([link](https://www.usenix.org/legacy/events/fast10/tech/full_papers/oral.pdf))
 3. *Taking back control of HPC file systems with Robinhood Policy Engine*----arxiv'15 ([link](https://arxiv.org/abs/1505.01448))
 4. *Lustre Lockahead: Early Experience and Performance using Optimized Locking*----CUG'17 ([link](https://cug.org/proceedings/cug2017_proceedings/includes/files/pap141s2-file1.pdf))
 5. *LPCC: Hierarchical Persistent  Client Caching for Lustre*----SC'19 ([link](https://dl.acm.org/doi/pdf/10.1145/3295500.3356139)) [slides](https://sc19.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap112s5.pdf)

diff --git a/paper_figure/image-20240319012231775.png b/paper_figure/image-20240319012231775.png
diff --git a/paper_figure/image-20240321001530743.png b/paper_figure/image-20240321001530743.png
diff --git a/paper_figure/image-20240321002025742.png b/paper_figure/image-20240321002025742.png
diff --git a/paper_figure/image-20240321204347685.png b/paper_figure/image-20240321204347685.png
diff --git a/paper_figure/image-20240321210826877.png b/paper_figure/image-20240321210826877.png
diff --git a/storage_paper_note/IDEA-FAST'24.md b/storage_paper_note/IDEA-FAST'24.md
@@ -0,0 +1,195 @@
+---
+typora-copy-images-to: ../paper_figure
+---
+# Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index
+
+|           Venue            |       Category       |
+| :------------------------: | :------------------: |
+| FAST'24 | Deduplicated System Design, Post-Deduplication Management |
+[TOC]
+
+## 1. Summary
+### Motivation of this paper
+
+- motivation
+  - indexing deduplicated data might result in extreme inefficiencies
+    - index size
+      - proportional to the logical data size, **regardless of its deduplication ratio**
+        - each term must point to all the files containing it, <u>**even if the files' content is almost identical**</u>
+    - index creation overhead
+      - random and redundant accesses to the physical chunks
+    - **term indexing** is not supported by any deduplicating storage system
+      - focus on **textual data**
+      - VMware vSphere and Commvault only support file indexing
+        - identifies individual files within a backup based on metadata
+      - Dell-EMC Data Protection Search
+        - support full content indexing
+          - warns that processing the full content of a large number of files can be **time-consuming**
+            - recommends performing targeted indexing on **specific backups and file types**
+- challenge
+  - two separate trends
+    - the growing need to process **cold data** (e.g., old backups)
+      - e.g., full-system scans, keyword searches --> deduplication-aware search
+    - the growing application of deduplication on primary storage of hot and warm data
+      - e.g., perform single-term searches for files within a deduplicated personal workstation
+  - indexing software on file-system level --> **unaware** of the underlying deduplication at the storage system
+    - index size
+      - larger index --> higher lookup latency
+    - index time
+      - scan all files in the system --> random IOs, high read amplification
+    - split terms
+      - the chunking process will likely split the incoming data into chunks (at **arbitrary positions**)
+        - splitting words between adjacent chunks
+
+### IDEA
+
+- ![image-20240321002025742](./../paper_figure/image-20240321002025742.png)
+
+- key idea
+  - map terms to the unique physical chunks they appear in
+    - instead of the logical documents (whose number is disproportionately high)
+    - replace term-to-file mapping with
+      - term-to-chunk map
+      - chunk-to-file map (file ID)
+  - only needs to modify the chunking process in the deduplication system
+    - **white-space aware** --> enforce chunk boundaries only between words
+- white-space aligned chunking
+  - content-defined chunking
+    - **continue scanning** the following characters until a white-space character is encountered
+  - fixed-size chunking
+    - **scan backward** through the chunk until a white-space character is encountered
+      - resulting chunks are always smaller than the fixed size --> can be stored in a single block
+    - can trim the block in memory to chunk boundary
+  - non-textual content
+    - white-space alignment applies only to the chunking of **textual content**
+    - identify textual content by the file extension of the incoming data
+      - .c, .h, and .htm
+    - add a Boolean field to the metadata of each chunk in the file recipe and container
+      - only process chunks marked as textual
+- term-to-chunk mapping
+  - number of documents in the index --> number of physical chunks
+    - might be higher than the number of logical files
+    - chunks are **read sequentially**, each chunk is processed only once
+      - processing chunks is easily parallelizable
+
+  - lookup
+    - return the fingerprints of the chunks in which this term appears
+
+- chunk-to-file mapping
+  - two complementary maps
+    - chunk-to-file map
+      - chunk fingerprint --> file IDs
+    - file-to-path map
+      - file IDs --> file's full pathname
+  - created from the metadata in the file recipe
+
+- keyword/term lookup
+  - step-1: yield the fingerprints of all the relevant chunks
+  - step-2: a series of lookups in the chunk-to-file map
+    - retrieves the IDs of all files containing these chunks
+  - step-3: a lookup of each file ID in the file-to-path map
+    - returns the final list of file names
+- ranking results
+  - extend IDEA to support document ranking with the TF-IDF metric
+
+### Implementation and Evaluation
+
+- implementation
+  - LucenePlusPlus + Destor
+    - use Lucene term-to-doc map
+    - ![image-20240321204347685](./../paper_figure/image-20240321204347685.png)
+    - scan all file recipes from Destor
+      - create the list of files containing each chunk using a key-value store
+    - use an SSD for the data structures which are external to Lucene
+- experimental setup
+  - trace
+    - ![image-20240321210826877](./../paper_figure/image-20240321210826877.png)
+
+  - hardware
+    - maps of all index alternatives were stored on a separate HDD
+    - chunk-to-file and file-to-path maps of IDEA were stored on an SSD
+
+- evaluation
+  - baseline
+    - traditional deduplication-oblivious indexing (Naive)
+
+  - indexing time
+    - the reduction is proportional to the **deduplication ratio** 
+      - recipe-processing time is negligible compared to the chunk-processing time
+
+    - indexing time of IDEA is shorter than that of Naive by 49% to 76%
+
+  - index size
+    - Naive must record more files for all the terms included in them
+    - IDEA's additional information is recorded per chunk, not per term
+
+  - lookup times
+    - IDEA is faster than Naive by up to 82%
+    - smaller size of its term-to-doc map
+      - incurs shorter lookup latency
+
+  - IDEA overhead
+    - IDEA has no advantage over deduplication-oblivious indexing when the deduplication ratio is low
+      - additional layer of indirection incurs **non-negligible overheads**, which are masked <u>where the deduplication ratio is sufficiently high</u>
+
+
+## 2. Strength (Contributions of the paper)
+
+- first design of a deduplication-aware term index
+- implementation of IDEA on Lucene
+  - open-source single-node inverted index used by Elasticsearch
+- extensive evaluation
+
+## 3. Weakness (Limitations of the paper)
+
+- trace is not very large
+- files containing compressed text (.pdf, .docx)
+  - their textual content can only be processed after the file is opened by a suitable application or converted by a dedicated tool
+  - individual chunks cannot be processed during offline index creation
+
+## 4. Some Insights (Future work)
+
+- deduplication scenarios
+  - backup and archival systems
+    - log-structured manner: chunk --> containers
+    - content-defined chunking
+  - primary (non-backup) storage system and appliances
+    - support direct access to <u>individual chunks</u>
+    - fixed-sized chunking
+      - align the deduplicated chunks with the storage interface
+- deduplication data management
+  - implicit sharing of content between files transforms logically-sequential data accesses into random IOs on the underlying physical media, which complicates the following:
+    - GC
+    - load balancing between volumes
+    - caching
+    - charge-back
+- term indexing: **term-to-file** indexing (map)
+  - ![image-20240321001530743](./../paper_figure/image-20240321001530743.png)
+  - return the files containing **a keyword** or **term**
+    - search engines, data analytics
+    - searched data might be deduplicated
+    - e.g. Elasticsearch
+      - built on top of the single-node Apache Lucene
+        - based on a hierarchy of skip-lists
+      - other variations
+        - Amazon OpenSearch, IBM Watson
+  - keyword: any searchable string (e.g., natural-language words)
+  - query
+    - the list of files containing this keyword
+    - optional: byte offsets at which the term appears
+  - index creation
+    - collect the documents
+    - identify the terms within each document
+    - normalize the terms
+    - create the list of documents, and optionally offsets, containing each term
+  - result ranking
+    - using a **scoring formula** on each result
+    - TF-IDF
+      - ![image-20240319012231775](./../paper_figure/image-20240319012231775.png)
+- deduplication basics
+  - file recipe
+    - a list of the chunks' fingerprints and their sizes
+    - restore: locate the chunk by searching in the fingerprint map or cache of its entries
+  - pack the **compressed data** into containers
+- standard storage functionality
+  - can be made more efficient by taking advantage of deduplicated state
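
The lookup path summarized in the IDEA note above (white-space aligned fixed-size chunking, then the term-to-chunk, chunk-to-file, and file-to-path maps) can be illustrated with a small in-memory sketch. It is not the paper's Lucene/Destor implementation. The names `whitespace_aligned_chunks`, `fingerprint`, and `IdeaIndexSketch` are hypothetical, SHA-1 is only a stand-in fingerprint function, plain dictionaries replace Lucene and the key-value store, the maps are filled at ingest time rather than by an offline scan of containers and recipes, and the hard cut used when a window contains no white-space is an assumption.

```python
import hashlib
from collections import defaultdict


def whitespace_aligned_chunks(text: str, max_size: int) -> list[str]:
    """Fixed-size chunking with a backward scan to a white-space boundary."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_size, len(text))
        if end < len(text):
            # Scan backward until the cut no longer falls inside a word.
            cut = end
            while cut > start and not (text[cut - 1].isspace() or text[cut].isspace()):
                cut -= 1
            if cut > start:
                end = cut
            # Assumption: if the window has no white-space, cut at max_size.
        chunks.append(text[start:end])
        start = end
    return chunks


def fingerprint(chunk: str) -> str:
    # Assumption: SHA-1 stands in for the system's fingerprint function.
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()


class IdeaIndexSketch:
    """Hypothetical in-memory model of the three maps described in the note."""

    def __init__(self, max_chunk_size: int = 4096):
        self.max_chunk_size = max_chunk_size
        self.term_to_chunk = defaultdict(set)  # term -> chunk fingerprints
        self.chunk_to_file = defaultdict(set)  # chunk fingerprint -> file IDs
        self.file_to_path = {}                 # file ID -> full pathname

    def add_file(self, path: str, text: str) -> int:
        file_id = len(self.file_to_path)
        self.file_to_path[file_id] = path
        for chunk in whitespace_aligned_chunks(text, self.max_chunk_size):
            fp = fingerprint(chunk)
            if fp not in self.chunk_to_file:
                # Unique physical chunk: its terms are indexed only once.
                for term in chunk.lower().split():
                    self.term_to_chunk[term].add(fp)
            self.chunk_to_file[fp].add(file_id)
        return file_id

    def lookup(self, term: str) -> list[str]:
        # Step 1: term -> fingerprints of the chunks containing it.
        fps = self.term_to_chunk.get(term.lower(), set())
        # Step 2: fingerprints -> IDs of all files containing those chunks.
        file_ids = set()
        for fp in fps:
            file_ids |= self.chunk_to_file[fp]
        # Step 3: file IDs -> path names.
        return sorted(self.file_to_path[i] for i in file_ids)


idx = IdeaIndexSketch(max_chunk_size=8)
idx.add_file("/a.txt", "alpha beta gamma delta")
idx.add_file("/b.txt", "alpha beta gamma omega")
print(idx.lookup("gamma"))  # ['/a.txt', '/b.txt']
print(idx.lookup("omega"))  # ['/b.txt']
```

In this toy run the two files share three of their four chunks, so the term-to-chunk map holds one entry per unique chunk rather than one per file, which is the size reduction the note attributes to IDEA.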
diff --git a/storage_paper_note/deduplication/secure_dedup/DUPEFS-FAST'22.md b/storage_paper_note/deduplication/secure_dedup/DUPEFS-FAST'22.md
@@ -3,8 +3,8 @@ typora-copy-images-to: ../paper_figure
 ---
 DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels
 ------------------------------------------
-|           Venue            |       Category       |
-| :------------------------: | :------------------: |
+|  Venue  |       Category       |
+| :-----: | :------------------: |
 | FAST'22 | Secure Deduplication |
 [TOC]
 

diff --git a/storage_paper_note/template.md b/storage_paper_note/template.md
@@ -1,11 +1,11 @@
 ---
 typora-copy-images-to: ../paper_figure
 ---
-# Light-Dedup: A Light-weight Inline Deduplication Framework for Non-Volatile Memory File Systems
+# Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index
 
 |           Venue            |       Category       |
 | :------------------------: | :------------------: |
-| ATC'18 | LSM+PM |
+| FAST'24 | Deduplicated System Design, post-deduplication application |
 [TOC]
 
 ## 1. Summary