Commit

update: IDEA-FAST'24
yzr95924 committed Mar 21, 2024
1 parent d77c7b8 commit 5466b0e
Showing 9 changed files with 203 additions and 8 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -56,8 +56,6 @@ A reading list related to storage systems, including data deduplication, erasure
22. *The Dilemma between Deduplication and Locality: Can Both be Achieved?*---FAST'21 ([link](https://www.usenix.org/system/files/fast21-zou.pdf)) [summary](https://yzr95924.github.io/paper_summary/MFDedup-FAST'21.html)
23. *SLIMSTORE: A Cloud-based Deduplication System for Multi-version Backups*----ICDE'21 ([link](http://www.cs.utah.edu/~lifeifei/papers/slimstore-icde21.pdf))
24. *Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling*----ACM TOS'21 ([link](https://dl.acm.org/doi/full/10.1145/3459626))
25. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
26. *Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index*----FAST'24 ([link](https://www.usenix.org/system/files/fast24-levi.pdf))

### Restore Performances

@@ -135,7 +133,7 @@ A reading list related to storage systems, including data deduplication, erasure
1. *Data Domain Cloud Tier: Backup here, Backup there, Deduplicated Everywhere!*----USENIX ATC'19 ([link](https://www.usenix.org/system/files/atc19-duggal.pdf)) [summary]( https://yzr95924.github.io/paper_summary/CloudTier-ATC'19.html )
2. *InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication*----FAST'23 ([link](https://www.usenix.org/system/files/fast23-kotlarska.pdf)) [summary](https://yzr95924.github.io/paper_summary/InftyDedup-FAST'23.html)

### Post-Deduplication: Data Compression and Delta Compression
### Post-Deduplication: Data Compression, Delta Compression, and Application
1. *Redundancy Elimination Within Large Collections of Files*----USENIX ATC'04 ([link](https://www.usenix.org/legacy/publications/library/proceedings/usenix04/tech/general/full_papers/kulkarni/kulkarni.pdf))
2. *The Design of a Similarity Based Deduplication System*----SYSTOR'09 ([link](https://dl.acm.org/doi/pdf/10.1145/1534530.1534539))
3. *Delta Compressed and Deduplicated Storage Using Stream-Informed Locality*----HotStorage'12 ([link](https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final38_0.pdf)) [summary](https://yzr95924.github.io/paper_summary/deltaStore-HotStorage'12.html)
@@ -158,6 +156,8 @@ A reading list related to storage systems, including data deduplication, erasure
20. *Donag: Generating Efficient Patches and Diffs for Compressed Archives*----ACM TOS'22 ([link](https://dl.acm.org/doi/pdf/10.1145/3507919))
21. *LoopDelta: Embedding Locality-aware Opportunistic Delta Compression in Inline Deduplication for Highly Efficient Data Reduction*----USENIX ATC'23 ([link](https://www.usenix.org/system/files/atc23-zhang-yucheng.pdf))
22. *Palantir: Hierarchical Similarity Detection for Post-Deduplication Delta Compression*----ASPLOS'24 ([link](https://qiangsu97.github.io/files/asplos24spring-final6.pdf))
23. *DedupSearch: Two-Phase Deduplication Aware Keyword Search*----FAST'22 ([link](https://www.usenix.org/system/files/fast22-elias.pdf)) [summary](https://yzr95924.github.io/paper_summary/DedupSearch-FAST'22.html)
24. *Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index*----FAST'24 ([link](https://www.usenix.org/system/files/fast24-levi.pdf)) [summary](https://yzr95924.github.io/paper_summary/IDEA-FAST'24.html)

### Memory && Block-Layer Deduplication

@@ -517,7 +517,7 @@ A reading list related to storage systems, including data deduplication, erasure
### HPC Storage

1. *GPFS: A Shared-Disk File System for Large Computing Clusters*----FAST'02 ([link](https://www.usenix.org/legacy/publications/library/proceedings/fast02/full_papers/schmuck/schmuck.pdf))
2. *Efficient Object Storage Journaling in a Distributed Parallel File System*----FAST'10 ([link](Efficient Object Storage Journaling in a Distributed Parallel File System))
2. *Efficient Object Storage Journaling in a Distributed Parallel File System*----FAST'10 ([link](https://www.usenix.org/legacy/events/fast10/tech/full_papers/oral.pdf))
3. *Taking back control of HPC file systems with Robinhood Policy Engine*----arxiv'15 ([link](https://arxiv.org/abs/1505.01448))
4. *Lustre Lockahead: Early Experience and Performance using Optimized Locking*----CUG'17 ([link](https://cug.org/proceedings/cug2017_proceedings/includes/files/pap141s2-file1.pdf))
5. *LPCC: Hierarchical Persistent Client Caching for Lustre*----SC'19 ([link](https://dl.acm.org/doi/pdf/10.1145/3295500.3356139)) [slides](https://sc19.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap112s5.pdf)
Binary file added paper_figure/image-20240319012231775.png
Binary file added paper_figure/image-20240321001530743.png
Binary file added paper_figure/image-20240321002025742.png
Binary file added paper_figure/image-20240321204347685.png
Binary file added paper_figure/image-20240321210826877.png
195 changes: 195 additions & 0 deletions storage_paper_note/IDEA-FAST'24.md
@@ -0,0 +1,195 @@
---
typora-copy-images-to: ../paper_figure
---
# Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index

| Venue | Category |
| :------------------------: | :------------------: |
| FAST'24 | Deduplicated System Design, Post-Deduplication Management |
[TOC]

## 1. Summary
### Motivation of this paper

- motivation
- indexing deduplicated data might result in extreme inefficiencies
- index size
- proportional to the logical data size, **regardless of its deduplication ratio**
- each term must point to all the files containing it, <u>**even if the files' content is almost identical**</u>
- index creation overhead
- random and redundant accesses to the physical chunks
- **term indexing** is not supported by any deduplicating storage system
- focus on **textual data**
- VMware vSphere and Commvault only support file indexing
- identifies individual files within a backup based on metadata
- Dell-EMC Data Protection Search
- support full content indexing
- warn: processing the full content of a large number of files can be **time consuming**
- recommend performing targeted indexing on **specific backups and file types**
- challenge
- two separate trends
- the growing need to process **cold data** (e.g., old backups)
- e.g., full-system scans, keyword searches --> deduplication-aware search
- the growing application of deduplication on primary storage of hot and warm data
- e.g., perform single-term searches for files within deduplicated personal workstation
- indexing software on file-system level --> **unaware** of the underlying deduplication at the storage system
- index size
- increase --> increase the latency of lookups
- index time
- scan all files in the system --> random IOs, high read amplification
- split terms
- chunking process will likely split the incoming data into chunks (at **arbitrary position**)
- splitting words between adjacent chunks

### IDEA

- ![image-20240321002025742](./../paper_figure/image-20240321002025742.png)

- key idea
- map terms to the unique physical chunks they appear in
- instead of the logical documents (whose number is disproportionately high)
- replace term-to-file mapping with
- term-to-chunk map
- chunk-to-file map (file ID)
- only need to modify chunking process in deduplication system
- **white-space aware** --> enforce chunk boundaries only between words
- white-space aligned chunking (see the chunking sketch at the end of this section)
- content-defined chunking
- **continue scanning** the following characters until a white-space character is encountered
- fixed-size chunking
- **backward scanning** this chunk until a white-space character is encountered
- resulting chunks are always smaller than the fixed size --> can be stored in a single block
- can trim the block in memory to chunk boundary
- non-textual content
- only to chunking of **textual content**
- identify textual content by the file extension of the incoming data
- .c, .h, and .htm
- add a Boolean field to the metadata of each chunk in the file recipe and container
- only process chunks marked as textual
- term-to-chunk mapping
- number of documents in the index --> number of physical chunks
- might be higher than the number of logical files
- chunks are **read sequentially**, each chunk is processed only once
- processing chunks is easily parallelizable

- lookup
- return the fingerprints of the chunks this term appears in

- chunk-to-file mapping
- two complementary maps
- chunk-to-file map
- chunk fingerprint --> file IDs
- file-to-path map
- file IDs --> file's full pathname
- created from the metadata in the file recipe

- keyword/term lookup (see the lookup sketch at the end of this section)
- step-1: yield the fingerprints of all the relevant chunks
- step-2: a series of lookups in the chunk-to-file map
- retrieves the IDs of all files containing these chunks
- step-3: a lookup of each file ID in the file-to-path map
- returns the final list of file names
- ranking results
- extend IDEA to support document ranking with the TF-IDF metric
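
A minimal sketch of the white-space-aligned boundary adjustment described in the chunking bullets above. This is a simplified model of the idea, not the authors' implementation; the function names are hypothetical.

```python
# Hypothetical sketch: adjust a proposed chunk boundary so it falls on white space.
WHITESPACE = frozenset(b" \t\n\r")

def align_cdc_boundary(data: bytes, boundary: int) -> int:
    """Content-defined chunking: keep scanning forward from the proposed cut
    point until a white-space character is found, so no word is split."""
    while boundary < len(data) and data[boundary] not in WHITESPACE:
        boundary += 1
    return boundary

def align_fixed_boundary(data: bytes, boundary: int) -> int:
    """Fixed-size chunking: scan backward from the fixed cut point until white
    space is found; the resulting chunk is never larger than the fixed size,
    so it can be trimmed to the boundary in memory and stored in one block."""
    pos = boundary
    while pos > 0 and data[pos - 1] not in WHITESPACE:
        pos -= 1
    return pos if pos > 0 else boundary  # fallback: no white space in this chunk
```

In both cases the chunk never ends in the middle of a word, so terms can be extracted from each physical chunk independently.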
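
A minimal sketch of the three-step keyword lookup described above, using plain in-memory dicts as stand-ins for the term-to-chunk, chunk-to-file, and file-to-path maps (the real maps live in Lucene and external key-value stores; all names here are hypothetical).

```python
def idea_lookup(term: str,
                term_to_chunks: dict[str, set[str]],   # term -> chunk fingerprints
                chunk_to_files: dict[str, set[int]],   # fingerprint -> file IDs
                file_to_path: dict[int, str]) -> list[str]:
    # Step 1: fingerprints of all physical chunks containing the term
    fingerprints = term_to_chunks.get(term, set())
    # Step 2: IDs of all files whose recipes reference these chunks
    file_ids: set[int] = set()
    for fp in fingerprints:
        file_ids |= chunk_to_files.get(fp, set())
    # Step 3: resolve each file ID to its full pathname
    return sorted(file_to_path[fid] for fid in file_ids)
```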

### Implementation and Evaluation

- implementation
- LucenePlusPlus + Destor
- use Lucene term-to-doc map
- ![image-20240321204347685](./../paper_figure/image-20240321204347685.png)
- scan all file recipes from Destor
- create the list of files containing each chunk using a key-value store (see the sketch at the end of this section)
- use an SSD for the data structures which are external to Lucene
- experimental setup
- trace
- ![image-20240321210826877](./../paper_figure/image-20240321210826877.png)

- hardware
- maps of all index alternatives were stored on a separate HDD
- chunk-to-file and file-to-path maps of IDEA were stored on a SSD

- evaluation
- baseline
- traditional deduplication-oblivious indexing (Naive)

- indexing time
- the reduction is proportional to the **deduplication ratio**
- recipe-processing time is negligible compared to the chunk-processing time

- indexing time of IDEA is shorter than that of Naive by 49% to 76%

- index size
- Naive must record more files for all the terms included in them
- IDEA's additional information is recorded per chunk, not per term

- lookup times
- IDEA is faster than Naive by up to 82%
- smaller size of its term-to-doc map
- incur shorter lookup latency

- IDEA overhead
- when the deduplication ratio is low, IDEA has no advantage compared to deduplication-oblivious indexing
- the additional layer of indirection incurs **non-negligible overheads**, which are masked only <u>where the deduplication ratio is sufficiently high</u>
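
A minimal sketch of how the chunk-to-file and file-to-path maps could be built by scanning file recipes, as in the implementation bullets above (in-memory model only; the paper persists these structures in an external key-value store on SSD, and the names here are hypothetical).

```python
from collections import defaultdict

def build_idea_maps(recipes: dict[int, list[str]],   # file ID -> chunk fingerprints
                    pathnames: dict[int, str]):      # file ID -> full pathname
    """Scan every file recipe once and invert it into a chunk-to-file map;
    the file-to-path map comes directly from the recipe metadata."""
    chunk_to_files: dict[str, set[int]] = defaultdict(set)
    for file_id, fingerprints in recipes.items():
        for fp in fingerprints:
            chunk_to_files[fp].add(file_id)
    file_to_path = dict(pathnames)
    return chunk_to_files, file_to_path
```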


## 2. Strength (Contributions of the paper)

- first design of a deduplication-aware term index
- implementation of IDEA on Lucene
- open-source single-node inverted index used by Elasticsearch
- extensive evaluation

## 3. Weakness (Limitations of the paper)

- trace is not very large
- files containing compressed text (.pdf, .docx)
- their textual content can only be processed after the file is opened by a suitable application or converted by a dedicated tool
- individual chunks cannot be processed during offline index creation

## 4. Some Insights (Future work)

- deduplication scenarios
- backup and archival systems
- log-structured manner: chunk --> containers
- content-defined chunking
- primary (non-backup) storage system and appliances
- support direct access to <u>individual chunks</u>
- fixed-sized chunking
- align the deduplicated chunks with the storage interface
- deduplication data management
- implicit sharing of content between files transforms logically-sequential data accesses into random IOs in the underlying physical media, and complicates the following:
- GC
- load balancing between volumes
- caching
- charge-back
- term indexing: **term-to-file** indexing (map)
- ![image-20240321001530743](./../paper_figure/image-20240321001530743.png)
- return the files containing **a keyword** or **term**
- search engines, data analytics
- searched data might be deduplicated
- e.g. Elasticsearch
- built on top of the single-node Apache Lucene
- based on a hierarchy of skip-lists
- other variations
- Amazon OpenSearch, IBM Watson
- keyword: any searchable strings (natural language words)
- query
- the list of files containing this keyword
- optional: byte offsets in which the term appears
- indexing creation
- collect the documents
- identify the terms within each document
- normalize the terms
- create the list of documents, and optionally offsets, containing each term
- result ranking
- using a **scoring formula** on each result
- TF-IDF (see the generic sketch at the end of this section)
- ![image-20240319012231775](./../paper_figure/image-20240319012231775.png)
- deduplication basic
- file recipe
- a list of chunks' fingerprints, their sizes
- restore: locate the chunk by searching in the fingerprint map or a cache of its entries (see the restore sketch at the end of this section)
- pack the **compressed data** into containers
- standard storage functionality
- can be made more efficient by taking advantage of deduplicated state
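
A generic sketch of the index-creation steps and TF-IDF scoring listed above (textbook TF-IDF, not Lucene's actual scoring formula; in IDEA the "documents" on the indexing side are physical chunks rather than logical files).

```python
import math
import re
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Collect documents, identify and normalize terms (lowercase here), and
    record per-document term frequencies: term -> {doc_id -> count}."""
    index: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term][doc_id] += 1
    return index

def tf_idf(term: str, doc_id: str, index, num_docs: int) -> float:
    """Standard TF-IDF: term frequency weighted by inverse document frequency."""
    postings = index.get(term, {})
    tf = postings.get(doc_id, 0)
    df = len(postings)
    return tf * math.log(num_docs / df) if df else 0.0
```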
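
A minimal sketch of restore from a file recipe, matching the deduplication basics above (simplified: a real system would read whole containers, decompress, and cache chunks; the names here are hypothetical).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RecipeEntry:
    fingerprint: bytes   # chunk fingerprint from the file recipe
    size: int            # chunk size from the file recipe

def restore_file(recipe: list[RecipeEntry],
                 fingerprint_map: dict[bytes, tuple[int, int]],  # fp -> (container ID, offset)
                 read_chunk: Callable[[int, int, int], bytes]) -> bytes:
    """Locate each chunk via the fingerprint map and concatenate the chunk
    data in recipe order to rebuild the original file."""
    data = bytearray()
    for entry in recipe:
        container_id, offset = fingerprint_map[entry.fingerprint]
        data += read_chunk(container_id, offset, entry.size)
    return bytes(data)
```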
@@ -3,8 +3,8 @@ typora-copy-images-to: ../paper_figure
---
DUPEFS: Leaking Data Over the Network With Filesystem Deduplication Side Channels
------------------------------------------
| Venue | Category |
| :------------------------: | :------------------: |
| Venue | Category |
| :-----: | :------------------: |
| FAST'22 | Secure Deduplication |
[TOC]

4 changes: 2 additions & 2 deletions storage_paper_note/template.md
@@ -1,11 +1,11 @@
---
typora-copy-images-to: ../paper_figure
---
# Light-Dedup: A Light-weight Inline Deduplication Framework for Non-Volatile Memory File Systems
# Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index

| Venue | Category |
| :------------------------: | :------------------: |
| ATC'18 | LSM+PM |
| FAST'24 | Deduplicated System Design, post-deduplication application |
[TOC]

## 1. Summary
