Commit 30ce3ab

Update README.md
1 parent fc67854 commit 30ce3ab

1 file changed, +3 -17 lines changed

README.md

@@ -8,18 +8,16 @@

 ## This repo is a WIP

-This repo is a WIP, but the main functionalities will be:
+You no longer can filter the LAION dataset to remove duplicates, as LAION disabled the webdataset on huggingface. I'll focus on adding some functionality for deduplication for future webdatasets using clip features.

-- [x] Download de-duplicated versions of LAION-2B-en (Better versions coming soon...)
-- [ ] Download small indices (25-40GB) for retrieval / dataset creation / de-duplciation
 - [ ] Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L14, ViT-B-32)
 - [x] Read our research paper
 - [ ] Train SNIP on your CLIP features
 - [ ] Run a de-duplication of your dataset using our de-dup code

 SNIP is a technique to compress CLIP features. It is competitive with previous works for large scale retrieval of deep features, and has some nice properties for multi-modal features. Read more about it [here](https://arxiv.org/abs/2303.12733).

-We used SNIP to perform several de-duplications of LAION-2B-en. Our latest de-duplication found roughly 700M duplicates (we define total duplicates as total samples - duplicate groups). SNIP performs well at high compression ratios and can run at very high q/s with low memory.
+We used SNIP together with the faiss library to deduplicate a billions scale dataset, and found a high level of duplication (roughly 700M / 2 billion). This webdataset is no longer being distributed by laion.

 ## Install


@@ -66,18 +64,6 @@ while True:

 The labels of the above loop can be found on huggingface [vitl14_labels](https://huggingface.co/datasets/fraisdufour/snip-dedup/resolve/main/representatives/representatives_vitl14_fixed_pt.npy).

-## Misc files (old)
-
-We release this index for public use and exploration of the LAION-2B-en dataset.
-
-You may find the following necessary files here:
-
-[Binary array of De-duplicated Images](https://drive.google.com/file/d/1RYDylZKaPyaVs5YNwIrGqHU2BewdFwxY/view?usp=sharing)
-
-[SNIP index](https://drive.google.com/file/d/1RYDylZKaPyaVs5YNwIrGqHU2BewdFwxY/view?usp=sharing)
-
-[SNIP descriptor](https://drive.google.com/file/d/1QTA9yWevwPMhvMW8P5mAIBDy42xUpr-m/view?usp=share_link)
-
 Other:

 [cumulative sizes of features (for indexing sharded files)](https://drive.google.com/file/d/1OdVt5rjYw55XfMhsQSdqcVOP7lG2qj4W/view?usp=sharing)
@@ -101,7 +87,7 @@ you may check a list of (randomly sampled) detected duplicate pairs [here](https

 ## Semantic Search

-SNIP can also be used for semantic search. At just 25GB, it still can return the same k-NN's compared to exhaustive search roughly a third of the time, over 2.15B database vectors.
+You may use the compressed features to do semantic search with faiss (see for instance, the clip-retrieval repository).

 ## Contribute

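The README text removed in this commit defines total duplicates as total samples minus duplicate groups. A minimal sketch of that arithmetic follows, under two assumptions: each sample carries the label of its group representative, and a sample with no near-duplicates counts as its own group. The function `count_duplicates` and the toy labels are hypothetical illustrations, not part of the repository.

```python
from collections import Counter


def count_duplicates(representatives):
    """Return (n_groups, n_duplicates) for a sequence of per-sample group labels.

    Assumption: representatives[i] is the label of the group that sample i
    belongs to, and singletons form their own group.
    """
    group_sizes = Counter(representatives)
    n_groups = len(group_sizes)
    # Total duplicates = total samples - duplicate groups.
    n_duplicates = len(representatives) - n_groups
    return n_groups, n_duplicates


# Toy labels (hypothetical): samples 0-1 share a group, sample 2 is alone,
# samples 3-5 share a group.
toy_labels = [0, 0, 2, 3, 3, 3]
n_groups, n_duplicates = count_duplicates(toy_labels)
print(f"{n_groups} groups, {n_duplicates} duplicates out of {len(toy_labels)} samples")
```

Under this definition, one sample per group is kept and every other member is counted as a duplicate.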
