README.md (+3 -17)
@@ -8,18 +8,16 @@
 
 ## This repo is a WIP
 
-This repo is a WIP, but the main functionalities will be:
+You no longer can filter the LAION dataset to remove duplicates, as LAION disabled the webdataset on huggingface. I'll focus on adding some functionality for deduplication for future webdatasets using clip features.
 
--[x] Download de-duplicated versions of LAION-2B-en (Better versions coming soon...)
--[ ] Download small indices (25-40GB) for retrieval / dataset creation / de-duplciation
 -[ ] Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L14, ViT-B-32)
 -[x] Read our research paper
 -[ ] Train SNIP on your CLIP features
 -[ ] Run a de-duplication of your dataset using our de-dup code
 
 SNIP is a technique to compress CLIP features. It is competitive with previous works for large scale retrieval of deep features, and has some nice properties for multi-modal features. Read more about it [here](https://arxiv.org/abs/2303.12733).
 
-We used SNIP to perform several de-duplications of LAION-2B-en. Our latest de-duplication found roughly 700M duplicates (we define total duplicates as total samples - duplicate groups). SNIP performs well at high compression ratios and can run at very high q/s with low memory.
+We used SNIP together with the faiss library to deduplicate a billions scale dataset, and found a high level of duplication (roughly 700M / 2 billion). This webdataset is no longer being distributed by laion.
 
 ## Install
 
@@ -66,18 +64,6 @@ while True:
 
 The labels of the above loop can be found on huggingface [vitl14_labels](https://huggingface.co/datasets/fraisdufour/snip-dedup/resolve/main/representatives/representatives_vitl14_fixed_pt.npy).
 
-## Misc files (old)
-
-We release this index for public use and exploration of the LAION-2B-en dataset.
-
-You may find the following necessary files here:
-
-[Binary array of De-duplicated Images](https://drive.google.com/file/d/1RYDylZKaPyaVs5YNwIrGqHU2BewdFwxY/view?usp=sharing)
-[cumulative sizes of features (for indexing sharded files)](https://drive.google.com/file/d/1OdVt5rjYw55XfMhsQSdqcVOP7lG2qj4W/view?usp=sharing)
@@ -101,7 +87,7 @@ you may check a list of (randomly sampled) detected duplicate pairs [here](https
 
 ## Semantic Search
 
-SNIP can also be used for semantic search. At just 25GB, it still can return the same k-NN's compared to exhaustive search roughly a third of the time, over 2.15B database vectors.
+You may use the compressed features to do semantic search with faiss (see for instance, the clip-retrieval repository).
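
Note for readers of this diff: the paragraph removed from the first hunk defines total duplicates as total samples minus duplicate groups. The sketch below shows one way that count could be computed. It is not the repository's code. It assumes duplicate pairs have already been found (for example by a nearest-neighbour search over features), and it assumes a "duplicate group" is a connected component of those pairs, with every unpaired sample counting as its own group. The function name `count_duplicates` and the input format are hypothetical.

```python
def count_duplicates(num_samples, duplicate_pairs):
    """Return (total_duplicates, num_groups) for samples 0..num_samples-1.

    Assumption: a group is a connected component of the duplicate pairs,
    and a sample with no duplicates forms a group of size one.
    """
    parent = list(range(num_samples))  # union-find forest, one root per group

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    groups = num_samples  # start with every sample in its own group
    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups
            groups -= 1

    # Total duplicates = total samples - duplicate groups.
    return num_samples - groups, groups


if __name__ == "__main__":
    # Six samples: {0, 1, 2} are mutual duplicates, {4, 5} are duplicates, 3 is unique.
    total, groups = count_duplicates(6, [(0, 1), (1, 2), (4, 5)])
    print(total, groups)
```

Under this definition, keeping one representative per group removes exactly `total_duplicates` samples, which is how a figure such as "roughly 700M out of 2 billion" can be read.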