Cannot download? #10

Open
sri-hk opened this issue Jan 24, 2024 · 5 comments

Comments

@sri-hk

sri-hk commented Jan 24, 2024

Hi, nice work! But the data cannot be downloaded.

[image]
The dataset laion2b-en-vit-h-14-embeddings has been disabled. Is there any other way to get your deduplicated laion-2b-en data?

Looking forward to your reply, thanks!

@Jesse-XIE

+1

@ryanwebster90
Owner

As sri-hk said, LAION is no longer distributing laion2b and its variants. Since I only provided a filtering of that data, it won't be available anymore. I'll remove that code soon and may add functionality to deduplicate your own dataset or another dataset.

@ppwwyyxx

ppwwyyxx commented Feb 1, 2024

@ryanwebster90 Can you provide the deduped results keyed by URL?

That way, users would not need to download the now-deleted dataset from Hugging Face. With URLs, labs that have already downloaded laion2b (but in a different format or order from the Hugging Face dataset) would be able to use your deduped results.

@ryanwebster90
Owner

@ppwwyyxx I cannot, and I don't plan to, as the dataset is facing ethical issues. For now, I'd suggest checking out DataComp-1B; I may deduplicate that dataset instead.

@sri-hk
Author

sri-hk commented Feb 2, 2024

> As sri-hk said, LAION is no longer distributing laion2b and its variants. Since I only provided a filtering of that data, it won't be available anymore. I'll remove that code soon and may add functionality to deduplicate your own dataset or another dataset.

Thank you for your reply.
I found another way to deduplicate, and so far so good. pHash is highly efficient and fast, but you may need many CPUs and a lot of memory.
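
For readers wondering what pHash-based deduplication might look like: the comment above does not say which implementation or settings were used, so the following is only a minimal sketch of the general technique (a DCT-based perceptual hash compared by Hamming distance). The names `phash`, `hamming`, and `deduplicate`, the 32x32 input size, and the `max_distance=4` threshold are all illustrative assumptions. It also assumes each image has already been decoded, converted to grayscale, and resized to 32x32 by an image library of your choice.

```python
import math
from statistics import median

N = 32  # assumed side length of the already-resized grayscale input
K = 8   # side length of the low-frequency DCT block kept for the hash

# DCT-II cosine basis, precomputed for the first K frequencies only.
_COS = [[math.cos(math.pi * (2 * x + 1) * u / (2 * N)) for x in range(N)]
        for u in range(K)]


def phash(pixels):
    """Return a 64-bit perceptual hash of an N x N grayscale matrix."""
    # Row pass: project every image row onto the K lowest frequencies.
    rows = [[sum(p * c for p, c in zip(row, _COS[u])) for u in range(K)]
            for row in pixels]
    # Column pass: project the result onto the K lowest vertical frequencies.
    coeffs = [[sum(rows[y][u] * _COS[v][y] for y in range(N))
               for u in range(K)]
              for v in range(K)]
    flat = [c for r in coeffs for c in r]
    # Threshold at the median, ignoring the DC term (overall brightness).
    med = median(flat[1:])
    bits = 0
    for c in flat:
        bits = (bits << 1) | (c > med)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(items, max_distance=4):
    """Greedy near-duplicate removal over (key, pixels) pairs.

    Keeps the first item of every group whose hashes lie within
    max_distance bits of each other. This is O(n^2) in the number of
    kept items, so it only illustrates the idea.
    """
    kept = []
    for key, pixels in items:
        h = phash(pixels)
        if all(hamming(h, kh) > max_distance for _, kh in kept):
            kept.append((key, h))
    return [key for key, _ in kept]
```

The pairwise comparison in `deduplicate` would not scale to billions of images; at that size the hashes would presumably need to be sharded or indexed, which is consistent with the note above about needing many CPUs and a lot of memory.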

@sri-hk sri-hk closed this as completed Feb 2, 2024
@sri-hk sri-hk reopened this Feb 2, 2024