Add a component for image based deduplication #476

Open
satishjasthi opened this issue Sep 29, 2023 · 3 comments
Assignees
Labels
Components Implementation of components

Comments

@satishjasthi
Contributor

This issue tracks creating a new component for image-based deduplication in the Fondant-cc-25m data preprocessing pipeline, using the imagededup library.
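For readers new to the technique, hash-based deduplication (one of the families of methods that libraries such as imagededup offer) can be sketched roughly as below. This is only an illustration under stated assumptions: images are represented as tiny grayscale grids, the hash is a simple average hash, and the names `average_hash`, `hamming`, `find_duplicates`, and `max_distance` are hypothetical. It is not imagededup's API and not the planned Fondant component.

```python
from itertools import combinations


def average_hash(pixels):
    """Return a bit tuple: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def find_duplicates(images, max_distance=1):
    """Map each image name to the names whose hashes differ by at most max_distance bits."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    dupes = {name: [] for name in images}
    for a, b in combinations(images, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            dupes[a].append(b)
            dupes[b].append(a)
    return dupes


# Toy 2x2 grayscale "images" (hypothetical data, values 0-255).
images = {
    "cat.png": [[10, 200], [220, 30]],
    "cat_brighter.png": [[20, 210], [230, 40]],  # same picture, slightly brighter
    "dog.png": [[200, 10], [30, 220]],
}
duplicates = find_duplicates(images)
print(duplicates)
```

A real component would first decode and downscale each image to a fixed small grid before hashing, and would likely use an index rather than comparing all pairs, since pairwise comparison grows quadratically with dataset size.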

@geroldmeisinger

geroldmeisinger commented Oct 3, 2023

I'm watching your project with interest and recently had to deduplicate an image dataset of my own for ControlNet training. Out of interest, why did you choose imagededup specifically, and do you have a comparison of different approaches? (By the way: in my research I stumbled upon fastdup by visual-layer, which was easy to use and fast, and fiftyone by voxel51, which seems more sophisticated (and also includes an image suite).)

@satishjasthi
Contributor Author

Hi @geroldmeisinger, thanks for your suggestions. Indeed, Fastdup and Fiftyone's image uniqueness component seem much better than the imagededup library. I'll test Fastdup against Fiftyone to select the better one for the image deduplication component, and I'll share the comparison results.

@geroldmeisinger

  1. img2dataset, the download tool from LAION, mentions..
  2. DataComp: In search of the next generation of multimodal datasets, a super-phat 12.8B dataset and collab of some 30 authors, which mentions...
  3. "F Deduplication against evaluation sets" -> Contrastive Learning with Large Memory Bank and Negative Embedding Subtraction for Accurate Copy Detection as "We employ the deduplication model proposed by Yokoo [145], which earned 1st place in the Facebook AI Image Similarity Challenge (ISC) [40].", which is
  4. The 2021 Image Similarity Dataset and Challenge

Projects
Status: Backlog
Development

No branches or pull requests

3 participants