Allow entire dataset to be downloaded en-masse #669
Comments
I'd like to work on this feature. @AetherUnbound @dhruvkb, could you assign it to me?
Hi @MallikharjunaTeja! Thanks for offering your assistance 🙂 Before work proceeds on this, we need a fleshed-out plan for what these bulk downloads would look like. How will the files be generated from our system? Would there need to be coordination with the Openverse Catalog, since we would likely need a scheduled DAG to run this? Which fields and/or models would we include and exclude? I think this project will ultimately need an RFC written for it; you can find instructions and examples here: https://github.com/WordPress/openverse/tree/main/rfcs. The maintainers group doesn't currently have this slated among our near-term priorities, but if you would like to go ahead and give it a shot, please feel free! We're happy to assist you and answer any questions you might have, particularly over in the Make WP Slack #openverse channel. Please let me know if you'd like to take on this work and I'll assign the issue to you. Alternatively, we have a large number of issues across our repos marked as "good first issues". These are issues we felt would be easy to jump into as a contributor. If you're looking to contribute to the project in general, I encourage you to take a look at the list here. We'd be happy to assign any one of those issues to you as well 😄
I'm interested in discussing this, even for a one-time export.
Closing this in favor of tracking via this project: #2545
Description
Presently, if users want our entire dataset, they must crawl through all possible searches in the hope of surfacing every result we have. We've discussed this in the past, but it would be ideal to have a bulk download option available for those who would like to use the entire dataset (e.g. iNaturalist's dataset: https://github.com/inaturalist/inaturalist-open-data).
This could take the form of Parquet or TSV files in a publicly accessible S3 bucket, or some other means of pulling the entire dataset.
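As a rough illustration of the TSV option only, here is a minimal sketch that writes records to gzip-compressed TSV shards using the Python standard library. Everything specific in it is an assumption made for illustration: the field list, the `write_tsv_shards` helper, the shard file naming, and the shard size are hypothetical, and nothing here reflects how Openverse stores or would actually export its data. Uploading the resulting files to S3 is left out.

```python
import csv
import gzip
import itertools
from pathlib import Path
from typing import Iterable, Iterator

# Hypothetical field list; which fields a real export would include is undecided.
FIELDS = ["identifier", "title", "creator", "license", "url"]


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(itertools.islice(iterator, size)):
        yield batch


def write_tsv_shards(
    rows: Iterable[dict], out_dir: Path, shard_size: int = 100_000
) -> list[Path]:
    """Write rows to gzip-compressed TSV shards and return the shard paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(batched(rows, shard_size)):
        # Hypothetical naming scheme for the shard files.
        path = out_dir / f"openverse-export-{index:05d}.tsv.gz"
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(
                handle,
                fieldnames=FIELDS,
                delimiter="\t",
                extrasaction="ignore",  # silently drop keys not listed in FIELDS
            )
            writer.writeheader()
            writer.writerows(batch)
        paths.append(path)
    return paths
```

Sharding keeps each file small enough to download and process independently, and every shard carries its own header row so it can be read in isolation.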
Implementation