Allow entire dataset to be downloaded en-masse #669

AetherUnbound · 2022-09-20T04:19:36Z

Description

Presently, if users want our entire dataset, they must crawl through every possible search in the hope of surfacing all the results we have. We've discussed this in the past, but it would be ideal to have a bulk download option available for those who would like to use the entire dataset (e.g. iNaturalist's dataset: https://github.com/inaturalist/inaturalist-open-data).

This could be Parquet or TSV files in a publicly accessible S3 bucket, or some other means of pulling the entire dataset.
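As a rough illustration of the TSV option, here is a minimal sketch that writes a batch of records to a tab-separated file using only the Python standard library. The column names in `EXPORT_FIELDS`, the `write_tsv` helper, and the sample record are all hypothetical; the fields to include or exclude have not been decided, and uploading the result to S3 is left out.

```python
import csv
import io
from typing import Iterable, Mapping

# Hypothetical column set: the issue does not specify which fields to export.
EXPORT_FIELDS = ["identifier", "title", "creator", "license", "url"]


def write_tsv(records: Iterable[Mapping[str, object]], out) -> int:
    """Write records as tab-separated rows and return the number of rows written."""
    writer = csv.DictWriter(
        out,
        fieldnames=EXPORT_FIELDS,
        delimiter="\t",
        extrasaction="ignore",  # silently drop fields not chosen for export
        lineterminator="\n",
    )
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


# Made-up sample record; "internal_score" stands in for a field we would exclude.
sample = [
    {
        "identifier": "abc-123",
        "title": "Sunset",
        "creator": "Jane",
        "license": "by",
        "url": "https://example.com/1",
        "internal_score": 0.9,
    },
]
buffer = io.StringIO()
rows_written = write_tsv(sample, buffer)
```

Passing an open file handle instead of the `io.StringIO` buffer would write the same rows to disk. Streaming records through an iterable keeps memory use flat, which would matter for an export of the full dataset.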

Implementation

🙋 I would be interested in implementing this feature.

MallikharjunaTeja · 2022-10-11T14:25:43Z

I want to work on this feature. @AetherUnbound @dhruvkb, could you assign this to me?

AetherUnbound · 2022-10-25T23:45:18Z

Hi @MallikharjunaTeja! Thanks for offering your assistance 🙂 Before work proceeds on this, we need a plan fleshed out for what these bulk downloads would look like. How will the files be generated from our system? Would there need to be coordination with the Openverse Catalog, since we would likely need a scheduled DAG in order to run this? What fields and/or models would we include and exclude? I think this project will ultimately need an RFC written for it; you can find instructions and examples here: https://github.com/WordPress/openverse/tree/main/rfcs. The maintainers group currently doesn't have this slated for our near-term priorities, but if you would like to go ahead and give this a shot, please feel free! We're happy to assist you and answer any questions you might have, particularly over in the Make WP Slack #openverse channel. Please let me know if you'd like to take on this work and I'll assign the issue to you.

Alternatively, we have a large number of issues across our repos which are marked as "good first issues". These are issues we felt would be easy to jump into as a contributor. If you're looking to contribute to the project in general, I encourage you to take a look at the list here. We'd be happy to assign any one of those issues to you as well 😄

Skylion007 · 2023-06-02T21:16:56Z

Interested in discussing this, even for a one-time export.

zackkrida · 2023-07-12T11:36:34Z

Closing this in favor of tracking via this project: #2545

AetherUnbound added the 🌟 goal: addition, 🕹 aspect: interface, and 🟩 priority: low labels Sep 20, 2022

krysal added the 💬 talk: discussion label Oct 12, 2022

obulat transferred this issue from WordPress/openverse-api Feb 22, 2023

obulat added the stack: backend label Feb 22, 2023

obulat added this to Openverse Backlog Feb 23, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Feb 23, 2023

obulat added the 🧱 stack: api label and removed the 🧱 stack: backend label Mar 20, 2023

zackkrida closed this as completed Jul 12, 2023

github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Jul 12, 2023
