Project Proposal: Openverse Datasets #2637
Full-stack documentation: https://docs.openverse.org/_preview/2637
Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
documentation/projects/proposals/publish_dataset/20230706-project_proposal.md
Heya, I think the project proposal looks great @zackkrida! Thanks for putting it together! I made very minor comments across the doc.
I think the document has a fair summary of the capabilities of the platform relative to this project. I would just add that the streaming feature of datasets makes the data accessible to people who may not have enough storage to hold the entire data dump. That may be helpful as part of the democratisation of this data.
+1 on streaming the dataset. It can also allow people to easily and quickly generate various subsets of the data.
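A rough sketch of what streaming and subsetting could look like for a consumer of the dataset, assuming the Hugging Face `datasets` library. The repository name `openverse/images` and the `license` field are hypothetical placeholders, not names taken from the proposal.

```python
from itertools import islice


def take_subset(records, license_slug, limit):
    """Lazily scan `records` and return up to `limit` rows with a matching license.

    `records` can be any iterable of dicts, including a streamed dataset,
    so the full data dump never has to be stored locally.
    """
    matching = (row for row in records if row.get("license") == license_slug)
    return list(islice(matching, limit))


def stream_openverse_subset(license_slug, limit):
    """Stream a (hypothetical) Openverse dataset and collect a small subset."""
    # Imported here so take_subset() stays usable without the library installed.
    from datasets import load_dataset

    # "openverse/images" is an assumed repository name, used only for illustration.
    stream = load_dataset("openverse/images", split="train", streaming=True)
    return take_subset(stream, license_slug, limit)
```

Because `islice` stops as soon as `limit` matches are found, only the rows read up to that point are ever downloaded when the input is a streamed dataset.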
This project plan looks excellent, thank you for drafting it Zack! My notes are merely surface level/wording, I'm aligned with everything else here 🙂
This looks great. I'd like clarification and/or additional information about the following points:
- Licensors, from my perspective, are also stakeholders. Respecting their intentions and properly communicating the usage conditions is especially important for a project where every single work has an explicit license. Noting specific license elements like NC, ND, and SA that have known nuances in the distribution of the dataset feels important enough to list as a requirement.
- On the other hand, "licensors" are not the only stakeholders from that perspective, and PDM works present a significant complication in this regard. The complication is partly regional and legal, and partly about cultural artefacts: institutions distribute PDM-marked works based on artefacts that they obtained and distributed without consulting or otherwise involving the cultures the artefacts were taken from. Openverse can't do much to fix the underlying problems, but we do need to take care to limit our liability in this regard the same way we do in our general terms of service. Should the dataset have the same terms of service applied? A specific terms of service/disclaimer worded directly for the dataset would better protect the project, I imagine.
- Does the first implementation plan also include the documentation updates you mentioned? Can that be listed as an explicit requirement? It is easy to miss something like that when writing an implementation plan that is sure to already be significant in other respects.
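To make the first point concrete, here is a minimal sketch of how the license elements could be surfaced per work. It assumes hyphen-separated license slugs such as `by-nc-sa`; the function names and the returned keys are hypothetical, and the mapping is a simplification for illustration, not legal guidance.

```python
def license_elements(slug):
    """Split a license slug such as "by-nc-sa" into its individual elements."""
    return set(slug.lower().split("-"))


def usage_conditions(slug):
    """Derive coarse usage flags from the elements present in a license slug."""
    elements = license_elements(slug)
    return {
        "attribution_required": "by" in elements,
        "commercial_use_allowed": "nc" not in elements,
        "derivatives_allowed": "nd" not in elements,
        "share_alike_required": "sa" in elements,
    }
```

Flags like these could feed a dataset card or a per-row column so that NC, ND, and SA conditions are visible to dataset users rather than implied.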
Anyway, everything sounds good to me. The rationale to use HuggingFace makes sense. My only concern moving forward is to ensure that we've covered our bases as far as liability to ourselves and have communicated as effectively as possible to users of the dataset their responsibility in using the dataset.
Nothing else to add on top of what others have shared.
I didn't mean to approve. I think we can expedite this project proposal fairly easily but I do want clarification on my three points before approving.
Drafting this proposal while I move it into the Revision round. Feedback is still welcomed!
I've addressed reviewer comments and this proposal is now ready for a decision.
LGTM! Shall we add a line item to the priorities meeting to discuss the implementation plans? How does this fall in line with the rest of our work? (we can discuss this on the project thread, if it's more involved)
I'm going to leave this open for a few more days with the goal of soliciting further community feedback.
I'll leave some scheduling thoughts on the project thread, @sarayourfriend
@apolinario yesterday I saw that the https://huggingface.co/meta-llama/Llama-2-7b model has an access flow which requires accepting terms and signing up through a Meta-controlled page (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). I am curious: is this type of flow available to datasets as well? We wouldn't necessarily want folks to wait 1-2 days for access, but the idea of more explicitly enforcing our terms for dataset usage (respecting proper usage and whatnot) is appealing.
@zackkrida it is possible for datasets: https://huggingface.co/docs/hub/datasets-gated We could use a small interaction with our Django service (Django rendered page with the form and more flexibility in presentation etc) by using the manual approvals.
As @sarayourfriend noted, gating repos is totally available for datasets! And it comes in two flavours: BUT if you are going to automate the approvals, the other mode of gated access ( It is what Mozilla uses for common voice (example here) - I think that could make sense for this project as well
Thanks, @apolinario! I am going to draft this proposal while we make more efforts to consider our specific use conditions and ethical standards for the dataset, along with how they would relate to the "Gating" functionality.
Sounds good! Here's an idea that I had that I hope could help tackle a few challenges that may arise for use conditions & ethical standards:

Multiple-subsets

Instead of a single Openverse dataset (or two, one for visual media and one for audio), we could create multiple subsets based on license or license-grouping, e.g. (not really name suggestions here, just examples):

All in the same data format, but each could have its own dataset card and its own set of disclaimers and descriptions (and all under your terms of service disclaimers ofc). On one hand, this could make using these datasets for downstream tasks a bit more convoluted/complex (as now one has to engage with/accept terms/process multiple datasets). On the other hand, it would make it very obvious what each dataset could be for, and it could inform downstream users very specifically what they are doing, as they would need to write code that looks like:

```python
from datasets import load_dataset

dataset_cc_by = load_dataset("openverse/images-cc-by", token=True)
dataset_cc_by_nc = load_dataset("openverse/images-cc-by-nc", token=True)
# do downstream tasks processing both
```

That could make it pretty clear that they should not do that if they are looking into doing smth commercial. Although filtering a column by value (e.g.: the license column) enough with the HF This could also co-exist with gating, so each dataset repo could be gated. Btw the
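For comparison, the column-filtering alternative mentioned above could be sketched as follows, assuming a single combined dataset with a `license` column. The repository name `openverse/images` and the column name are hypothetical; `Dataset.filter` and `load_dataset` are the existing `datasets` APIs.

```python
def keep_for_commercial_use(row):
    """Return True when the row's license slug has no NonCommercial element."""
    return "nc" not in row["license"].lower().split("-")


def load_commercial_subset():
    """Load an assumed combined dataset and drop NC-licensed rows."""
    # Imported here so the predicate above can be used and tested on its own.
    from datasets import load_dataset

    # "openverse/images" is a placeholder; token=True uses the locally saved
    # Hugging Face token, which gated repositories require.
    dataset = load_dataset("openverse/images", split="train", token=True)
    return dataset.filter(keep_for_commercial_use)
```

With this approach the license decision lives in one line of user code, which is convenient but easier to overlook than having to opt in to a separately named and separately gated repository.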
I love the several good points captured here. Just restating what has been said, but it's an excellent proposal!
or made easier by the publication of the datasets. This could work in a few
ways. A community member, training a model using the Openverse dataset,
generates metadata that we want and planned to generate ourselves. Then, the
HuggingFace platform presents an alternative to other SaSS products we intended
Is "SaSS" the short version of something? I can't find a different meaning other than the CSS extension language, SASS. Can we add the full form or the meaning in a note/footnote, maybe?
typo for SaaS (software as a service), I'll update to explain.
Co-authored-by: Madison Swain-Bowden
2ea868a to 8f54781
Merging this for now so the document is in our documentation site, even though we are not planning on pursuing it at this time.
Fixes
Related to #2545
Description
This PR adds a project proposal for the Dataset project. I've tried to get this out quickly so it might be a bit rough. Suggestions are very welcome. I've asked @sarayourfriend (for providing past feedback on this initiative) and @AetherUnbound (for general data expertise) to review from the Openverse side.
I would also appreciate insights from @apolinario on the HuggingFace platform: how it relates to this project but also some of its general functionality which I touch on in the proposal.
Decision-making
This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.
Current round
This discussion is currently in the Decision round.
Will be resolved by 2023-07-20.
Testing Instructions
Read the document in GitHub's code view or the generated docs preview.