Project Proposal: Openverse Datasets #2637

zackkrida · 2023-07-12T11:40:15Z

Fixes

Related to #2545

Description

This PR adds a project proposal for the Dataset project. I've tried to get this out quickly so it might be a bit rough. Suggestions are very welcome. I've asked @sarayourfriend (for providing past feedback on this initiative) and @AetherUnbound (for general data expertise) to review from the Openverse side.

I would also appreciate insights from @apolinario on the HuggingFace platform: how it relates to this project but also some of its general functionality which I touch on in the proposal.

Decision-making

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site.

Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

Will be resolved by 2023-07-20.

Testing Instructions

Read the document in GitHub's code view or the generated docs preview.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

github-actions · 2023-07-12T11:56:13Z

Full-stack documentation: https://docs.openverse.org/_preview/2637

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

apolinario · 2023-07-12T17:53:36Z

Heya, I think the project proposal looks great @zackkrida! Thanks for putting it together! I made very minor comments across the doc

how it relates to this project but also some of its general functionality which I touch on in the proposal.

I think the document has a fair summary of the capabilities of the platform relative to this project. I would just add that the streaming feature of datasets makes the data accessible to people who may not have enough storage to hold the entire data dump. That may be helpful as part of the democratisation of this data.

Skylion007 · 2023-07-12T17:56:04Z

+1 on streaming the dataset. It also can allow people to easily and quickly generate various subsets of the data.
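
As a rough illustration of the streaming point, here is a minimal sketch in Python. The repo ID `openverse/images` and the `license` column are assumptions made for this example (no such dataset has been published), and `license_subset` is a hypothetical helper. `load_dataset(..., streaming=True)` from the `datasets` library returns an iterable that yields records lazily, so a subset can be built without storing the full dump.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def license_subset(
    records: Iterable[Dict[str, Any]], license_code: str, limit: int
) -> Iterator[Dict[str, Any]]:
    """Lazily yield up to `limit` records whose `license` field matches."""
    # The `license` field name is an assumption about the dataset schema.
    matching = (r for r in records if r.get("license") == license_code)
    return islice(matching, limit)


def stream_openverse_images(repo_id: str = "openverse/images"):
    """Open a (hypothetical) Hub dataset in streaming mode.

    Nothing is downloaded up front; records are fetched as they are iterated.
    """
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)
```

With the library installed, `license_subset(stream_openverse_images(), "by", 1000)` would read only as many records as it needs to find 1,000 matches.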

AetherUnbound

This project plan looks excellent, thank you for drafting it Zack! My notes are merely surface level/wording, I'm aligned with everything else here 🙂

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

sarayourfriend

This looks great. I'd like clarification and/or additional information about the following points:

Licensors, from my perspective, are also stakeholders. Respecting their intentions and properly communicating the usage conditions is especially important for a project where every single work has an explicit license. Noting specific license elements like NC, ND, and SA that have known nuances in the distribution of the dataset feels important enough to list as a requirement.
On the other hand, "licensors" are not the only stakeholders from that perspective, and PDM works present a significant complication in this regard. The complication is partly a regional legal one and partly about the communication of cultural artefacts: institutions distributing PDM-marked works based on those artefacts have obtained and distributed them without consultation or other involvement of the cultures the artefacts were taken from. Openverse can't do much to fix the underlying problems, but we do need to take care to limit our liability in this regard, the same way we do in our general terms of service. Should the dataset have the same terms of service applied? I imagine a specific terms of service/disclaimer worded directly for the dataset would better protect the project.
Does the first implementation plan also include the documentation updates you mentioned? Can that be listed as an explicit requirement? It is easy to miss something like that when writing an implementation plan that is sure to already be significant in other respects.

Anyway, everything sounds good to me. The rationale to use HuggingFace makes sense. My only concern moving forward is to ensure that we've covered our bases as far as liability to ourselves and have communicated as effectively as possible to users of the dataset their responsibility in using the dataset.

Nothing else to add on top of what others have shared.

I didn't mean to approve. I think we can expedite this project proposal fairly easily but I do want clarification on my three points before approving.

zackkrida · 2023-07-17T14:46:20Z

Drafting this proposal while I move it into the Revision round. Feedback is still welcomed!

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

zackkrida · 2023-07-17T20:44:21Z

I've addressed reviewer comments and this proposal is now ready for a decision.

sarayourfriend

LGTM! Shall we add a line item to the priorities meeting to discuss the implementation plans? How does this fall in line with the rest of our work? (we can discuss this on the project thread, if it's more involved)

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

zackkrida · 2023-07-18T15:27:48Z

I'm going to leave this open for a few more days with the goal of soliciting further community feedback.

zackkrida · 2023-07-18T18:57:14Z

I'll leave some scheduling thoughts on the project thread, @sarayourfriend

zackkrida · 2023-07-20T22:07:52Z

@apolinario yesterday I saw that the https://huggingface.co/meta-llama/Llama-2-7b model has an access flow which requires accepting terms and signing up through a Meta-controlled page: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

I am curious: is this type of flow available to datasets, as well? We wouldn't necessarily want folks to wait 1-2 days for access, but the idea of more-explicitly enforcing our terms for dataset usage (respecting proper usage and whatnot) is appealing.

sarayourfriend · 2023-07-21T00:38:44Z

@zackkrida it is possible for datasets: https://huggingface.co/docs/hub/datasets-gated

We could add a small interaction with our Django service (a Django-rendered page with the form, which gives more flexibility in presentation, etc.) by using the manual approvals.

apolinario · 2023-07-24T13:51:12Z

As @sarayourfriend noted, gating repos is totally available for datasets!

And it comes in two flavours. The first is manual approval, which restricts access to people who have filled in the form and been approved manually (or via approvals automated through a Django service).

BUT if you are going to automate the approvals, the other mode of gated access (automatic approval) would make more sense imo. Basically, this requires people to read and accept the information before they can use the dataset, but everyone who does is accepted automatically and can use it.

It is what Mozilla uses for common voice (example here) - I think that could make sense for this project as well
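
For comparison, automating the manual-approval flavour from a service might look roughly like the sketch below. It assumes an `api` object that behaves like `huggingface_hub.HfApi`, in particular its `list_pending_access_requests` and `accept_access_request` methods; `approve_pending_requests` and the `is_allowed` callback are hypothetical names introduced for illustration, and the policy inside `is_allowed` is left open.

```python
from typing import Callable, List


def approve_pending_requests(
    api, repo_id: str, is_allowed: Callable[[str], bool]
) -> List[str]:
    """Accept pending access requests for users that pass `is_allowed`.

    `api` is assumed to behave like `huggingface_hub.HfApi`; it is passed in
    so the function can be exercised with a stand-in object.
    """
    accepted = []
    for request in api.list_pending_access_requests(repo_id, repo_type="dataset"):
        # `is_allowed` is a placeholder for whatever check the service applies,
        # e.g. "this user accepted our terms on the Django-rendered form".
        if is_allowed(request.username):
            api.accept_access_request(
                repo_id, user=request.username, repo_type="dataset"
            )
            accepted.append(request.username)
    return accepted
```

With automatic approval none of this is needed, which is part of why that mode looks simpler for this project.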

zackkrida · 2023-07-24T17:53:10Z

Thanks, @apolinario! I am going to draft this proposal while we make more efforts to consider our specific use conditions and ethical standards for the dataset, along with how they would relate to the "Gating" functionality.

apolinario · 2023-07-24T22:25:38Z

Sounds good! Here's an idea I had that I hope could help tackle a few challenges that may arise for use conditions & ethical standards:

Multiple-subsets

Instead of a single Openverse dataset (or two, one for visual media and one for audio), we could create multiple subsets based on license or license grouping, e.g. (not really name suggestions, just examples):

openverse/images-cc-by
openverse/images-cc-by-nc
openverse/images-cc0
etc.

All in the same data format, but each could have its own dataset card and its own set of disclaimers and descriptions (and all under your terms of service disclaimers ofc). On one hand, this could make using these datasets for downstream tasks a bit more convoluted/complex (as one now has to engage with, accept terms for, and process multiple datasets). On the other hand, it would make it very obvious what each dataset could be used for, and it could inform downstream users very specifically about what they are doing, as they would need to write code that looks like:

from datasets import load_dataset
dataset_cc_by = load_dataset("openverse/images-cc-by", token=True)
dataset_cc_by_nc = load_dataset("openverse/images-cc-by-nc", token=True)
# do downstream tasks processing both

That could make it pretty clear that they should not do that if they are looking into doing something commercial. Filtering a column by value (e.g. the license column) is easy enough with the HF datasets library, but maybe this is a way to make it even more explicit and understood even from reading the code, and not only by reading the model card.
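
To make the per-license idea concrete, here is a small sketch of how a downstream user might pick only the repos suitable for commercial work. Everything in it is hypothetical: the license codes, the `openverse/<media>-<license>` naming (borrowed from the examples above), and the helper names. It also treats NC as the only element that matters, which ignores the ND and SA nuances raised earlier in this thread.

```python
from typing import List

# Hypothetical license codes; PDM and other marks are left out for brevity.
LICENSES = ["cc0", "by", "by-sa", "by-nc", "by-nd", "by-nc-sa", "by-nc-nd"]


def repo_id(license_code: str, media: str = "images") -> str:
    """Build a per-license repo ID such as "openverse/images-cc-by-nc"."""
    suffix = license_code if license_code == "cc0" else f"cc-{license_code}"
    return f"openverse/{media}-{suffix}"


def allows_commercial_use(license_code: str) -> bool:
    """Treat NC ("NonCommercial") as the element that rules out commercial use."""
    return "nc" not in license_code.split("-")


def commercial_repos(media: str = "images") -> List[str]:
    """List the repo IDs whose license has no NC element."""
    return [repo_id(code, media) for code in LICENSES if allows_commercial_use(code)]
```

Each ID returned by `commercial_repos()` would then be passed to `load_dataset` individually, so the set of licenses in play stays visible in the code.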

This could also co-exist with gating, so each dataset repo could be gated. Btw, the load_dataset function gives a 403 error if the dataset is gated and your HF user has not been granted access.

krysal

I love the several good points captured here. Just restating what has been said, but it's an excellent proposal!

krysal · 2023-07-26T00:15:49Z

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

+or made easier by the publication of the datasets. This could work in a few
+ways. A community member, training a model using the Openverse dataset,
+generates metadata that we want and planned to generate ourselves. Then, the
+HuggingFace platform presents an alternative to other SaSS products we intended


Is "SaSS" the short version of something? I can't find a different meaning other than the CSS extension language, SASS. Can we add the full form or the meaning in a note/footnote, maybe?

typo for SaaS (software as a service), I'll update to explain.

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

Co-authored-by: Madison Swain-Bowden <[email protected]>

AetherUnbound · 2023-08-17T17:47:56Z

Merging this for now so the document is in our documentation site, even though we are not planning on pursuing it at this time.

zackkrida requested a review from a team as a code owner July 12, 2023 11:40

zackkrida requested review from AetherUnbound, stacimc and sarayourfriend and removed request for stacimc July 12, 2023 11:40

github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label Jul 12, 2023

openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Jul 12, 2023

apolinario reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

apolinario reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

apolinario reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

apolinario reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

Skylion007 reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

apolinario reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

Skylion007 reviewed Jul 12, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

AetherUnbound approved these changes Jul 13, 2023

sarayourfriend previously approved these changes Jul 13, 2023

zackkrida mentioned this pull request Jul 13, 2023

Providing and maintaining Openverse media datasets #2545

Closed

2 tasks

AetherUnbound mentioned this pull request Jul 14, 2023

Add a list of (changed) rendered files to the "docs emitted" PR comment #2646

Closed

zackkrida marked this pull request as draft July 17, 2023 14:47

Skylion007 reviewed Jul 17, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

sarayourfriend approved these changes Jul 18, 2023

sarayourfriend reviewed Jul 18, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

zackkrida commented Jul 18, 2023

documentation/projects/proposals/publish_dataset/20230706-project_proposal.md

zackkrida marked this pull request as draft July 24, 2023 17:53

krysal approved these changes Jul 26, 2023

zackkrida and others added 9 commits August 17, 2023 09:59

WIP

3f70e43

WIP

c048d87

WIP

068ae0e

Apply suggestions from code review

c68194e

Co-authored-by: Madison Swain-Bowden <[email protected]>

format after changes

9a5d598

Deduplication cleanup

ff2b6ae

Clarify initial dump considerations

a9aba74

Clarification from Saras review

1adbaa2

Apply suggestions from code review

8f54781

AetherUnbound force-pushed the dataset-project-proposal branch from 2ea868a to 8f54781 Compare August 17, 2023 17:01

AetherUnbound added 3 commits August 17, 2023 10:04

Fix some typos

6f5951a

Add note about pausing efforts

f90ec82

Rename, standardize title

923123d

AetherUnbound marked this pull request as ready for review August 17, 2023 17:16

AetherUnbound merged commit d8574af into main Aug 17, 2023
48 checks passed

AetherUnbound deleted the dataset-project-proposal branch August 17, 2023 17:48
