Providing and maintaining Openverse media datasets #2545

zackkrida · 2023-07-04T17:49:18Z

Description

This project aims to publish and regularly update datasets for each of Openverse's media types, currently image and audio. We aim to provide access to the data currently served by our API, but which is difficult and costly to access in full.

This project aims to:

1. Increase data access to the Openverse dataset for academics, researchers, and other interested users
2. Reduce scraping of the Openverse API and frontend caused by lack of access
3. Collect downstream metadata whose generation is made possible by the ease of access to the dataset, for example:
   - Machine captions
   - Translation of metadata
   - Deduplication data

Through conversations with and offers from @Skylion007 and @apolinario we've identified HuggingFace as a home for the initial, raw metadata dump. They've graciously agreed to help and/or publish complementary datasets.

This project will need a proposal, and at least one implementation plan for producing the initial dump and/or establishing a mechanism for creating yearly/every-six-month dumps of the Openverse dataset.

Documents

Project Proposal: Openverse Datasets #2637
Implementation Plan(s)

Issues

Prior Art

sarayourfriend · 2023-07-06T22:35:28Z

I got to chat with Zack about this today and have three requests for HuggingFace that I think are worthwhile conditions to publish alongside the dataset:

Do not use "no-derivatives" (ND) licensed works to create derivative works. That's redundant with the license text, but because this is still a contentious point with respect to machine learning models, I think it's worth making an explicit note about this. Openverse includes hundreds of millions of works and only a fraction are licensed with ND. Because every single work catalogued by Openverse includes explicit license terms that allow for specific usage, anyone creating something that might even questionably (from various perspectives) be considered a derivative work (like an ML model) can just exclude ND licensed works, probably to no detriment of the model. And if it does hurt the model, then great, that's excellent information to know for model creators and they should be prepared to acknowledge that problem and ready to deal with the consequences of it. Specifically, are they really ready to explicitly rely on works to create models that creators have likewise explicitly said they do not wish to have derivatives made from? In any case, if model creators simply do not use ND works, which Openverse's dataset makes trivial to do because we distribute fine-grained license information with every work in our catalogue, it would go a long way to garnering good faith from folks who've openly licensed their works.
Similarly, "non-commercial" (NC) licensed works should not be used for models created to be used in commercial contexts. Those models should likewise be licensed to explicitly disallow commercial use (just as a work derivative of an NC licensed work would be). I'm not as clear on this as the ND stuff, as in, I'm not sure if derivative works based on NC licensed works can themselves be used commercially. In any case, as above, it's probably safe to say that not using NC licensed works for models that will be used commercially should not hurt the models, considering that NC licensed works are only a fraction of the hundreds of millions of openly licensed works in the Openverse catalogue.
Finally, an attribution page listing every single work used in the creation of the model. Attribution is one of the basic parts of CC licenses other than CC0. Every work in Openverse's catalogue other than the CC0 and PDM works requires attribution. In the spirit of good faith, providing attribution for the works the model is trained on should be trivial (compared to the work of training the model). Yes, it would be a massive page of text, but that itself would also drive home the provenance of such models and the fact that they can only exist because huge numbers of artists and creators generally have decided to license their works openly: a tremendous thing that can be celebrated, but only if it is explicitly noted.
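
A minimal sketch of how these requests could be applied to a metadata dump, assuming each record carries a license slug such as `by`, `by-nd`, `by-nc-sa`, `cc0`, or `pdm`. The record fields (`title`, `creator`, `url`, `license`), the helper names, and the sample records are hypothetical and chosen for illustration; they are not a confirmed schema for the dataset.

```python
from typing import Iterable, Iterator

# Hypothetical record shape: a dict with "title", "creator", "url", "license".
Record = dict[str, str]

# Slugs assumed to need no attribution (CC0 and the Public Domain Mark).
NO_ATTRIBUTION = {"cc0", "pdm"}


def license_terms(slug: str) -> set[str]:
    """Split a license slug such as "by-nc-nd" into its individual terms."""
    return set(slug.lower().split("-"))


def usable_for_training(record: Record, commercial: bool) -> bool:
    """Exclude ND works always, and NC works when the model is commercial."""
    terms = license_terms(record["license"])
    if "nd" in terms:
        return False
    if commercial and "nc" in terms:
        return False
    return True


def filter_records(records: Iterable[Record], commercial: bool) -> Iterator[Record]:
    """Yield only the records that pass the license check above."""
    return (r for r in records if usable_for_training(r, commercial))


def attribution_lines(records: Iterable[Record]) -> list[str]:
    """Build one attribution line per work whose license requires it."""
    lines = []
    for r in records:
        slug = r["license"].lower()
        if slug in NO_ATTRIBUTION:
            continue
        lines.append(f'"{r["title"]}" by {r["creator"]} (CC {slug.upper()}) {r["url"]}')
    return lines


# Made-up records for illustration only.
sample = [
    {"title": "A", "creator": "Ann", "url": "https://example.org/a", "license": "by"},
    {"title": "B", "creator": "Bo", "url": "https://example.org/b", "license": "by-nd"},
    {"title": "C", "creator": "Cy", "url": "https://example.org/c", "license": "by-nc-sa"},
    {"title": "D", "creator": "Di", "url": "https://example.org/d", "license": "cc0"},
]

for line in attribution_lines(filter_records(sample, commercial=True)):
    print(line)
```

With `commercial=True` the sketch drops both the ND and NC sample works; with `commercial=False` only the ND work is dropped. Whether a given use counts as a derivative work is a legal question this filter does not answer.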

apolinario · 2023-07-06T23:19:01Z

Excellent points @sarayourfriend!

Fully in agreement that all these 3 factors are things that have to be absolutely considered for any downstream applications of the dataset, including for training models (and it could also inform the creation of sub-sets that are). It also merits the points from @Skylion007 on the basis of creating and maintaining image mirrors and sub-sets for benchmarking and training models.

Regarding how best to communicate these points to the dataset users on the Hugging Face Hub, this is usually done in the Dataset Card. I'd be more than happy to work with you on drafting a dataset card to go online with the model. Here's more information on dataset cards and an example.

zackkrida · 2023-07-06T23:19:25Z

@sarayourfriend thanks for capturing the chat here!

Here's a sample dataset on huggingface (@Skylion007's dataset, no less!), which should be a decent representation of what ours would look like:

https://huggingface.co/datasets/openwebtext

I imagine much of what we chatted about could be included there as documentation on the dataset, so that anyone who trains a model using it can adhere to our recommendations.

It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more.

@apolinario, any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?

sarayourfriend · 2023-07-06T23:41:55Z

> It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more.

Are HuggingFace datasets all meant for model training? If so, then we can proactively exclude them under that principle. If it is the case that HuggingFace is not meant to host datasets that cannot be used for model training, then I think we should also seek an additional home for the full dataset. Our aggregated catalogue of openly licensed works is useful for things other than model training that wouldn't fall under "derivative works" even sceptically (like research into how certain license conditions are applied, to name one basic thing). Work to produce even a subset of our dataset in a consumable format would undoubtedly benefit a whole-dataset version to be published elsewhere with usage restrictions, so aside from needing to find an additional home for that complete dataset, I don't think this causes problems with the plan generally.

apolinario · 2023-07-06T23:44:20Z

> any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?

In my opinion, they should not be excluded from the dataset entirely for a few reasons:

It could lose its reference as "the Openverse database dump", which could affect objectives 1 and 2 of the project aims. Not everyone may want to use the dataset to train ML models. Some people may want to use it as a dump of Openverse for any downstream compliant use-cases of the content with the ND license (e.g.: a bias analysis on CC content), and use the HF Datasets format as a convenient way to do so in case they don't have 4TB to store the data in a more traditional "dump" format.
Some ML tasks could be performed on ND works that would not, imo (not legal advice), be considered derivative works. For example, one could build a "semantic search" ML application over the entire Openverse dump - that could potentially be contributed upstream to improve the search function of Openverse - a process like this would probably be useful to search across all images and likely not violate the ND license.

I believe a way the dataset card could be framed is that this dataset is as close as possible to a raw data dump of Openverse metadata - and that it should not be used as-is to train models without taking into consideration the licenses - and all the points made above (and other points we may find useful). With time, curated, filtered and deduped datasets more tailored to training models will emerge, and if that is of interest to Openverse, more aligned ones could even be listed as references.

> Are HuggingFace datasets all meant for model training?

Not really - the HF Datasets library and Hub can be used for broader use-cases than just model training, and I think they could be a home for the entire dump, as long as safeguards are put in place to make sure people don't see it as simply a full dataset that can be used as-is to train a model without taking into consideration the particularities of each license (and I believe we have tools to put those points up clearly).

sarayourfriend · 2023-07-06T23:47:03Z

That sounds great @apolinario, thanks for the insight! That approach sounds ideal for everyone while still doing as much as we can to communicate the intention of licensors.

Skylion007 · 2023-07-09T18:38:58Z

Totally agree with all the points made by @apolinario. We are planning to put together a paper describing how the Openverse infrastructure works, how the data is hosted, and how the content is created. HF Datasets are useful for a variety of data science applications, particularly when they properly support streaming etc.

I would also like to add, as an additional objective, that I would like to reduce scraping of the data hosts as much as possible by including all the images that we can safely, legally redistribute.

julien-c · 2023-07-10T18:45:04Z

> Are HuggingFace datasets all meant for model training?

also want to mention that many datasets (or subsets of datasets) hosted on HF are meant for evaluation not training, notably, all the eval or test splits.

And for instance we apply `<meta name="robots" content="noindex,noai,noimageai" />` on those (in addition to human readable disclaimers) so that automated scraping tools hopefully do not lead to model training on them.
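
A minimal sketch of how a scraping tool could honor that tag before collecting a page for training, using only the Python standard library. The function names and the sample page are hypothetical. Note that `noai` and `noimageai` are, as far as I know, an informal convention rather than part of a formal standard, so honoring them is voluntary on the tool's side.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        self.directives.update(
            d.strip().lower() for d in content.split(",") if d.strip()
        )


def robots_directives(html: str) -> set[str]:
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_train_on(html: str) -> bool:
    """False when the page opts out of AI use via noai or noimageai."""
    return not ({"noai", "noimageai"} & robots_directives(html))


# Sample page using the tag quoted above.
page = '<html><head><meta name="robots" content="noindex,noai,noimageai" /></head></html>'
print(sorted(robots_directives(page)))
print(may_train_on(page))
```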

zackkrida · 2023-07-10T20:56:12Z

Update: This week I am wrapping up the project proposal and implementation plan. The implementation plan PR will get into the technical aspects of how exactly we create the dataset, which fields to include, the preferred output format, and preferred transfer mechanism. I will solicit comments on that PR while it is a draft so I can incorporate advice from community contributors.

openverse-bot · 2023-07-25T00:25:14Z

Hi @zackkrida, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

openverse-bot · 2023-08-09T00:24:04Z

Hi @zackkrida, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2023-08-17T16:50:38Z

Last week, many of the core maintainers of Openverse had the opportunity to meet together synchronously and discuss the efforts around publishing the Openverse dataset. After much discussion, the maintainer team has decided to pause work on this effort for the time being for a few reasons:

Engagement with the community - As we progressed with the project, we realized the paramount importance of deeper collaboration with the broader community, especially experts in the Creative Commons, licensing, and open data domains. We are committed to making decisions that uphold the rights and interests of creators. By pausing, we are taking the necessary time to consult, collaborate, and ensure that the release of this dataset is both appropriate and beneficial, and that it doesn't inadvertently negatively impact creators. This collaborative approach ensures that we are not only aligned with technical standards but also with the ethos and values that the broader community cherishes.
Focus on the core Openverse users - At its heart, Openverse is primarily a search engine. Our central mission has always been to offer an excellent search experience for individuals seeking to discover and utilize creators' works. We believe it is crucial to preserve and enrich the simplicity, effectiveness, and user-centric nature of our search experience. Hence, our commitment remains steadfast in directing our resources and efforts towards projects and features that elevate and refine the core experience of Openverse users: finding openly licensed media for a myriad of use-cases.
Concerns over dataset governance, access, management, and upkeep - The complexity and intricacies related to ensuring proper governance, access control, robust management, and regular upkeep of such an expansive dataset have come to light. We are committed to upholding the highest standards in data stewardship, and at present we do not have the capacity to address any issues that may come up. For instance: What if a creator wants to be removed from the dataset? How do we establish access restrictions that prevent erroneous use of certain licenses? How do we handle reports of sensitive content within the dataset? How do we address reports about incorrect licenses for certain creations? The effort necessary to be appropriate stewards of this data is non-trivial, and we do not want to discount the level of work this might require.
Project prioritization - On assessment of our currently ongoing projects and the plans we had scoped for the rest of the year, we believe that there are other initiatives that currently demand our attention and resources. In pursuing the scoping for the dataset creation, we put on-hold projects that would improve user safety and enable new ways to browse works within Openverse. We also recently revised our expected roadmap for 2023 to reflect the reduced availability of some of our maintainers through the end of the year. Reprioritizing in this way ensures that Openverse continues to evolve and serve its community effectively, and we are redirecting our focus to areas which we believe will bring more immediate value to our users.

We are deeply appreciative of the enthusiasm and support from the community around the Openverse dataset project, particularly @apolinario, @Skylion007, and others. Our decision to pause is by no means an end, but rather a strategic recalibration to ensure we deliver only the best for our community. We look forward to continuing to work with the community members we're currently in collaboration with, along with other institutions operating in a similar space and under related principles.

zackkrida added the 🧭 project: thread label Jul 4, 2023

github-project-automation bot added this to Openverse Project Tracker Jul 4, 2023

github-project-automation bot moved this to Not Started in Openverse Project Tracker Jul 4, 2023

openverse-bot added this to Openverse Backlog Jul 4, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Jul 4, 2023

zackkrida moved this from Not Started to In Kickoff in Openverse Project Tracker Jul 5, 2023

zackkrida self-assigned this Jul 5, 2023

zackkrida changed the title ~~Providing and maintaining an Openverse image dataset~~ Providing and maintaining an Openverse media datasets Jul 6, 2023

zackkrida changed the title ~~Providing and maintaining an Openverse media datasets~~ Providing and maintaining Openverse media datasets Jul 6, 2023

This was referenced Jul 12, 2023

Allow entire dataset to be downloaded en-masse #669

Closed

Project Proposal: Openverse Datasets #2637

Merged

This was referenced Jul 18, 2023

Implementation Plan: Initial data dump creation #2669

Closed

Implementation Plan: Dataset maintenance #2670

Closed

AetherUnbound mentioned this issue Aug 15, 2023

Delete previous project reminder comment before issuing a new one #2836

Closed

AetherUnbound assigned AetherUnbound and unassigned zackkrida Aug 17, 2023

AetherUnbound mentioned this issue Aug 17, 2023

Implementation Plan: Initial data dump creation #2702

Closed

5 tasks

AetherUnbound moved this from In Kickoff to Not slated for 2023 in Openverse Project Tracker Aug 17, 2023

AetherUnbound closed this as not planned Oct 25, 2023

github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Oct 25, 2023

github-project-automation bot moved this from Not slated for 2023 to Shipped in Openverse Project Tracker Oct 25, 2023

AetherUnbound moved this from Shipped to Not slated for 2023 in Openverse Project Tracker Oct 25, 2023

AetherUnbound removed this from Openverse Project Tracker Dec 19, 2023

AetherUnbound removed this from Openverse Backlog Dec 19, 2023
