Providing and maintaining Openverse media datasets #2545
Comments
I got to chat with Zack about this today and have two requests for HuggingFace that I think are worthwhile conditions to publish alongside the dataset:
Excellent points @sarayourfriend! I fully agree that all three of these factors absolutely have to be considered for any downstream applications of the dataset, including training models (and they could also inform the creation of subsets that are intended for training). This also ties into the points from @Skylion007 about creating and maintaining image mirrors and subsets for benchmarking and training models. As for how best to communicate these points to dataset users on the Hugging Face Hub, this is usually done in the Dataset Card. I'd be more than happy to work with you on drafting a dataset card to go online with the dataset. Here's more information on dataset cards and an example.
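For illustration, here is a minimal sketch of what such a card could look like when created with the `huggingface_hub` library. The repo id, front-matter values, and wording are all hypothetical; a real card would carry the full set of caveats discussed in this thread.

```python
from huggingface_hub import DatasetCard

# A minimal, illustrative dataset card that puts the licensing caveats front
# and center. The YAML front matter and body text are placeholders.
card_text = """---
license: other
pretty_name: Openverse Image Metadata (illustrative)
---
# Openverse Image Metadata

A raw dump of Openverse catalogue metadata. Every record carries its own
Creative Commons license and attribution requirements; consult the `license`
and `license_version` fields before any downstream use, and do not treat the
dump as training-ready as-is.
"""

card = DatasetCard(card_text)
# card.push_to_hub("openverse/openverse-image-metadata")  # hypothetical repo id
```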
@sarayourfriend thanks for capturing the chat here! Here's a sample dataset on Hugging Face (@Skylion007's dataset, no less!), which should be a decent representation of what ours would look like: https://huggingface.co/datasets/openwebtext I imagine much of what we chatted about could be included there as documentation on the dataset, so that anyone who trains a model using it can adhere to our recommendations. It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more. @apolinario, any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?
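As a quick sketch of how the linked dataset is consumed, the `datasets` library can inspect a Hub dataset's schema and description without downloading it; exact arguments may vary by library version, and an Openverse dump published the same way would use its own repo id.

```python
from datasets import load_dataset_builder

# Inspect the openwebtext dataset referenced above without downloading the
# full data. An Openverse metadata dump on the Hub would be inspected the
# same way, under its own (hypothetical) repo id.
builder = load_dataset_builder("openwebtext")
print(builder.info.description)
print(builder.info.features)
```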
Are HuggingFace datasets all meant for model training? If so, then we can proactively exclude ND works under that principle. If it is the case that HuggingFace is not meant to host datasets that cannot be used for model training, then I think we should also seek an additional home for the full dataset. Our aggregated catalogue of openly licensed works is useful for things other than model training that wouldn't fall under "derivative works" even sceptically (like research into how certain license conditions are applied, to name one basic thing). The work to produce even a subset of our dataset in a consumable format would undoubtedly carry over to a whole-dataset version published elsewhere with usage restrictions, so aside from needing to find an additional home for that complete dataset, I don't think this causes problems with the plan generally.
In my opinion, they should not be excluded from the dataset entirely for a few reasons:
I believe a way the dataset card could be framed is that this dataset is as close as possible to a raw data dump of Openverse metadata, and that it should not be used as-is to train models without taking the licenses into consideration, along with all the points made above (and other points we may find useful). With time, curated, filtered, and deduplicated datasets more tailored to training models will emerge, and if that is of interest to Openverse, more aligned ones could even be listed as references.
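As a rough sketch of what such a curated, training-oriented subset could look like, assuming the dump exposes a `license` column with values such as "by", "by-sa", and "by-nd" (the repo id, column name, and values are all assumptions about the eventual schema):

```python
from datasets import load_dataset

# Hypothetical repo id and schema; the real dump may differ.
raw = load_dataset("openverse/openverse-image-metadata", split="train")

# Drop NoDerivatives (ND) records before any model-training use, per the
# caveats discussed in this thread.
trainable = raw.filter(lambda row: "nd" not in row["license"].lower())

print(f"Excluded {len(raw) - len(trainable)} ND records")
```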
Not really - the HF Datasets library and Hub can be used for broader use cases than just model training, and I think it could be a home for the entire dump, as long as safeguards are put in place to make sure people don't see it as simply a full dataset that can be used as-is to train a model without taking into consideration the particularities of each license (and I believe we have tools to surface those points clearly)
That sounds great @apolinario, thanks for the insight! That approach sounds ideal for everyone while still doing as much as we can to communicate the intention of licensors.
Totally agree with all the points made by @apolinario. We are planning to put together a paper describing how the Openverse infrastructure works, how the data is hosted, and how the content is created. HF Datasets are useful for a variety of data science applications, particularly when they properly support streaming etc. As an additional objective, I would also like to reduce scraping load on the data hosts as much as possible by including all the images that we can safely and legally redistribute.
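On the streaming point, a small sketch: the `datasets` library can iterate records from the Hub lazily, so consumers don't need to download the full dump (the repo id below is hypothetical).

```python
from datasets import load_dataset

# Stream records one at a time instead of downloading the whole dump.
stream = load_dataset(
    "openverse/openverse-image-metadata",  # hypothetical repo id
    split="train",
    streaming=True,
)

for i, record in enumerate(stream):
    print(record)  # one metadata row at a time
    if i == 4:
        break
```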
I also want to mention that many datasets (or subsets of datasets) hosted on HF are meant for evaluation, not training.
Update: This week I am wrapping up the project proposal and implementation plan. The implementation plan PR will get into the technical aspects of how exactly we create the dataset, which fields to include, the preferred output format, and preferred transfer mechanism. I will solicit comments on that PR while it is a draft so I can incorporate advice from community contributors.
Hi @zackkrida, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Last week, many of the core maintainers of Openverse had the opportunity to meet together synchronously and discuss the efforts around publishing the Openverse dataset. After much discussion, the maintainer team has decided to pause work on this effort for the time being for a few reasons:
We are deeply appreciative of the enthusiasm and support from the community around the Openverse dataset project, particularly @apolinario, @Skylion007, and others. Our decision to pause is by no means an end, but rather a strategic recalibration to ensure we deliver only the best for our community. We look forward to continuing to work with the community members we're currently in collaboration with, along with other institutions operating in a similar space and under related principles.
Description
This project aims to publish and regularly update datasets for each of Openverse's media types, currently image and audio. We aim to provide access to the data currently served by our API, but which is difficult and costly to access in full.
This project aims to:
Through conversations with, and offers from, @Skylion007 and @apolinario, we've identified HuggingFace as a home for the initial, raw metadata dump. They've graciously agreed to help and/or publish complementary datasets.
This project will need a proposal, and at least one implementation plan for producing the initial dump and/or establishing a mechanism for creating yearly/every-six-month dumps of the Openverse dataset.
Documents
Issues
Prior Art