From 3f70e4387609a260e9ee6f34797e56c1436a2cf6 Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Fri, 7 Jul 2023 11:21:17 -0400 Subject: [PATCH 01/12] WIP --- .../20230706-project_proposal.md | 86 +++++++++++++++++++ .../proposals/publish_dataset/index.md | 8 ++ 2 files changed, 94 insertions(+) create mode 100644 documentation/projects/proposals/publish_dataset/20230706-project_proposal.md create mode 100644 documentation/projects/proposals/publish_dataset/index.md diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md new file mode 100644 index 00000000000..f8eb721034a --- /dev/null +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -0,0 +1,86 @@ +# Project Proposal: Providing and maintaining an Openverse image dataset + +**Author**: @zackkrida + +## Reviewers + + + +- [ ] @AetherUnbound +- [ ] @stacimc + +## Project summary + +This project aims to publish and regularly update a dataset of Openverse' image +metadata. This project will provide access to the data currently served by our +API, but which is difficult to access in full and requires significant time and +resources. + +## Motivation + +For this project, understanding "why?" we would do this is of paramount +importance. There are several key philosophical and technical advantages to +sharing this dataset with the public. + +### Open Access + +Openverse is and always has been, since its days as +["CC Search"](https://creativecommons.org/2021/12/13/dear-users-of-cc-search-welcome-to-openverse/), +informed by principles of the open access movement. Openverse has strived to +remove all all financial, technical, and legal barriers to accessing the works +of the global commons. + +Due to technical and logistical limitations, we have previously been unable to +accessibly provide access to the full Openverse dataset. Currently, users need +to invest significant time and money into scraping the Openverse API. These +financial and technical barriers to our users are deeply inequitable. + +By sharing this data as a full dataset on +[HuggingFace](https://huggingface.co/datasets), we can remove these barriers to +access and allow folks to access the data provided by Openverse.org and the +Openverse API without restriction. + +The metadata for openly-licensed media, used by Openverse to power the API and +frontend, is a community utility and should be available to all users, +distinctly from Openverse itself. By publishing this dataset, we ensure access +to this data is fast, accessible, and resilient. + +## Goals + + + +This project encompasses all of our 2023 "lighthouse goals", but "Community +Development" is perhaps the broadest and most relevant. Others touched on here, +or impacted through potential downstream changes, are "Result Relevancy", +"Quantifying our Work","Search Experience", "Content Safety", and "Data +inertia". + +I'll explain _how_ this project may impact those goals in subsequent sections. 
+ +## Requirements + + + +## Success + + + +## Participants and stakeholders + + + +## Infrastructure + + + +## Accessibility + + + +## Marketing + + + +## Required implementation plans + + diff --git a/documentation/projects/proposals/publish_dataset/index.md b/documentation/projects/proposals/publish_dataset/index.md new file mode 100644 index 00000000000..dfce782263b --- /dev/null +++ b/documentation/projects/proposals/publish_dataset/index.md @@ -0,0 +1,8 @@ +# Providing and maintaining an Openverse image dataset + +```{toctree} +:titlesonly: +:glob: + +* +``` From c048d87c08fc1a67ccba1eee326629943cf6abbf Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Tue, 11 Jul 2023 17:31:10 -0400 Subject: [PATCH 02/12] WIP --- .../20230706-project_proposal.md | 109 +++++++++++++++--- 1 file changed, 91 insertions(+), 18 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index f8eb721034a..a6bec883afa 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -11,10 +11,10 @@ ## Project summary -This project aims to publish and regularly update a dataset of Openverse' image -metadata. This project will provide access to the data currently served by our -API, but which is difficult to access in full and requires significant time and -resources. +This project aims to publish and regularly update a dataset or _datasets_ of +Openverse' media metadata. This project will provide access to the data +currently served by our API, but which is difficult to access in full, requires +significant time, money, and compute resources. ## Motivation @@ -22,28 +22,77 @@ For this project, understanding "why?" we would do this is of paramount importance. There are several key philosophical and technical advantages to sharing this dataset with the public. -### Open Access +### Open Access + Scraping Prevention Openverse is and always has been, since its days as ["CC Search"](https://creativecommons.org/2021/12/13/dear-users-of-cc-search-welcome-to-openverse/), -informed by principles of the open access movement. Openverse has strived to -remove all all financial, technical, and legal barriers to accessing the works -of the global commons. +informed by principles of the open access movement. Openverse strives to remove +all all financial, technical, and legal barriers to accessing the works of the +global commons. Due to technical and logistical limitations, we have previously been unable to -accessibly provide access to the full Openverse dataset. Currently, users need -to invest significant time and money into scraping the Openverse API. These -financial and technical barriers to our users are deeply inequitable. +accessibly provide access to the full Openverse dataset. Today, users need to +invest significant time and money into scraping the Openverse API in order to +access this data. These financial and technical barriers to our users are deeply +inequitable. Additionally, this scraping disrupts Openverse access and stability +for all users. It also requires significant maintainer effort to identify, +mitigate, and block scraping traffic. 
By sharing this data as a full dataset on -[HuggingFace](https://huggingface.co/datasets), we can remove these barriers to -access and allow folks to access the data provided by Openverse.org and the -Openverse API without restriction. +[HuggingFace](https://huggingface.co/datasets), we can remove these barriers +allow folks to access the data provided by Openverse.org and the Openverse API +without restriction. + +### Contributions Back to Openverse + +Easily accessed Openverse datasets will facilitate easier generation of machine +labels, translations, and other supplemental data which can be used to improve +the experience of Openverse.org and the API. This data is typically generated as +part of the +[data preprocessing](https://huggingface.co/docs/transformers/preprocessing) +stage of model training. + +HuggingFace in particular will enable community members to analyze the dataset +and create supplemental datasets; to train models with the dataset; and to use +the dataset with all of HuggingFace's tooling: the +[Datasets](https://github.com/huggingface/datasets) library in particular. + +It is worth noting that this year we identified many projects to work on which +rely on bulk analysis of Openverse's data. These projects could be replaced by, +or made easier by the publication of the datasets. This could work in a few +ways. A community member, training a model using the Openverse dataset, +generates metadata that we want and planned to generate ourselves. Then, the +HuggingFace platform presents an alternative to other SaSS products we intended +to use to generate machine labels, detect sensitive content, and so on. Instead +of those offerings we use models hosted on the HuggingFace hub. The +[Datasets library](https://huggingface.co/docs/datasets/index) allows for easy +loading of the Openverse dataset in any data pipelines we write. HuggingFace +also offers the ability to deploy production-ready API endpoints for +transformation models hosted on their hub. This feature is called +[Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index). + +#### Potentially-related projects + +- [External API for sensitive content detection #422](https://github.com/WordPress/openverse/issues/422) - + Instead of Rekognition or Google Vision, we may want to use a community model + on HuggingFace. +- [Machine Image Labeling pipeline #426](https://github.com/WordPress/openverse/issues/426) + Either a community member may create an up to date set of machine-generated + labels we could use, or we could again create an Inference Endpoint with an + existing model, for example + [Vision Transformer](https://huggingface.co/google/vit-base-patch16-224) +- [Duplicate image detection #427](https://github.com/WordPress/openverse/issues/427) - + Deduplication is an extremely common preprocessing step. We should be able to + apply the deduplication from + +### Resiliency The metadata for openly-licensed media, used by Openverse to power the API and frontend, is a community utility and should be available to all users, -distinctly from Openverse itself. By publishing this dataset, we ensure access -to this data is fast, accessible, and resilient. +_distinctly_ from Openverse itself. By publishing this dataset, we ensure access +to this data is fast, accessible, and resilient. With published datasets, this +access remains even if Openverse is inaccessible, under attack, or experiencing +any other unforeseen disruptions. 
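+
+As a rough illustration of the kind of access this would unlock, a consumer
+could load the published metadata with the Datasets library mentioned above in
+a few lines. This is only a sketch: the dataset identifier, split, and field
+names below are placeholders, since no Openverse dataset exists on the hub yet.
+
+```python
+# Sketch only: "openverse/openverse-image-metadata" is a placeholder dataset id
+# and the field names are illustrative; nothing here exists yet.
+from datasets import load_dataset
+
+# Stream the dataset so consumers do not have to download the full dump first.
+dataset = load_dataset(
+    "openverse/openverse-image-metadata",  # hypothetical dataset id
+    split="train",
+    streaming=True,
+)
+
+# Inspect a handful of metadata records (illustrative field names).
+for record in dataset.take(5):
+    print(record["identifier"], record["license"], record["url"])
+```
+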
## Goals @@ -55,20 +104,44 @@ or impacted through potential downstream changes, are "Result Relevancy", "Quantifying our Work","Search Experience", "Content Safety", and "Data inertia". -I'll explain _how_ this project may impact those goals in subsequent sections. - ## Requirements +This project requires coordination with HuggingFace to release the dataset on +their platform, bypassing their typical restrictions for Dataset size. + +We will also need to figure out the technical requirements for producing the raw +dataset, which will be done in this project's implementation plan. Additionally +in that plan we will: + +- Determine if we host a single dataset for all media types, or separate + datasets for different media types. +- Develop a plan for updating the datasets regularly +- Refine how we will provide access to our raw dataset for upload to + HuggingFace: + - Delivery mechanism (a requester pays S3 bucket?) + - File format (parquet files? TSVs? a Postgres dump?) + +We will also need to coordinate the launch of these efforts and associated +outreach. See more about that in the [marketing section](#marketing). + ## Success +This project can be considered + ## Participants and stakeholders +- Openverse maintainers +- HuggingFace +- Creative Commons +- Aaron Ghokaslan +- WordPress Foundation + ## Infrastructure From 068ae0e634145e39b674ab14a203bcde754a3437 Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Wed, 12 Jul 2023 07:34:48 -0400 Subject: [PATCH 03/12] WIP --- .../20230706-project_proposal.md | 62 ++++++++++++++++--- 1 file changed, 55 insertions(+), 7 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index a6bec883afa..cff903dc02e 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -7,7 +7,7 @@ - [ ] @AetherUnbound -- [ ] @stacimc +- [ ] @sarayourfriend ## Project summary @@ -130,30 +130,78 @@ outreach. See more about that in the [marketing section](#marketing). -This project can be considered +This project can be considered a success when the dataset is published. Ideally, +we will also observe meaningful usage of the dataset. 
Some ways we might measure +this include: + +- Metrics built into HuggingFace + - Models trained with the Dataset are listed + - Downloads last month + - Likes +- Our dataset trending on HuggingFace +- Additionally, we may also see increased interest in our repositories +- Any positive engagement with our marketing efforts for the project ## Participants and stakeholders -- Openverse maintainers -- HuggingFace -- Creative Commons -- Aaron Ghokaslan -- WordPress Foundation +- Openverse maintainers - Responsible for creating the initial raw data dump, + maintaining the Openverse account and Dataset on HuggingFace +- HuggingFace - A key partner responsible for the initial dataset upload, + providing advice, and potential marketing collaboration +- Creative Commons - Stewards of the Commons and CC Licenses, advisors, and + another partner in marketing promotion +- Aaron Ghokaslan - A researcher working on supplementary datasets and providing + technical advice ## Infrastructure +This project will likely require provisioning some new resources in AWS: + +- A dedicated bucket, perhaps a "Requester pays" bucket, for storing the backups +- New scripts to generate backup artifacts + ## Accessibility +This project doesn't directly raise any accessibility concerns. However, we +should be mindful of any changes we would like to make on Openverse.org relating +to copy edits about this initiative. + +We should also be mindful of any accessibility issues with HuggingFace's user +interface, which we could share with them in an advisory capacity. + ## Marketing +This release will be a big achievement and we should do quite a bit to promote +it: + +- Reach out to past requesters of the dataset and share the HuggingFace link +- Social channel cross-promotion between the WordPress Marketing team, + HuggingFace, and/or Creative Commons +- Post to more tech-minded communities like HackerNews, certain Reddit + communities, etc. + +Additionally, our documentation will need to be updated extensively to inform +users about the Dataset. The API docs, our developer handbook, our docs site, +and potentially Openverse.org should all be update to reflect these changes. + ## Required implementation plans + +- Initial Data Dump Creation - A plan describing how to produce and provide + access to the raw data dumps which will be used to create the Dataset(s). + - This will be the first, largest, and most important plan +- Dataset Maintenance - A plan describing how we will regularly release updates + to the Dataset(s) + +We will also want a plan for how we intend to _use_ the HuggingFace platform to +complete our other projects for the year, but that might fall outside the scope +of this project. 
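+
+To make that idea concrete, here is a hedged sketch of what using a hub-hosted
+model for machine labeling could look like. The model is the Vision Transformer
+linked earlier; the image URL is a placeholder, and a real pipeline would batch
+over Openverse records rather than single files.
+
+```python
+# Sketch only: generates machine labels with a model hosted on the HuggingFace
+# hub. The image URL is a placeholder; a real pipeline would batch many images.
+from transformers import pipeline
+
+classifier = pipeline(
+    task="image-classification",
+    model="google/vit-base-patch16-224",
+)
+
+# The pipeline accepts an image URL, a local path, or a PIL image.
+labels = classifier("https://example.com/openverse-image.jpg", top_k=5)
+for label in labels:
+    print(label["label"], round(label["score"], 3))
+```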
From c68194e5db598ff98dbd630d56a8f71bd3f2e339 Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Mon, 17 Jul 2023 16:21:40 -0400 Subject: [PATCH 04/12] Apply suggestions from code review Co-authored-by: Madison Swain-Bowden --- .../publish_dataset/20230706-project_proposal.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index cff903dc02e..c966740fa4b 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -12,9 +12,9 @@ ## Project summary This project aims to publish and regularly update a dataset or _datasets_ of -Openverse' media metadata. This project will provide access to the data -currently served by our API, but which is difficult to access in full, requires -significant time, money, and compute resources. +Openverse's media metadata. This project will provide access to the data +currently served by our API, but which is difficult to access in full and requires +significant time, money, and compute resources to maintain. ## Motivation @@ -39,7 +39,7 @@ for all users. It also requires significant maintainer effort to identify, mitigate, and block scraping traffic. By sharing this data as a full dataset on -[HuggingFace](https://huggingface.co/datasets), we can remove these barriers +[HuggingFace](https://huggingface.co/datasets), we can remove these barriers and allow folks to access the data provided by Openverse.org and the Openverse API without restriction. @@ -52,7 +52,7 @@ part of the [data preprocessing](https://huggingface.co/docs/transformers/preprocessing) stage of model training. -HuggingFace in particular will enable community members to analyze the dataset +Presence on HuggingFace in particular will enable community members to analyze the dataset and create supplemental datasets; to train models with the dataset; and to use the dataset with all of HuggingFace's tooling: the [Datasets](https://github.com/huggingface/datasets) library in particular. From 9a5d5987b41cc4cf93897cf84977b601d1f650f9 Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Mon, 17 Jul 2023 16:22:10 -0400 Subject: [PATCH 05/12] format after changes --- .../publish_dataset/20230706-project_proposal.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index c966740fa4b..17757629e6f 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -13,8 +13,8 @@ This project aims to publish and regularly update a dataset or _datasets_ of Openverse's media metadata. This project will provide access to the data -currently served by our API, but which is difficult to access in full and requires -significant time, money, and compute resources to maintain. +currently served by our API, but which is difficult to access in full and +requires significant time, money, and compute resources to maintain. ## Motivation @@ -52,9 +52,9 @@ part of the [data preprocessing](https://huggingface.co/docs/transformers/preprocessing) stage of model training. 
-Presence on HuggingFace in particular will enable community members to analyze the dataset -and create supplemental datasets; to train models with the dataset; and to use -the dataset with all of HuggingFace's tooling: the +Presence on HuggingFace in particular will enable community members to analyze +the dataset and create supplemental datasets; to train models with the dataset; +and to use the dataset with all of HuggingFace's tooling: the [Datasets](https://github.com/huggingface/datasets) library in particular. It is worth noting that this year we identified many projects to work on which From ff2b6ae6226c51497e6f570764c725bc8e8e6f62 Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Mon, 17 Jul 2023 16:26:56 -0400 Subject: [PATCH 06/12] Deduplication cleanup --- .../publish_dataset/20230706-project_proposal.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index 17757629e6f..18def9145cc 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -82,8 +82,12 @@ transformation models hosted on their hub. This feature is called existing model, for example [Vision Transformer](https://huggingface.co/google/vit-base-patch16-224) - [Duplicate image detection #427](https://github.com/WordPress/openverse/issues/427) - - Deduplication is an extremely common preprocessing step. We should be able to - apply the deduplication from + Deduplication is an extremely common and important preprocessing step. It will + surely be implemented by community contributors and can be reapplied to + Openverse itself. It's possible we'll want a different deduplication strategy + that considers provenance and ownership (i.e, did the original author upload + or someone else), but at a minimum, we should be able to use community + solutions to _identify_ duplicates. ### Resiliency From a9aba74bac74deb871f901d065cdcb5225f7ea2c Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Mon, 17 Jul 2023 16:30:59 -0400 Subject: [PATCH 07/12] Clarify initial dump considerations --- .../publish_dataset/20230706-project_proposal.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index 18def9145cc..d4d0aa5b38d 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -122,10 +122,12 @@ in that plan we will: - Determine if we host a single dataset for all media types, or separate datasets for different media types. - Develop a plan for updating the datasets regularly -- Refine how we will provide access to our raw dataset for upload to - HuggingFace: - - Delivery mechanism (a requester pays S3 bucket?) - - File format (parquet files? TSVs? a Postgres dump?) +- Refine how we will provide access to our initial, raw dataset for upload to + HuggingFace. This refers to the initial, raw dump from Openverse, not the + actual dataset which will be provided to users. These details are somewhat + trivial as the data can be parsed and transformed prior to distribution. 
+ - Delivery mechanism, likely a requester pays S3 bucket + - File format, likely parquet files We will also need to coordinate the launch of these efforts and associated outreach. See more about that in the [marketing section](#marketing). From 1adbaa203e7c2549cd21da0218165a876005006e Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Mon, 17 Jul 2023 16:41:38 -0400 Subject: [PATCH 08/12] Clarification from Saras review --- .../20230706-project_proposal.md | 22 ++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index d4d0aa5b38d..19ef218acbf 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -153,13 +153,22 @@ this include: - Openverse maintainers - Responsible for creating the initial raw data dump, - maintaining the Openverse account and Dataset on HuggingFace + maintaining the Openverse account and Dataset on HuggingFace. We also need to + make sure maintainers are protected from liability related the dataset, for + example: from distributing PDM works, works acquired by institutions without + consent or input from their cultures of origin, or copyrighted works + incorrectly marked as CC licensed. +- CC Licensors with works in Openverse - It is critical that we respect their + intentions and properly communicate the usage conditions for different license + attributes (NC, ND, SA, and so on) in our Dataset documentation. We also need + to spread awareness of the opt-in/out mechanism + [Spawning AI](https://spawning.ai/) which is integrated with HuggingFace. - HuggingFace - A key partner responsible for the initial dataset upload, providing advice, and potential marketing collaboration - Creative Commons - Stewards of the Commons and CC Licenses, advisors, and another partner in marketing promotion -- Aaron Ghokaslan - A researcher working on supplementary datasets and providing - technical advice +- Aaron Gokaslan & MosaicML - A researcher working on supplementary datasets and + providing technical advice ## Infrastructure @@ -204,9 +213,12 @@ and potentially Openverse.org should all be update to reflect these changes. - Initial Data Dump Creation - A plan describing how to produce and provide access to the raw data dumps which will be used to create the Dataset(s). - - This will be the first, largest, and most important plan + Additionally, this plan should address the marketing and documentation of the + initial data dump. Essentially, all facets of the project relating to the + initial release. + - This is the first, largest, and most important plan. - Dataset Maintenance - A plan describing how we will regularly release updates - to the Dataset(s) + to the Dataset(s). 
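+
+As a loose sketch of what the raw data dump handoff described above could look
+like for a downstream consumer, reading one parquet shard from a requester pays
+bucket might work roughly as follows. The bucket and key names are
+placeholders; the actual delivery mechanism and layout will be settled in the
+implementation plan.
+
+```python
+# Sketch only: bucket, key, and layout are placeholders to be decided in the
+# implementation plan. The RequestPayer flag is what a requester pays bucket
+# requires on every call.
+import io
+
+import boto3
+import pandas as pd
+
+s3 = boto3.client("s3")
+response = s3.get_object(
+    Bucket="openverse-data-dumps",  # hypothetical bucket
+    Key="image/part-0000.parquet",  # hypothetical key
+    RequestPayer="requester",
+)
+
+# Load one shard of image metadata into a dataframe for inspection.
+df = pd.read_parquet(io.BytesIO(response["Body"].read()))
+print(df.columns.tolist())
+```
+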
We will also want a plan for how we intend to _use_ the HuggingFace platform to complete our other projects for the year, but that might fall outside the scope From 8f5478176f76bf5628267a77d3e4970aaf7596dd Mon Sep 17 00:00:00 2001 From: Zack Krida Date: Tue, 18 Jul 2023 11:22:43 -0400 Subject: [PATCH 09/12] Apply suggestions from code review --- .../proposals/publish_dataset/20230706-project_proposal.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index 19ef218acbf..de427d697b1 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -6,8 +6,8 @@ -- [ ] @AetherUnbound -- [ ] @sarayourfriend +- [x] @AetherUnbound +- [x] @sarayourfriend ## Project summary From 6f5951a191b3df9c6dd26154cedee5f81e35d8ac Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Thu, 17 Aug 2023 10:04:28 -0700 Subject: [PATCH 10/12] Fix some typos --- .../publish_dataset/20230706-project_proposal.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index de427d697b1..bd0f5dfc247 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -62,9 +62,10 @@ rely on bulk analysis of Openverse's data. These projects could be replaced by, or made easier by the publication of the datasets. This could work in a few ways. A community member, training a model using the Openverse dataset, generates metadata that we want and planned to generate ourselves. Then, the -HuggingFace platform presents an alternative to other SaSS products we intended -to use to generate machine labels, detect sensitive content, and so on. Instead -of those offerings we use models hosted on the HuggingFace hub. The +HuggingFace platform presents an alternative to other SaaS (software as a +service) products we intended to use to generate machine labels, detect +sensitive content, and so on. Instead of those offerings we use models hosted on +the HuggingFace hub. The [Datasets library](https://huggingface.co/docs/datasets/index) allows for easy loading of the Openverse dataset in any data pipelines we write. HuggingFace also offers the ability to deploy production-ready API endpoints for @@ -105,7 +106,7 @@ any other unforeseen disruptions. This project encompasses all of our 2023 "lighthouse goals", but "Community Development" is perhaps the broadest and most relevant. Others touched on here, or impacted through potential downstream changes, are "Result Relevancy", -"Quantifying our Work","Search Experience", "Content Safety", and "Data +"Quantifying our Work", "Search Experience", "Content Safety", and "Data inertia". 
## Requirements From f90ec8205bb7db7a4018c252e46aed52f0e647ea Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Thu, 17 Aug 2023 10:06:18 -0700 Subject: [PATCH 11/12] Add note about pausing efforts --- .../proposals/publish_dataset/20230706-project_proposal.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md index bd0f5dfc247..2b35d9ed857 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md @@ -1,5 +1,11 @@ # Project Proposal: Providing and maintaining an Openverse image dataset +```{note} +The Openverse maintainer team has decided to pause this project for the time +being. [See our public update on it here](https://github.com/WordPress/openverse/issues/2545#issuecomment-1682627814). +This proposal may be picked back up when the discussion reopens in the future. +``` + **Author**: @zackkrida ## Reviewers From 923123db2c85d0f964b55004601a097dd11c9030 Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Thu, 17 Aug 2023 10:16:05 -0700 Subject: [PATCH 12/12] Rename, standardize title --- ...osal.md => 20230706-project_proposal_openverse_dataset.md} | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) rename documentation/projects/proposals/publish_dataset/{20230706-project_proposal.md => 20230706-project_proposal_openverse_dataset.md} (99%) diff --git a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md b/documentation/projects/proposals/publish_dataset/20230706-project_proposal_openverse_dataset.md similarity index 99% rename from documentation/projects/proposals/publish_dataset/20230706-project_proposal.md rename to documentation/projects/proposals/publish_dataset/20230706-project_proposal_openverse_dataset.md index 2b35d9ed857..07196c72880 100644 --- a/documentation/projects/proposals/publish_dataset/20230706-project_proposal.md +++ b/documentation/projects/proposals/publish_dataset/20230706-project_proposal_openverse_dataset.md @@ -1,6 +1,6 @@ -# Project Proposal: Providing and maintaining an Openverse image dataset +# 2023-07-06 Project Proposal: Providing and maintaining an Openverse image dataset -```{note} +```{attention} The Openverse maintainer team has decided to pause this project for the time being. [See our public update on it here](https://github.com/WordPress/openverse/issues/2545#issuecomment-1682627814). This proposal may be picked back up when the discussion reopens in the future.