Other language data #93

Open
Dzg0309 opened this issue Dec 22, 2023 · 4 comments

Comments

@Dzg0309

Dzg0309 commented Dec 22, 2023

Thank you very much for your work in providing such rich data to the open source community. I was wondering whether there are any plans to release data in other languages, such as Chinese? I think most people also need Chinese data.

@mauriceweber
Collaborator

Hi @Dzg0309 -- currently we don't have plans to release data in other languages. However, if you want to create such a dataset (e.g. in Chinese), you can use the CCNet pipeline and the scripts in this repo to compute quality signals and deduplicate the corpus. Note that in other languages you will likely have to adapt the quality signals.
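
To illustrate why quality signals may need adapting for other languages, here is a minimal sketch. It is not code from this repo, and the function names are hypothetical. It assumes a length signal based on whitespace-separated words, which is reasonable for English but undercounts Chinese text because Chinese is written without spaces between words. The CJK-aware variant counts each character in the CJK Unified Ideographs block (U+4E00 to U+9FFF) as one unit, which is only a rough stand-in for a real Chinese word segmenter.

```python
def word_count_whitespace(text: str) -> int:
    """Hypothetical English-style signal: number of whitespace-separated tokens."""
    return len(text.split())


def is_cjk(ch: str) -> bool:
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def unit_count_cjk_aware(text: str) -> int:
    """Hypothetical adapted signal.

    For each whitespace-separated token: if it contains CJK ideographs,
    count each ideograph as one unit; otherwise count the token as one word.
    """
    count = 0
    for token in text.split():
        cjk_chars = sum(1 for ch in token if is_cjk(ch))
        count += cjk_chars if cjk_chars else 1
    return count


if __name__ == "__main__":
    for sample in ["this is a test", "这是一个测试"]:
        print(sample, word_count_whitespace(sample), unit_count_cjk_aware(sample))
```

With the whitespace-based version, a long Chinese sentence looks like a single word, so any threshold tuned on English documents would discard it. Other signals (stop-word ratios, punctuation rules, and so on) would likely need similar language-specific review.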

@Dzg0309
Author

Dzg0309 commented Jan 9, 2024

> Hi @Dzg0309 -- currently we don't have plans to release data in other languages. However, if you want to create such a dataset (e.g. in Chinese), you can use the CCNet pipeline and the scripts in this repo to compute quality signals and deduplicate the corpus. Note that in other languages you will likely have to adapt the quality signals.

Thank you very much for your reply. It is very difficult for us to filter Chinese data out of the original large-scale CommonCrawl because we cannot handle CC dumps of that size. Is there a way to obtain data that is already split by language, for example raw Chinese data? That way, we could process it and generate Chinese data using CCNet and the library you provided.

@davidrpugh

@mauriceweber I am a faculty member at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. I am about to kick off a project to apply these workflows to prepare an Arabic language subset, with the goal of contributing it to the next version of this dataset. Would there be interest in collaborating on this project? We have technical skills and plenty of compute, so what we really need is general guidance if we get stuck.

@Dzg0309 depending on how many resources we need to prepare the Arabic data, we may also be able to prepare data for other languages.

@mauriceweber
Collaborator

mauriceweber commented May 16, 2024

Hi @davidrpugh, awesome to hear that! I'm happy to provide any guidance you need and open to collaborating on this! :)
