strange datasets for Yue Chinese corpus #1

Open
cosmeowpawlitan opened this issue Jun 17, 2021 · 2 comments
Labels
lang:yue Language: Yue Chinese ver:2019 Version: OSCAR 2019

Comments

@cosmeowpawlitan

image
image
According to the official site, the Yue Chinese dataset should contain 2.2KB of data.
7 training instances is obviously not the right number.
As I can read Yue Chinese, I can tell that the last instance is definitely not something that would appear on Common Crawl.
And even if you don't read Yue Chinese, you can tell that the first six instances are problematic.
(It is embarrassing, as the 7 training instances look exactly like something from a pornographic novel or flirting messages in a dating app chat.)
It might not be a problem with the huggingface/datasets implementation, because when I tried to download the dataset from the official site, I found that the zip file is corrupted.
I will try to inform the host of the OSCAR corpus later.
Anyway, a remake of this dataset in huggingface/datasets is needed, perhaps after the host of the dataset fixes the issue.
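
For anyone who wants to reproduce the corrupted-archive check, here is a minimal sketch using Python's standard `zipfile` module. The helper name `check_zip` is made up for illustration, and the sketch assumes the download really is a zip archive saved locally; it is not part of OSCAR or huggingface/datasets.

```python
import zipfile


def check_zip(path):
    """Return (ok, detail) describing whether the archive at `path` is readable.

    `check_zip` is a hypothetical helper name used only for this illustration.
    """
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the first
            # one with a bad CRC or header, or None if all members are fine.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        # Raised when the file is not a zip archive at all (e.g. truncated).
        return False, f"not a valid zip file: {exc}"
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, "all members passed the CRC check"
```

Calling `check_zip("yue.zip")` (the file name is a placeholder for wherever the download was saved) returns a pair whose first element is `False` when the archive cannot be read.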

This post is copied from a post about the same issue in Hugging Face's repository: huggingface/datasets#2396

@Uinelj Uinelj transferred this issue from oscar-project/oscar-website Oct 18, 2021
@Uinelj Uinelj added lang:yue Language: Yue Chinese ver:2019 Version: OSCAR 2019 labels Oct 18, 2021
@Uinelj Uinelj added this to OSCAR Feb 10, 2022
@ayaka14732

I have been working on Cantonese (aka Yue Chinese) NLP for a long time. Are there any guides on how I can help?

@Uinelj
Member

Uinelj commented Oct 27, 2022

We don't have a guide yet (though we'll be working on one). In the meantime, the quality of the corpus for a specific language can be improved either by finding good resources for training language identification models, or by validating the output data through manual inspection of the corpus.
These two tasks are the most direct ways of helping to build better corpora, but if you have other suggestions or resources that could be useful for improving data quality, do tell! 😁
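
As a rough illustration of the manual-inspection route, the sketch below draws a reproducible random sample of lines to read through by hand. It assumes the corpus is available as an iterable of text lines (for example an open file with one document per line, which is an assumption here); `sample_lines` is a hypothetical helper, not an OSCAR tool.

```python
import random


def sample_lines(lines, k, seed=0):
    """Reservoir-sample up to k items from an iterable of lines in one pass.

    `sample_lines` is a hypothetical helper name used only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir
```

Because the sample is taken in a single pass, the whole corpus never has to fit in memory, and the fixed seed lets several reviewers look at the same lines.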
