strange datasets for Yue Chinese corpus #1

cosmeowpawlitan · 2021-06-17T13:54:02Z

According to the official site, the Yue Chinese dataset should have 2.2KB of data.
7 training instances is obviously not the right number.
As I can read Yue Chinese, I can tell the last instance is definitely not something that would appear on Common Crawl.
And even if you don't read Yue Chinese, you can tell the first six instances are problematic.
(It is embarrassing, as the 7 training instances look exactly like something from a pornographic novel or flirting messages in a chat on a dating app.)
It might not be a problem with the huggingface/datasets implementation, because when I tried to download the dataset from the official site, I found that the zip file is corrupted.
I will try to inform the host of the OSCAR corpus later.
Anyway, a remake of this dataset in huggingface/datasets is needed, perhaps after the host of the dataset fixes the issue.
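
For anyone who wants to confirm the corruption locally, here is a minimal sketch using only Python's standard `zipfile` module. The function name `check_zip` and the file name in the usage note are placeholders for illustration, not part of the OSCAR or huggingface/datasets tooling.

```python
import zipfile


def check_zip(path):
    """Return None if the archive passes its CRC checks,
    otherwise a short description of the problem."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the name of the
            # first one whose CRC or header is bad, or None if all pass.
            bad_member = zf.testzip()
    except zipfile.BadZipFile as exc:
        # Raised when the file cannot be parsed as a zip archive at all.
        return f"not a readable zip archive: {exc}"
    if bad_member is not None:
        return f"first corrupted member: {bad_member}"
    return None
```

Calling `check_zip("yue.zip")` (a hypothetical local file name for the downloaded archive) returns `None` for an intact archive and a message otherwise. It only checks archive integrity, so it says nothing about whether the text inside is actually Yue Chinese.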

This post is copied from a post about the same issue in Hugging Face's repository: huggingface/datasets#2396

ayaka14732 · 2022-10-27T06:46:29Z

I have been working on Cantonese (aka Yue Chinese) NLP for a long time. Are there any guides on how I can help?

Uinelj · 2022-10-27T08:28:09Z

We don't have a guide yet (we'll be working on it), but you can improve the quality of the corpus in a specific language either by finding good resources for training language identification models, or by validating the output data through manual inspection of the corpus.
These two tasks are the most direct ways of helping to build better corpora in the end, but obviously if you have other suggestions or resources that could be useful to improve data quality, do tell! 😁

Uinelj transferred this issue from oscar-project/oscar-website Oct 18, 2021

Uinelj added lang:yue Language: Yue Chinese ver:2019 Version: OSCAR 2019 labels Oct 18, 2021

Uinelj added this to OSCAR Feb 10, 2022
