According to the official site, the Yue Chinese dataset should contain 2.2 KB of data.
7 training instances is obviously not the right number.
As I can read Yue Chinese, I can tell that the last instance is definitely not something that would appear on Common Crawl.
And even if you don't read Yue Chinese, you can tell that the first six instances are problematic.
(It is embarrassing, as the 7 training instances look exactly like something from a pornographic novel or flirting messages in a dating app chat.)
This might not be a problem with the huggingface/datasets implementation, because when I tried to download the dataset from the official site, I found that the zip file was corrupted.
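For anyone who wants to reproduce the corruption check on their own download, here is a minimal sketch using only Python's standard `zipfile` module. The helper name `check_zip` and the in-memory sample archive are made up for illustration; they are not part of OSCAR or huggingface/datasets, and the sketch assumes the download really is a zip archive.

```python
import io
import zipfile


def check_zip(path_or_file):
    """Return (ok, detail) for a zip archive; detail names the first problem found."""
    try:
        with zipfile.ZipFile(path_or_file) as archive:
            # testzip() reads every member and returns the first one with a bad CRC.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        return False, f"not a readable zip archive: {exc}"
    if bad_member is not None:
        return False, f"corrupted member: {bad_member}"
    return True, "all members passed the CRC check"


# Build a small in-memory archive so the sketch runs without any download.
# The file name and contents are placeholders, not real OSCAR data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("yue_sample.txt", "placeholder text\n")

intact = io.BytesIO(buffer.getvalue())
truncated = io.BytesIO(buffer.getvalue()[:20])  # simulate an incomplete download

print(check_zip(intact))
print(check_zip(truncated))
```

To check a real download, pass the local file path instead of the in-memory buffer, e.g. `check_zip("path/to/downloaded.zip")`.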
I will try to inform the host of the OSCAR corpus later.
Anyway, a remark about this dataset in huggingface/datasets is needed, perhaps after the host of the dataset fixes the issue.
This post is copied from a post about the same issue in huggingface's repository: huggingface/datasets#2396
Uinelj transferred this issue from oscar-project/oscar-website on Oct 18, 2021
We don't have a guide yet (but we'll be working on it). In the meantime, the quality of the corpus in a specific language can be improved either by finding good resources for training language identification models, or by validating the output data through manual inspection of the corpus.
These two tasks are the most direct ways of helping to build better corpora in the end, but obviously if you have other suggestions or resources that could be useful to improve data quality, do tell! 😁
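As a starting point for the manual inspection mentioned above, here is a rough sketch of drawing a small random sample of lines to read through. It uses reservoir sampling so that a large corpus file never has to fit in memory. The function name `sample_lines` and the toy input are hypothetical; this is not an official OSCAR tool.

```python
import random


def sample_lines(lines, k, seed=0):
    """Reservoir-sample up to k non-empty lines from an iterable of strings."""
    rng = random.Random(seed)  # fixed seed so the same sample can be reviewed again
    reservoir = []
    seen = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seen += 1
        if len(reservoir) < k:
            reservoir.append(line)
        else:
            # Keep the new line with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = line
    return reservoir


# Toy input standing in for a corpus file; a real run could pass an open file object.
corpus = ["first document", "", "second document", "third document"]
for line in sample_lines(corpus, k=2):
    print(line)
```

With a real corpus, something like `with open(path, encoding="utf-8") as f: sample_lines(f, k=50)` would give a reviewer a manageable batch to check by hand.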