Use VISTEC-TP-TH-21 dataset for Thai #24
You can see more about Thai datasets at https://nlpforthai.com/tasks/word-segmentation/
@sffc, for clarification: is the intention to train a new model using this dataset, or to retrain an existing one?
The intention is to design a model that outperforms existing models on one or more axes, including memory usage, model size, accuracy, and performance. Re-training using the new dataset is also something we should explore, but that is separate from the choice of model.
Hello! Thank you for the new word segmentation work. I think you should use the VISTEC-TP-TH-21 dataset. It is licensed CC-BY-SA and is the largest social-media-domain dataset for Thai text processing.
VISTEC-TP-TH-21: https://github.com/mrpeerat/OSKut/tree/main/VISTEC-TP-TH-2021
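To use a word-segmented corpus like this for training, a common first step is converting each segmented line into per-character boundary labels. The sketch below assumes a `|`-delimited format (a convention used by several Thai segmentation corpora); check the repository above for the actual file layout and any extra annotation markers before relying on this.

```python
# Sketch: convert a "|"-delimited segmented line into per-character
# boundary labels (1 = first character of a word, 0 = continuation),
# the usual supervision signal for a character-level segmentation model.
# The "|" delimiter is an assumption about the corpus format.

def line_to_labels(line: str):
    chars, labels = [], []
    for word in line.strip().split("|"):
        if not word:  # skip empty fields from leading/trailing delimiters
            continue
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return chars, labels

# Example with a Thai sentence segmented into three words:
chars, labels = line_to_labels("ผม|ชอบ|กิน")
```

A model trained on these labels predicts, for each character, whether it starts a new word; segmentation at inference time is then just splitting at the predicted 1s.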