Use VISTEC-TP-TH-21 dataset for Thai #24

wannaphong · 2023-08-18T21:51:47Z

Hello! Thank you for the new word segmentation work. I think you should use the VISTEC-TP-TH-21 dataset. It is licensed under CC-BY-SA and is the largest social media domain dataset for Thai text processing.

VISTEC-TP-TH-21: https://github.com/mrpeerat/OSKut/tree/main/VISTEC-TP-TH-2021

wannaphong · 2023-08-18T21:52:32Z

You can find more Thai datasets at https://nlpforthai.com/tasks/word-segmentation/

sffc · 2023-08-18T23:52:27Z

CC @younies @SahandFarhoodi

0saurabh0 · 2024-03-23T09:31:50Z

@sffc, for clarification: is the intention to train a new model using this dataset, or to retrain an existing one?

sffc · 2024-03-29T21:49:53Z

The intention is to design a model that outperforms existing models on one or more axes, including memory usage, model size, accuracy, and performance. Retraining with the new dataset is also something we should explore, but that is separate from the choice of model.
