2 Versions

DATASET PREPROCESSOR

Python script for preprocessing Tagalog datasets (can be adapted to other languages).

Cleans text data by removing HTML tags and symbols.
Removes sentences written in other languages or in mixed languages (Tagalog is the default target, but this can be changed).
Batch size is configurable to fit available memory (default: 1024 bytes). Parallel processing is not available yet.
The FastText version writes checkpoints so it can handle large datasets.
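
The cleaning, batching, and checkpointing steps listed above could look roughly like the sketch below. It is an illustration only, not the code in `tl_dataset_preprocessor.py`: the names `clean_text`, `iter_batches`, and `process_with_checkpoint`, the regular expressions, and the JSON checkpoint format are all assumptions, and the batch here is counted in lines rather than bytes to keep the example short.

```python
import html
import json
import os
import re

# Hypothetical patterns; the real script may keep or drop different symbols.
TAG_RE = re.compile(r"<[^>]+>")              # anything that looks like an HTML tag
SYMBOL_RE = re.compile(r"[^\w\s.,!?'\-]")    # keep letters, digits, basic punctuation
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip HTML tags, decode entities, drop stray symbols, collapse spaces."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = SYMBOL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def iter_batches(lines, batch_size=1024):
    """Yield lists of at most batch_size lines so memory use stays bounded."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_with_checkpoint(lines, out_path, ckpt_path, batch_size=1024):
    """Clean lines batch by batch, recording progress so a rerun can resume."""
    done = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path, encoding="utf-8") as f:
            done = json.load(f)["batches_done"]
    for index, batch in enumerate(iter_batches(lines, batch_size)):
        if index < done:
            continue  # this batch was already written in an earlier run
        cleaned = [c for c in map(clean_text, batch) if c]
        with open(out_path, "a", encoding="utf-8") as out:
            out.writelines(c + "\n" for c in cleaned)
        with open(ckpt_path, "w", encoding="utf-8") as f:
            json.dump({"batches_done": index + 1}, f)
```

Writing the checkpoint only after a batch has been appended to the output means an interrupted run repeats at most one batch when restarted.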

Note: This project defaults to Tagalog but can be adapted to any language supported by the langdetect library.
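
As a rough illustration of how the target language could be switched, the sketch below filters sentences with langdetect's `detect` function, which returns a language code such as `tl` for Tagalog. The helper names `make_langdetect_detector` and `filter_by_language` and the `target_lang` parameter are hypothetical and do not come from the project's scripts. The `langdetect` package must be installed for the default detector to work.

```python
def make_langdetect_detector():
    """Build a detector that returns a language code, or None on failure."""
    # Imported lazily so the filter can also be used with another detector.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed

    def _detect(text):
        try:
            return detect(text)
        except LangDetectException:
            return None  # e.g. text with no detectable features

    return _detect


def filter_by_language(sentences, target_lang="tl", detector=None):
    """Keep only sentences whose detected language equals target_lang."""
    if detector is None:
        detector = make_langdetect_detector()
    return [s for s in sentences if s.strip() and detector(s) == target_lang]
```

Calling `filter_by_language(sentences, target_lang="id")` would keep Indonesian sentences instead. Note that `detect` reports only the most likely language, so catching mixed-language sentences reliably may need a probability threshold, for example via langdetect's `detect_langs`.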

Files:

README.md
tl_dataset_preprocessor.py
tl_dataset_preprocessor_fasttext.py