Python script for preprocessing Tagalog datasets (can be adapted to other languages).
- Lang Library version: used for smaller datasets (no longer updated).
- FastText version: used for relatively large datasets.
- Cleans text data by removing HTML tags and symbols.
- Removes sentences in other languages or with mixed languages (Tagalog is the default, but this can be changed to another language).
- Batch size can be adjusted to fit available memory (default: 1024 bytes). Parallel processing is not available yet.
- Checkpoints in the FastText version, to handle large datasets.
Note: This project defaults to Tagalog but can be changed to any other language supported by the langdetect library.
- Config file for easy modification.
- Parallel Batch Computing
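
The cleaning, language-filtering, and batching steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`clean_text`, `batched`, `keep_language`, `preprocess`) and the `toy_detect` stand-in are hypothetical, the set of characters treated as "symbols" is an assumption, and the batch size here counts lines rather than bytes.

```python
import html
import re
from typing import Callable, Iterable, Iterator, List

TAG_RE = re.compile(r"<[^>]+>")
# Assumption: anything other than word characters, whitespace, and basic
# punctuation counts as a "symbol" to remove.
SYMBOL_RE = re.compile(r"[^\w\s.,!?'-]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, decode entities, drop symbols, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = SYMBOL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items so memory use stays bounded."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def keep_language(lines: Iterable[str], detect: Callable[[str], str],
                  target: str = "tl") -> List[str]:
    """Keep only the lines whose detected language code equals target."""
    kept = []
    for line in lines:
        try:
            if detect(line) == target:
                kept.append(line)
        except Exception:
            # Some detectors raise on empty or featureless input; drop the line.
            continue
    return kept


def preprocess(lines: Iterable[str], detect: Callable[[str], str],
               target: str = "tl", batch_size: int = 1024) -> Iterator[str]:
    """Clean each batch, discard empty results, then filter by language."""
    for batch in batched(lines, batch_size):
        cleaned = [c for c in (clean_text(line) for line in batch) if c]
        yield from keep_language(cleaned, detect, target)


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector used only to keep this sketch runnable."""
    tagalog_words = {"ang", "ng", "magandang", "salamat", "umaga"}
    words = set(re.findall(r"\w+", text.lower()))
    return "tl" if words & tagalog_words else "en"
```

In practice a real detector would replace `toy_detect`. For example, `langdetect.detect` takes a string and returns a language code such as `tl` for Tagalog, so it could be passed as the `detect` argument directly.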