This repository contains state-of-the-art language models and a classifier for code-mixed Manglish (Malayalam and English), spoken in the Indian subcontinent.
- Malayalam Wikipedia Articles: preprocessed, transliterated, and translated versions of this dataset, used for language modeling in this repo, can be downloaded directly from here
Architecture/Dataset | Malayalam Wikipedia Articles (Latin Script) | Malayalam Wikipedia Articles (Mixed Script) |
---|---|---|
ULMFiT | 45.84 | 41.22 |
Dataset | F1 | Precision | Recall | Notebook to Reproduce results |
---|---|---|---|---|
Dravidian Codemix HASOC @ FIRE 2020 (Latin Script) | 0.74 | 0.76 | 0.72 | Link |
Dravidian Codemix HASOC @ FIRE 2020 (Mixed Script) | 0.91 | 0.92 | 0.91 | Link |
Dravidian Codemix Sentiment Analysis @ FIRE 2020 | 0.69 | 0.69 | 0.7 | Link |
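
As a quick sanity check on the table above, the reported F1 values can be compared with the harmonic mean of the reported precision and recall. This is only a sketch: the README does not state which averaging scheme (macro, micro, weighted) was used, and for averaged scores the identity need not hold exactly. The function name `f1_from_precision_recall` and the shortened dataset labels are illustrative, not part of the repository.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
# The labels are shortened here for readability (assumption, not repo names).
reported = {
    "HASOC (Latin Script)": (0.76, 0.72, 0.74),
    "HASOC (Mixed Script)": (0.92, 0.91, 0.91),
    "Sentiment Analysis": (0.69, 0.70, 0.69),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_from_precision_recall(p, r)
    print(f"{name}: reported F1={f1:.2f}, harmonic mean of P and R={recomputed:.4f}")
```

For all three rows the harmonic mean rounds to the reported F1 at two decimal places, so the table is internally consistent under this simple reading.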
Architecture/Dataset | Malayalam Wikipedia Articles (Latin Script) | Malayalam Wikipedia Articles (Mixed Script) |
---|---|---|
ULMFiT | Embeddings projection | Embeddings projection |
Download the Latin Script pretrained ULMFiT LM from here
Download the Mixed Script pretrained ULMFiT LM from here
The tokenizer was trained using Google's SentencePiece.
Download the trained model and vocabulary for both Latin and Mixed scripts from here