language-detection

A neural network to detect the language of short sentences (50 characters or fewer). This repository implements three kinds of algorithms for language identification and compares them using the Tatoeba dataset.

Included Languages

English
French
Italian
German
Spanish

Dataset

The Tatoeba multilingual dataset contains sentences from around 405 languages. From these, the sentences in the languages listed above were selected and split into phrases of 50 characters or fewer. 50,000 such phrases were randomly selected for each language, and from all the sentences 25,000 were used for testing and another 25,000 for validation. After that, performance was measured on the complete dataset for the languages listed above.
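The preparation steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: it assumes the input is a tab-separated export with `id`, language code, and text columns, that the five languages use the ISO 639-3 codes shown, and that phrases are formed by greedily packing whole words. The function names `split_into_phrases`, `iter_phrases`, and `sample_per_language` are hypothetical.

```python
import random

# Assumed language codes for English, French, Italian, German, Spanish.
LANGS = {"eng", "fra", "ita", "deu", "spa"}
MAX_LEN = 50


def split_into_phrases(sentence, max_len=MAX_LEN):
    """Greedily pack whole words into phrases of at most max_len characters."""
    phrases, current = [], ""
    for word in sentence.split():
        if len(word) > max_len:
            continue  # a single word longer than the limit cannot fit anywhere
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_len:
            current = candidate
        else:
            phrases.append(current)
            current = word
    if current:
        phrases.append(current)
    return phrases


def iter_phrases(lines, langs=LANGS):
    """Yield (lang, phrase) pairs from 'id<TAB>lang<TAB>text' lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3 or parts[1] not in langs:
            continue  # skip malformed rows and languages outside the set
        for phrase in split_into_phrases(parts[2]):
            yield parts[1], phrase


def sample_per_language(pairs, n, seed=0):
    """Group phrases by language and draw up to n random phrases from each."""
    by_lang = {}
    for lang, phrase in pairs:
        by_lang.setdefault(lang, []).append(phrase)
    rng = random.Random(seed)
    return {lang: rng.sample(p, min(n, len(p))) for lang, p in by_lang.items()}
```

With a real export, `iter_phrases` could be fed an open file handle, and `sample_per_language(..., n=50000)` would correspond to the per-language sampling step.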

Models

A simple count-based model based on this blog. The blog did not discuss using generators to make the approach memory-efficient; this repository implements that. - 98.3% accuracy
Pre-trained fastText model - a continuous bag-of-words model with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. - 98.6% accuracy
Apple's Bi-LSTM model as proposed in this research paper - 99.35% accuracy
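To illustrate the idea behind the count-based model and the generator-based variant, here is a minimal sketch. It is an assumption about the general technique, not the notebook's implementation: it counts character trigrams per language while consuming a generator of `(lang, phrase)` pairs one item at a time, then scores new text with add-one smoothed log-probabilities. The names `char_ngrams`, `train`, and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lazily yield the character n-grams of a space-padded, lowercased text."""
    padded = f" {text.lower()} "
    return (padded[i:i + n] for i in range(len(padded) - n + 1))


def train(pairs, n=3):
    """Consume (lang, phrase) pairs once, updating per-language n-gram counts.

    `pairs` can be any iterable, including a generator that streams phrases
    from disk, so the corpus itself never has to be held in memory.
    """
    counts = defaultdict(Counter)
    for lang, phrase in pairs:
        counts[lang].update(char_ngrams(phrase, n))
    return counts


def predict(counts, text, n=3):
    """Return the language whose smoothed n-gram counts best explain the text."""
    best_lang, best_score = None, -math.inf
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        vocab = len(lang_counts) + 1  # +1 reserves mass for unseen n-grams
        score = sum(
            math.log((lang_counts[gram] + 1) / (total + vocab))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy usage: the training data is supplied as a generator expression.
toy = [
    ("eng", "the cat is on the table"),
    ("eng", "this is a house"),
    ("fra", "le chat est sur la table"),
    ("fra", "c'est une maison"),
]
model = train((lang, phrase) for lang, phrase in toy)
```

Only the count tables stay in memory here; the phrases are discarded as soon as they are counted, which is the point of feeding the training loop from a generator.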

