A neural network to detect the language of the sentence.

imdaredevil/language-detection

language-detection

A neural network to detect the language of short sentences (50 characters or fewer). This repository implements three kinds of algorithms for language identification and compares them on the Tatoeba dataset.

Included Languages

  • English
  • French
  • Italian
  • German
  • Spanish

Dataset

The Tatoeba multi-language dataset contains sentences from around 405 languages. Sentences in the five languages listed above were extracted and split into phrases of 50 characters or fewer. 50,000 such phrases were randomly selected for each language. From the pooled phrases, 25,000 were used for testing and another 25,000 for validation. After that, performance was measured on the complete dataset for these languages.
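The splitting and sampling steps described above could be sketched roughly as follows. This is a minimal illustration, not the code used in this repository: the greedy word-packing strategy, the policy of skipping over-long words, and the names `split_into_phrases` and `sample_phrases` are all assumptions made for the example.

```python
import random

MAX_CHARS = 50  # phrase length limit stated in this README


def split_into_phrases(sentence, max_chars=MAX_CHARS):
    """Greedily pack whole words into phrases of at most max_chars characters.

    Assumption: phrases are cut at word boundaries. The README does not
    say how the original split was performed.
    """
    phrases, current = [], ""
    for word in sentence.split():
        if len(word) > max_chars:
            # Hypothetical policy: a single word longer than the limit is skipped.
            continue
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current phrase is full; start a new one with this word.
            phrases.append(current)
            current = word
    if current:
        phrases.append(current)
    return phrases


def sample_phrases(phrases, k, seed=0):
    """Randomly select up to k phrases, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(phrases, min(k, len(phrases)))
```

For the setup described above, `sample_phrases(phrases, 50000)` would be called once per language on the phrases produced by `split_into_phrases`.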

Models

  • Simple count-based model, based on this blog. The blog does not discuss using generators to keep memory usage low; this repository implements that. - 98.3% accuracy
  • Pre-trained fastText model - a continuous bag-of-words model with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. - 98.6% accuracy
  • Apple's Bi-LSTM model, as proposed in this research paper. - 99.35% accuracy
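To illustrate the generator idea mentioned for the count-based model, here is a minimal sketch that builds character n-gram count features lazily, one batch at a time, so the full feature matrix never has to sit in memory. It is not the repository's implementation: the use of character trigrams, the vocabulary size, and the names `char_ngrams`, `build_vocab`, `count_vector`, and `batch_generator` are assumptions chosen for the example.

```python
from collections import Counter
from itertools import islice


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocab(phrases, n=3, top_k=200):
    """Map the top_k most frequent n-grams to feature indices.

    Assumption: a fixed-size vocabulary of frequent trigrams.
    """
    counts = Counter()
    for phrase in phrases:
        counts.update(char_ngrams(phrase, n))
    return {gram: i for i, (gram, _) in enumerate(counts.most_common(top_k))}


def count_vector(text, vocab, n=3):
    """Count how often each vocabulary n-gram occurs in text."""
    vec = [0] * len(vocab)
    for gram in char_ngrams(text, n):
        idx = vocab.get(gram)
        if idx is not None:
            vec[idx] += 1
    return vec


def batch_generator(samples, vocab, batch_size=32, n=3):
    """Lazily yield (features, labels) batches from (text, label) pairs.

    Only one batch of feature vectors exists in memory at any time.
    """
    it = iter(samples)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        features = [count_vector(text, vocab, n) for text, _ in chunk]
        labels = [label for _, label in chunk]
        yield features, labels
```

Because `samples` can itself be a generator that reads phrases from disk, this pattern keeps memory usage roughly proportional to the batch size rather than to the dataset size.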

Apple's Bi-LSTM
