Spanish word embeddings of different sizes, computed from large corpora using fastText.

BotCenter/spanishWordEmbeddings

Spanish Word Embeddings

Spanish word embeddings computed using fastText on the Spanish Unannotated Corpora.

Pre-Processing

The data had already been preprocessed in the Spanish Unannotated Corpora repository: text was lowercased, multiple spaces were collapsed, and URLs and other noise were removed. We also applied the punctuation-splitting script included in that repository.

Under that tokenization, the 2.6B-word corpus yields 3.4B tokens.
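The preprocessing steps described above can be sketched roughly as follows. This is not the script from the Spanish Unannotated Corpora repository; the function name `preprocess` and the regular expressions are assumptions made for illustration, and the real script may treat URLs and punctuation differently.

```python
import re

# Hypothetical patterns; the original preprocessing script may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"([^\w\s])")   # any single character that is not a letter, digit, underscore or space
SPACE_RE = re.compile(r"\s+")


def preprocess(line: str) -> str:
    """Lowercase, drop URLs, split punctuation into separate tokens, collapse spaces."""
    line = line.lower()
    line = URL_RE.sub(" ", line)
    line = PUNCT_RE.sub(r" \1 ", line)  # surround each punctuation mark with spaces
    line = SPACE_RE.sub(" ", line)
    return line.strip()


raw = "Hola,  Mundo! Visita http://example.com ahora."
clean = preprocess(raw)
tokens = clean.split()
```

Because each punctuation mark becomes its own token, the token count ends up higher than the whitespace-delimited word count, which is consistent with the corpus growing from 2.6B words to 3.4B tokens.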

For the new L model, we used the updated version of the Spanish Unannotated Corpora, which has 3B words, and applied the same preprocessing as for the other models.

fastText Parameters

We used fastText's default parameters for the skipgram model, except for the number of epochs, which we set to 20 instead of the default 5.
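As a rough sketch of that configuration, the helper below builds a fastText command line for skipgram training with 20 epochs and every other option left at its default. The helper name `fasttext_skipgram_command` and the file paths are hypothetical; the exact command used to train these models is not given here.

```python
def fasttext_skipgram_command(corpus_path: str, output_prefix: str, epochs: int = 20) -> list[str]:
    """Build the argument list for a fastText skipgram run.

    Only -epoch is overridden; all other hyperparameters keep fastText's defaults.
    """
    return [
        "fasttext", "skipgram",
        "-input", corpus_path,       # preprocessed, tokenized corpus (hypothetical path)
        "-output", output_prefix,    # fastText writes <prefix>.bin and <prefix>.vec
        "-epoch", str(epochs),
    ]


cmd = fasttext_skipgram_command("corpus.txt", "embeddings-l-model")
# To launch training (assumes the fasttext binary is on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
# With the Python bindings, the equivalent call would be:
#   fasttext.train_unsupervised("corpus.txt", model="skipgram", epoch=20)
```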

Evaluation

We evaluated our word embeddings on SemEval-2017 Task 2 (Subtask 1) using the script provided by the MUSE library, obtaining these results:

| | XS | S | M | L | new L |
|---|---|---|---|---|---|
| Score | 0.59150 | 0.67589 | 0.72345 | 0.74676 | 0.72940 |

To our knowledge, the L embedding model was the best one available for Spanish at the date of publication.
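For intuition, word-similarity benchmarks of this kind are commonly scored by correlating the cosine similarity of each word pair with a human rating. The sketch below assumes a Spearman rank correlation and uses made-up toy vectors and ratings; it is not the MUSE script, and none of the numbers come from the models above.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, gold):
    """gold: (word1, word2, human_score) triples; pairs with a missing word are skipped."""
    model_scores, human_scores = [], []
    for w1, w2, score in gold:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)


# Toy data, invented for illustration only.
toy_vectors = {
    "perro": [1.0, 0.1, 0.0],
    "gato": [0.9, 0.2, 0.0],
    "coche": [0.0, 1.0, 0.1],
    "auto": [0.1, 0.9, 0.1],
}
toy_gold = [("perro", "gato", 3.5), ("coche", "auto", 4.0), ("perro", "coche", 0.5)]
rho = evaluate(toy_vectors, toy_gold)
```

A score near 1 means the model orders word pairs the same way human annotators do, so a higher value in the table corresponds to closer agreement with human similarity judgments.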

Download

Reference

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
