Spanish word embeddings computed using fastText [1] on the Spanish Unannotated Corpora.
The data had already been preprocessed in Spanish Unannotated Corpora: the text was lowercased, and repeated spaces, URLs, and other noise were removed. We also applied the punctuation-splitting script included in that repository.
With that tokenization, the 2.6B-word corpus yields 3.4B tokens.
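
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the script from the Spanish Unannotated Corpora repository: the regular expressions, the function names `preprocess` and `count_tokens`, and the sample sentence are all assumptions made for the example. It also shows why the token count grows relative to the word count once punctuation is split off.

```python
import re

# Assumed patterns; the original repository's rules may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Any single character that is neither a word character nor whitespace.
PUNCT_RE = re.compile(r"([^\w\s])")
SPACES_RE = re.compile(r"\s+")


def preprocess(line: str) -> str:
    """Lowercase, drop URLs, split on punctuation, collapse whitespace."""
    line = line.lower()
    line = URL_RE.sub(" ", line)
    # Surround each punctuation mark with spaces so it becomes its own token.
    line = PUNCT_RE.sub(r" \1 ", line)
    return SPACES_RE.sub(" ", line).strip()


def count_tokens(line: str) -> int:
    """Number of whitespace-separated tokens after preprocessing."""
    return len(preprocess(line).split())


raw = "Hola, mundo!  Visita https://example.com hoy."
print(preprocess(raw))                        # hola , mundo ! visita hoy .
print(len(raw.split()), "->", count_tokens(raw))  # 5 -> 7
```

Because each punctuation mark becomes a separate token, a corpus counted in whitespace-delimited words ends up with more tokens after splitting.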
For new L, we used the updated version of Spanish Unannotated Corpora, which has 3B words, and applied the same preprocessing as for the other models.
We used fastText's default parameters for the skipgram task, except for the number of epochs, which we set to 20 instead of 5.
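
As a rough sketch, an equivalent training run with the `fasttext` Python package might look like the following. The function name `train_embeddings` and the file names in the usage comment are hypothetical, and the released models may well have been trained with the command-line tool instead; only the skipgram model and the 20 epochs come from the description above.

```python
# Only the two settings stated in the text are set explicitly; every other
# hyperparameter is left at fastText's default.
TRAIN_PARAMS = {"model": "skipgram", "epoch": 20}


def train_embeddings(corpus_path: str, output_path: str):
    """Train skipgram vectors on a preprocessed plain-text corpus file."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    model = fasttext.train_unsupervised(corpus_path, **TRAIN_PARAMS)
    model.save_model(output_path)
    return model


# Hypothetical usage (file names are placeholders, not the released ones):
# model = train_embeddings("corpus.txt", "embeddings.bin")
# vector = model.get_word_vector("hola")
```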
We evaluated our word embeddings on SemEval-2017 Task 2 (Subtask 1) using the script provided by the MUSE library, with these results:
| | XS | S | M | L | new L |
|---|---|---|---|---|---|
| Score | 0.59150 | 0.67589 | 0.72345 | 0.74676 | 0.72940 |
To the best of our knowledge, the L embedding model was the best available for Spanish at the date of publication.
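
For readers unfamiliar with word-similarity benchmarks, the general idea behind such a score can be sketched as below: compute the cosine similarity of each word pair and correlate it with human ratings. This is not the MUSE script. The use of Spearman rank correlation, the skipping of out-of-vocabulary pairs, the function names, and the toy vectors and ratings are all assumptions made for illustration, so the numbers are unrelated to the table above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, gold_pairs):
    """Score embeddings on (word1, word2, rating) triples, skipping OOV pairs."""
    predicted, gold = [], []
    for w1, w2, rating in gold_pairs:
        if w1 in vectors and w2 in vectors:
            predicted.append(cosine(vectors[w1], vectors[w2]))
            gold.append(rating)
    return spearman(predicted, gold)


# Toy vectors and ratings, invented for this example.
vectors = {
    "gato": [1.0, 0.0],
    "perro": [0.9, 0.1],
    "coche": [0.0, 1.0],
    "auto": [0.2, 0.8],
}
gold_pairs = [
    ("gato", "perro", 3.5),
    ("coche", "auto", 3.9),
    ("gato", "coche", 0.5),
    ("perro", "auto", 1.0),
]
print(round(evaluate(vectors, gold_pairs), 3))  # 0.8
```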
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information