Spanish Word Embeddings

Spanish word embeddings computed with fastText on the Spanish Unannotated Corpora.

Pre-Processing

The data had already been preprocessed in the Spanish Unannotated Corpora: text was lowercased, multiple spaces were collapsed, and URLs were removed, among other steps. We also applied the punctuation-splitting script included in that repository.

With that tokenization, the 2.6B-word corpus becomes 3.4B tokens.
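The preprocessing steps described above can be sketched roughly as follows. This is not the script from the Spanish Unannotated Corpora repository: the function name `preprocess` and the three regular expressions are assumptions made for illustration, and the original script may treat URLs and punctuation differently.

```python
import re

# Hypothetical patterns; the original preprocessing script may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # crude URL matcher
PUNCT_RE = re.compile(r"([^\w\s])")             # any single punctuation character
SPACES_RE = re.compile(r"\s+")                  # runs of whitespace


def preprocess(line: str) -> str:
    """Lowercase, drop URLs, split on punctuation, collapse spaces."""
    line = line.lower()
    line = URL_RE.sub(" ", line)
    # Surround each punctuation mark with spaces so it becomes its own token.
    line = PUNCT_RE.sub(r" \1 ", line)
    return SPACES_RE.sub(" ", line).strip()


if __name__ == "__main__":
    print(preprocess("¿Qué tal?  Visita https://example.com hoy."))
```

Because punctuation marks become separate tokens, the token count ends up larger than the word count, which is consistent with the corpus growing from 2.6B words to 3.4B tokens.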

For new L, we used the updated version of the Spanish Unannotated Corpora, which has 3B words, and applied the same preprocessing as for the other models.

fastText Parameters

We used fastText's default parameters for the skipgram task, except for the number of epochs, which we set to 20 instead of 5. The vector dimension also varies per model, as listed under Download.
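A minimal sketch of how such a training run might look with the fastText Python bindings. The helper names `training_params` and `train`, the `DIMS` table, and the corpus path are assumptions for illustration; the dimensions are taken from the Download section, and the authors may well have used the command-line tool instead.

```python
# Vector dimensions per model size, as listed in the Download section.
DIMS = {"XS": 10, "S": 30, "M": 100, "L": 300, "new L": 300}


def training_params(size: str) -> dict:
    """Skipgram with fastText defaults, except epoch=20 and a per-model dim."""
    return {"model": "skipgram", "epoch": 20, "dim": DIMS[size]}


def train(corpus_path: str, size: str):
    """Train on a preprocessed corpus file (hypothetical path)."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    return fasttext.train_unsupervised(input=corpus_path, **training_params(size))


if __name__ == "__main__":
    print(training_params("L"))
```

Every other hyperparameter (learning rate, window size, subword n-gram lengths, and so on) is left at the library default, matching the description above.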

Evaluation

We evaluated our word embeddings on SemEval-2017 Task 2 (Subtask 1) using the script provided by the MUSE library, with these results:

Model	XS	S	M	L	new L
Score	0.59150	0.67589	0.72345	*0.74676*	0.72940

As far as we know, the L embedding model was the best one available for Spanish at the date of publication.
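For readers unfamiliar with word-similarity benchmarks, the sketch below shows the general idea: compute the cosine similarity for each word pair and correlate it with human gold scores. It is not the MUSE script. The use of Spearman rank correlation, the skipping of out-of-vocabulary pairs, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, gold_pairs):
    """Correlate model similarities with gold scores.

    Pairs with a missing word are skipped (an assumption of this sketch).
    """
    gold, predicted = [], []
    for word_a, word_b, score in gold_pairs:
        if word_a in vectors and word_b in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[word_a], vectors[word_b]))
    return spearman(gold, predicted)
```

Here `vectors` is a plain dictionary from word to list of floats, and `gold_pairs` is a list of `(word_a, word_b, score)` tuples. The toy values used to exercise it would be made up; real scores require the SemEval-2017 data.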

Download

XS (word vectors=1313423, dim=10): model.bin, model.vec
S (word vectors=1313423, dim=30): model.bin, model.vec
M (word vectors=1313423, dim=100): model.bin, model.vec
L (word vectors=1313423, dim=300): model.bin, model.vec
new L (word vectors=1451827, dim=300): model.bin, model.vec
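The `model.vec` files use fastText's text format: a header line with the number of vectors and the dimension, followed by one line per word with its space-separated values. A minimal reader might look like the sketch below. The function name `read_vec` and its `limit` parameter are hypothetical, and loading over a million vectors into Python lists this way will be slow and memory-hungry.

```python
import io


def read_vec(handle, limit=None):
    """Parse a fastText .vec text stream into (count, dim, {word: vector})."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        # rstrip() also drops the trailing space some writers leave on each line.
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break
    return count, dim, vectors


if __name__ == "__main__":
    sample = io.StringIO("2 3\nhola 0.1 0.2 0.3 \nmundo 0.4 0.5 0.6 \n")
    print(read_vec(sample))
```

For a real file, something like `open("model.vec", encoding="utf-8")` can be passed as `handle`. The `model.bin` files are in fastText's binary format and can be loaded with `fasttext.load_model("model.bin")` from the Python bindings, which also gives vectors for out-of-vocabulary words through subword information.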

Reference

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

BotCenter/spanishWordEmbeddings
