This small repo collects a Makefile and a couple of (partially borrowed and/or adapted) scripts that produce GloVe models from freely available Wikipedia data, in Dutch and in French. It requires:
- the GloVe binaries
- the NLTK python library
- Giuseppe Attardi's wikiextractor script
The wikiextractor script can be fetched into a local `src` directory:
```bash
$ mkdir src
$ cd src
$ wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
$ cd ..
```
The procedure for Dutch ('nl') then goes as follows:
# Step 1: Download and unzip Wikipedia data
```bash
$ mkdir data
$ cd data
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles1.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles2.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles3.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles4.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles1.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles2.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles3.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles4.xml.bz2
$ cd ..
```
(Older dumps are eventually removed from dumps.wikimedia.org; if the 20160501 dump is gone, substitute a more recent date.)
# Step 2: Parse the data into smaller files
```bash
$ make LANG=nl parsewiki
```
The extracted files can be found under `texts/nl/text[1234]/??/wiki*`.
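WikiExtractor writes plain text with each article wrapped in `<doc>` tags. As a minimal sketch of how to iterate over the extracted articles (illustrative only, not the repo's own code; the path pattern is the one mentioned above):
```python
import glob
import re

# WikiExtractor wraps each article as:
#   <doc id="..." url="..." title="..."> article text </doc>
DOC_RE = re.compile(
    r'<doc id="[^"]*" url="[^"]*" title="([^"]*)">\n(.*?)</doc>',
    re.DOTALL)

def iter_articles(pattern='texts/nl/text1/??/wiki*'):
    """Yield (title, text) pairs from WikiExtractor output files."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding='utf-8') as f:
            for title, text in DOC_RE.findall(f.read()):
                yield title, text.strip()
```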
# Step 3: Split in sentences and tokenize
```bash
$ make LANG=nl out/nl_corpus.txt
```
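This is where the NLTK dependency comes in: the corpus file presumably ends up with one tokenized sentence per line. A rough sketch of that step using NLTK's pretrained Punkt models (the repo's actual script may differ in detail, and the lowercasing is an assumption):
```python
import nltk

nltk.download('punkt')  # pretrained Punkt sentence-splitting models

def to_corpus_lines(text, language='dutch'):
    """Turn raw article text into lines of space-separated tokens."""
    for sentence in nltk.sent_tokenize(text, language=language):
        tokens = nltk.word_tokenize(sentence, language=language)
        # Lowercasing is an assumption here, not necessarily the repo's choice.
        yield ' '.join(token.lower() for token in tokens)

for line in to_corpus_lines('Dit is een zin. En dit is er nog een.'):
    print(line)
```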
# Step 4: Build the vocab file (using GloVe binaries)
```bash
$ make LANG=nl VOCAB_MIN_COUNT=3 out/nl_vocab.txt
```
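Under the hood this presumably runs GloVe's `vocab_count` tool, which counts token frequencies over the whole corpus and discards tokens seen fewer than `VOCAB_MIN_COUNT` times. A plain-Python equivalent of that counting step (a sketch, not the actual binary):
```python
from collections import Counter

def build_vocab(corpus_path='out/nl_corpus.txt', min_count=3):
    """Count token frequencies, keeping tokens seen >= min_count times.

    Mirrors the one "token count" pair per line that vocab_count emits,
    most frequent token first.
    """
    counts = Counter()
    with open(corpus_path, encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]
```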
# Step 5: Build the GloVe model (using GloVe binaries)
```bash
$ make LANG=nl VECTOR_SIZE=50 out/nl_vectors.txt
```
# Step 6: Publish the final model in [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) format
```bash
$ make LANG=nl publish
```
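The word2vec text format differs from GloVe's output only in its first line, a `vocab_size vector_size` header; the vectors themselves are unchanged. Publishing therefore roughly amounts to the following (a sketch with hypothetical paths, not necessarily what the Makefile target does):
```python
def glove_to_word2vec(glove_path='out/nl_vectors.txt',
                      out_path='models/nl_vectors.w2v.txt'):
    """Prepend the 'vocab_size vector_size' header required by the
    word2vec text format; each line stays 'token dim1 dim2 ...'."""
    with open(glove_path, encoding='utf-8') as f:
        lines = f.readlines()
    vector_size = len(lines[0].split()) - 1  # first field is the token
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(f'{len(lines)} {vector_size}\n')
        f.writelines(lines)
```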
The final model is written to the _models_ directory. It is directly usable with [gensim's Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) module.
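For example (the model filename is hypothetical; use whatever `make publish` wrote to _models_):
```python
from gensim.models import KeyedVectors

# Path is a placeholder; point it at the published model file.
model = KeyedVectors.load_word2vec_format('models/nl_vectors.w2v.txt')
print(model.most_similar('fiets', topn=5))  # nearest neighbours of 'fiets'
```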
The procedure for French ('fr') is completely analogous: just substitute `LANG=fr` for `LANG=nl` throughout. Other languages are currently not supported by the (admittedly oversimplified) tokenization script.