# Creation of GloVe models from Wikipedia Data

This small repo collects a Makefile and a couple of scripts (partly borrowed and/or adapted) that produce GloVe models from freely available Wikipedia data, in Dutch and in French.

# Step 0: Install prerequisite software

- the GloVe binaries
- the NLTK Python library
- Giuseppe Attardi's wikiextractor script

```bash
$ mkdir src
$ cd src
$ wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
$ cd ..
```

The procedure for Dutch ('nl') then goes as follows:

# Step 1: Download and unzip Wikipedia data

```bash
$ mkdir data
$ cd data
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles1.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles2.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles3.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles4.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles1.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles2.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles3.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles4.xml.bz2
$ cd ..
```
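The download step can also be scripted. The following is a minimal Python sketch, not part of this repo: the helper names `dump_urls` and `fetch_and_unpack` are hypothetical, the number of parts is a parameter you have to look up per dump, and it assumes the dump is still hosted at the URLs used above.

```python
import bz2
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://dumps.wikimedia.org"


def dump_urls(lang, date, parts):
    """Return the URLs of the numbered pages-articles files of one dump.

    `parts` is an assumption supplied by the caller (4 for the Dutch dump above).
    """
    name = f"{lang}wiki"
    return [
        f"{BASE_URL}/{name}/{date}/{name}-{date}-pages-articles{i}.xml.bz2"
        for i in range(1, parts + 1)
    ]


def fetch_and_unpack(url, data_dir="data"):
    """Download one .bz2 file into data_dir and decompress it in place."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)
    xml_path = archive.with_suffix("")  # drops the trailing ".bz2"
    with bz2.open(archive, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    archive.unlink()  # like bunzip2, keep only the unpacked file
    return xml_path
```

Calling `fetch_and_unpack(url)` for each `url` in `dump_urls("nl", "20160501", 4)` should leave the same four `.xml` files in `data/` as the commands above.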

# Step 2: Parse the data into smaller files

```bash
$ make LANG=nl parsewiki
```

The extracted files can be found in `texts/nl/text[1234]/??/wiki*`.

# Step 3: Split in sentences and tokenize

```bash
$ make LANG=nl out/nl_corpus.txt
```
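To give an idea of what this step produces, here is a rough Python sketch of sentence splitting and tokenization. It is a regex-based stand-in, not the NLTK-based script in this repo, and it assumes the corpus file holds one lowercased sentence per line with tokens separated by single spaces. The name `corpus_lines` is hypothetical.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is either a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def corpus_lines(text):
    """Split text into sentences and return one space-joined token line each."""
    lines = []
    for sentence in SENTENCE_END.split(text.strip()):
        tokens = TOKEN.findall(sentence.lower())
        if tokens:  # skip empty fragments
            lines.append(" ".join(tokens))
    return lines
```

A rule this simple mis-splits abbreviations such as "d.w.z." or "M. Dupont", which is one reason to prefer a trained sentence splitter over a regex.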

# Step 4: Build the vocab file (using GloVe binaries)

```bash
$ make LANG=nl VOCAB_MIN_COUNT=3 out/nl_vocab.txt
```
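The vocab file lists each word with its corpus frequency, one `word count` pair per line, and `VOCAB_MIN_COUNT` drops words that occur less often than the threshold. The Python sketch below mimics that behaviour for illustration only. It is not what the Makefile runs (that is the GloVe binary), and both the function name `build_vocab` and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


def build_vocab(lines, min_count=3):
    """Count whitespace-separated tokens and keep those seen >= min_count times.

    Returns (word, count) pairs sorted by descending count; ties are broken
    alphabetically here, which is an assumption about the real tool.
    """
    counts = Counter(token for line in lines for token in line.split())
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def write_vocab(vocab, path):
    """Write the pairs in the 'word count' text layout, one per line."""
    with open(path, "w", encoding="utf-8") as out:
        for word, n in vocab:
            out.write(f"{word} {n}\n")
```

A higher `min_count` gives a smaller vocabulary and a smaller model, at the cost of losing rare words.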

# Step 5: Build the GloVe model (using GloVe binaries)

```bash
$ make LANG=nl VECTOR_SIZE=50 out/nl_vectors.txt
```
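Training with the GloVe binaries normally chains three programs: `cooccur`, `shuffle` and `glove`. The sketch below shows how such a chain could be assembled from Python. It is an assumption about what the Makefile target does rather than a copy of it: the function names, the `build` directory and the intermediate file names are hypothetical, and only flags from GloVe's own demo script are used.

```python
import subprocess
from pathlib import Path


def glove_commands(corpus, vocab, save_prefix, vector_size=50, work_dir="build"):
    """Return (argv, stdin_path, stdout_path) triples for the three GloVe tools."""
    work = Path(work_dir)
    cooc = str(work / "cooccurrence.bin")       # hypothetical file name
    shuf = str(work / "cooccurrence.shuf.bin")  # hypothetical file name
    return [
        # cooccur reads the corpus on stdin and writes binary counts to stdout
        (["cooccur", "-vocab-file", vocab], corpus, cooc),
        # shuffle randomizes the order of the co-occurrence records
        (["shuffle"], cooc, shuf),
        # glove trains the vectors and saves them as <save_prefix>.txt
        (["glove", "-input-file", shuf, "-vocab-file", vocab,
          "-save-file", save_prefix, "-vector-size", str(vector_size)],
         None, None),
    ]


def run_all(commands):
    """Run each command in order, wiring stdin/stdout to files where given."""
    for argv, stdin_path, stdout_path in commands:
        if stdout_path:
            Path(stdout_path).parent.mkdir(parents=True, exist_ok=True)
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()
```

With the GloVe binaries on the `PATH`, `run_all(glove_commands("out/nl_corpus.txt", "out/nl_vocab.txt", "out/nl_vectors"))` would be the rough equivalent of the `make` call above.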

# Step 6: Publish the final model in [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) format

```bash
$ make LANG=nl publish
```

The final model is written in the directory _models_. It is directly usable with [gensim's Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) module.
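The word2vec text format differs from GloVe's text output only by a first line holding the vocabulary size and the vector dimension. The following Python sketch shows that conversion under the assumption that the input is a plain-text GloVe vectors file with one `word v1 v2 ...` row per line. It is not the repo's publish script, and the name `glove_to_word2vec` is hypothetical.

```python
def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<rows> <dim>' header line.

    Returns (rows, dim). The file is read twice so that large models do not
    have to fit in memory.
    """
    rows, dim = 0, None
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if dim is None:
                # first field is the word, the rest are vector components
                dim = len(line.rstrip("\n").split(" ")) - 1
            rows += 1
    if dim is None:
        raise ValueError(f"no vectors found in {glove_path}")

    with open(glove_path, encoding="utf-8") as src, \
            open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{rows} {dim}\n")
        for line in src:
            if line.strip():
                dst.write(line if line.endswith("\n") else line + "\n")
    return rows, dim
```

A file written this way can be loaded in gensim with `KeyedVectors.load_word2vec_format(path, binary=False)`, after which `most_similar` queries work as for any word2vec model.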

The procedure for French ('fr') is completely analogous. Other languages are currently not handled by the (admittedly oversimplified) tokenization script.
