Creation of GloVe (Global Vectors for Word Representation) models from Wikipedia data

fdurant/wiki_glove

Creation of GloVe models from Wikipedia Data

This small repo collects a Makefile and a couple of scripts (partially borrowed and/or adapted) that produce GloVe models from freely available Wikipedia data, in Dutch and in French.

# Step 0: Install prerequisite software

```bash
$ mkdir src
$ cd src
$ wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
$ cd ..
```

The procedure for Dutch ('nl') then goes as follows:

# Step 1: Download and unzip Wikipedia data

```bash
$ mkdir data
$ cd data
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles1.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles2.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles3.xml.bz2
$ wget https://dumps.wikimedia.org/nlwiki/20160501/nlwiki-20160501-pages-articles4.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles1.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles2.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles3.xml.bz2
$ bunzip2 nlwiki-20160501-pages-articles4.xml.bz2
$ cd ..
```
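The four downloads above follow a single URL pattern, so they can also be written as a loop. The following is a minimal sketch, not part of the repo: the helper names `dump_url` and `fetch_dumps` are hypothetical, and it assumes the dump is split into exactly four chunks, as in the commands above.

```bash
# Hypothetical helper: build the URL of one chunk of a Wikipedia dump.
# Arguments: language code, dump date (YYYYMMDD), chunk number.
dump_url() {
    lang=$1; stamp=$2; chunk=$3
    echo "https://dumps.wikimedia.org/${lang}wiki/${stamp}/${lang}wiki-${stamp}-pages-articles${chunk}.xml.bz2"
}

# Hypothetical helper: download chunks 1..4 into ./data and decompress them.
# Assumes wget and bunzip2 are installed.
fetch_dumps() {
    lang=$1; stamp=$2
    mkdir -p data
    for chunk in 1 2 3 4; do
        url=$(dump_url "$lang" "$stamp" "$chunk")
        wget -P data "$url" && bunzip2 "data/$(basename "$url")"
    done
}
```

Calling `fetch_dumps nl 20160501` would then reproduce the commands of this step. A dump from 2016 may no longer be hosted, in which case the date needs to be replaced by that of an available dump.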

# Step 2: Parse the data into smaller files

```bash
$ make LANG=nl parsewiki
```

The extracted files end up in `texts/nl/text[1234]/??/wiki*`.
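A quick way to check that parsing produced output is to count the files matching that pattern. This is only a sketch: the helper name `count_extracted` is hypothetical, and it must be run from the repo's top-level directory.

```bash
# Hypothetical helper: count the files produced by the parsing step.
# The glob mirrors the layout texts/<lang>/text[1234]/??/wiki* described above.
count_extracted() {
    lang=$1
    n=0
    for f in texts/"$lang"/text[1234]/??/wiki*; do
        # If nothing matches, the pattern stays literal and this test fails.
        [ -f "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

For example, `count_extracted nl` prints the number of extracted Dutch files, and `0` if the step has not been run yet.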

# Step 3: Split in sentences and tokenize

```bash
$ make LANG=nl out/nl_corpus.txt
```
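To give an idea of what tokenization means here, the sketch below separates common punctuation marks from the surrounding words, so that each line becomes a sequence of space-separated tokens. It is a hypothetical illustration and not the repo's tokenization script: the function name `tokenize` and the set of punctuation marks are assumptions, and sentence splitting and language-specific cases (such as French elisions) are left out.

```bash
# Hypothetical sketch: put spaces around punctuation, then normalise whitespace.
# Reads text on stdin and writes one tokenized line per input line.
tokenize() {
    sed -e 's/\([.,;:!?()"]\)/ \1 /g' \
        -e 's/  */ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}
```

For instance, `echo 'Dit is een zin, toch?' | tokenize` yields `Dit is een zin , toch ?`.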

# Step 4: Build the vocab file (using GloVe binaries)

```bash
$ make LANG=nl VOCAB_MIN_COUNT=3 out/nl_vocab.txt
```
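The resulting vocabulary can be sanity-checked against the chosen threshold. The sketch below assumes the usual GloVe vocabulary format of one `word count` pair per line; the helper name `below_min_count` is hypothetical.

```bash
# Hypothetical check: print vocabulary entries whose count is below the threshold.
# Assumes one "word count" pair per line, separated by whitespace.
below_min_count() {
    vocab=$1; min=$2
    awk -v min="$min" '$2 < min' "$vocab"
}
```

Running `below_min_count out/nl_vocab.txt 3` should print nothing if every word occurs at least `VOCAB_MIN_COUNT=3` times.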

# Step 5: Build the GloVe model (using GloVe binaries)

```bash
$ make LANG=nl VECTOR_SIZE=50 out/nl_vectors.txt
```

# Step 6: Publish the final model in [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) format

```bash
$ make LANG=nl publish
```

The final model is written to the _models_ directory. It is directly usable with [gensim's Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) module.
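The word2vec text format differs from GloVe's text output only by a header line that gives the number of vectors and their dimension. The following minimal sketch shows such a conversion. It is not necessarily what `make publish` does: the function name `glove_to_word2vec` is hypothetical, and it assumes every input line has the form `word v1 v2 ... vN`.

```bash
# Hypothetical helper: prepend the word2vec text header to a GloVe vectors file.
# Assumes each input line is "word v1 v2 ... vN", separated by single spaces.
glove_to_word2vec() {
    src=$1; dst=$2
    rows=$(awk 'END { print NR }' "$src")          # number of vectors
    dims=$(awk 'NR == 1 { print NF - 1 }' "$src")  # dimension, from the first line
    { echo "$rows $dims"; cat "$src"; } > "$dst"
}
```

With `VECTOR_SIZE=50`, the second number on the first line of the output would be `50`. In recent gensim versions, a file in this format is typically loaded with `KeyedVectors.load_word2vec_format`.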

The procedure for French ('fr') is completely analogous. Other languages are currently not handled by the (admittedly oversimplified) tokenization script.
