
2. Getting the Corpus


We need a large corpus to train the word2vec model. All Wikipedia articles written in Turkish are available from the Wikimedia dumps. At the time of writing, the latest available dump is 20180101, which contains all articles up to 01/01/2018. Of course, you can use another corpus to train the word2vec model, but you will need to adapt it to the format the gensim library expects, as explained below.
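As a rough sketch of the download step, the snippet below fetches the compressed dump with Python's standard library. The exact URL and file name are assumptions based on the usual Wikimedia naming convention (trwiki-<date>-pages-articles.xml.bz2); check the dump index page for the file that is actually available.

```python
# Sketch of downloading the Turkish Wikipedia dump.
# The URL below is an assumption based on the standard Wikimedia naming
# convention; verify it at https://dumps.wikimedia.org/trwiki/ before running.
import urllib.request

DUMP_URL = ("https://dumps.wikimedia.org/trwiki/20180101/"
            "trwiki-20180101-pages-articles.xml.bz2")

# Download the compressed dump (several hundred megabytes) into the
# current working directory; later steps read it directly in .bz2 form.
urllib.request.urlretrieve(DUMP_URL, "trwiki-20180101-pages-articles.xml.bz2")
```

Keeping the file compressed is fine: gensim's Wikipedia tooling can read the .xml.bz2 archive directly, so there is no need to extract it before preprocessing.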

Previous: 1. Prerequisites
Next: 3. Preprocessing the Corpus