Problems of generating Corpus file #23

zhq2009 · 2016-07-27T15:26:43Z

Hello,

We are using prepare.sh to generate the corpus file, but the file it produces is empty. Could you please suggest how to solve this problem?

Thank you very much

dav009 · 2016-07-29T03:14:16Z

What language are you trying?
Can you paste the command you are running?

zhq2009 · 2016-08-02T04:17:24Z

Hello,

We are trying the English Wikipedia.
The command we are running is:

sudo sh prepare.sh en_US /mnt/data/

prepare.sh runs everything itself, including downloading the files and compiling the programs.
We are wondering whether we could get the executable programs directly. We are also experiencing compatibility problems, and the generated corpus file is empty.

Thank you very much
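
Before debugging prepare.sh itself, it can help to confirm what the generated file actually contains. The following is a minimal sketch in Python 3 (the thread itself uses Python 2.7); `corpus_summary` is a hypothetical helper and not part of this repository, and the path passed to it is whatever corpus file prepare.sh wrote on your machine.

```python
import os


def corpus_summary(path, preview_lines=3):
    """Return (size_in_bytes, first_lines) for a text corpus file.

    Hypothetical helper for illustration; it only reads the file and
    does not depend on prepare.sh or gensim.
    """
    size = os.path.getsize(path)
    preview = []
    # errors="replace" keeps the check from failing on odd bytes in the dump.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            preview.append(line.rstrip("\n"))
            if len(preview) >= preview_lines:
                break
    return size, preview
```

A size of 0 with an empty preview confirms that the file is truly empty, as opposed to merely small or still being written.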

zhq2009 · 2016-08-03T03:45:52Z

Hello,

We ran the commands in prepare.sh manually and generated the corpus file successfully. We are now training a model on the corpus file, and this is the output we got from the command:

...
Requirement already satisfied (use --upgrade to upgrade): requests in /usr/lib/python2.7/dist-packages (from smart-open>=1.2.1->gensim)
Cleaning up...
pid 13182's current affinity mask: ff
pid 13182's new affinity mask: ff

The program has stayed at this point for several hours, with the CPU at full load.

We are wondering whether the program is running correctly. Should we wait until we get the results?

Thank you very much

dav009 · 2016-08-03T09:31:09Z

ZH, depending on the corpus size, the number of dimensions, and the method (skip-gram or CBOW), it can take a long time. With the settings used for the shared models it usually took around 4 to 5 hours.
My advice is to let it run for a few hours (at least 6).

Be aware that if you installed gensim manually, it might not be using all the cores.
The script provided in this repo installs it such that it uses as many cores as possible.

The first stage of word2vec (gathering the vocabulary) only uses a single core, though. After that, the batches of matrix factorization are done in parallel using as many cores as possible.
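
On the question of cores: the "affinity mask: ff" lines in the log above look like the output of a taskset-style tool, which prints the set of allowed CPUs as a hexadecimal bitmask (that reading is an assumption, since the thread does not show the script). Under that assumption, a small sketch for decoding such a mask is shown below; `cpus_in_affinity_mask` is a hypothetical helper name.

```python
def cpus_in_affinity_mask(mask_hex):
    """Return the CPU indices allowed by a hexadecimal affinity mask.

    Bit i of the mask being set means the process may run on CPU i.
    """
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


print(cpus_in_affinity_mask("ff"))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

A mask of ff therefore only says the process is allowed on eight CPUs. Whether training actually uses them depends on how gensim was installed, as described above.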

zhq2009 · 2016-08-09T03:07:12Z

Hello,

We used the command "wiki2vec.sh corpus output/model.w2c 50 500 10" to generate the model file. After the program had run for 20 hours, we got the error message "IOError: [Errno 2] No such file or directory: '/home/_/_/wiki2vec/wiki2vec-master/results/model.w2c.syn1neg.npy'".

Could you please give us some suggestions about how to solve the problem?

Thank you very much.
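
The thread does not record how this error was resolved. One plausible cause, stated here purely as an assumption, is that the directory the model is saved into did not exist when the save step ran. A cheap guard before starting a run that takes many hours is to create and check that directory up front. The sketch below uses only the Python 3 standard library; `ensure_output_dir` is a hypothetical helper and not part of wiki2vec.sh.

```python
import os


def ensure_output_dir(model_path):
    """Create the parent directory of model_path if missing and return it.

    Raises PermissionError if the directory exists but is not writable,
    so the problem surfaces before training instead of after it.
    """
    out_dir = os.path.dirname(os.path.abspath(model_path))
    os.makedirs(out_dir, exist_ok=True)
    if not os.access(out_dir, os.W_OK):
        raise PermissionError("cannot write to " + out_dir)
    return out_dir
```

Calling it with the same model path that is later passed to the training command means a missing or read-only directory is reported within seconds.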

RishabGargeya · 2016-12-31T09:14:06Z

Hi, @zhq2009 was this issue ever resolved?

zhq2009 · 2017-01-03T03:06:48Z

Hello, Yes, the problem was solved. Thank you very much.

matthewdparker · 2017-02-02T01:53:20Z

Hi, I'm having the same problem when I try to generate the Corpus file - the file keeps coming up empty. I'm running the following command:

sudo sh prepare.sh en_US ~/data

Do you know why this might be?

Thank you!

Aditi138 · 2017-06-12T07:47:48Z

Hi, I am also facing the same issue.

When I ran the following snippet:

from gensim.models import Word2Vec
model = Word2Vec.load("path/to/word2vec/en.model")
model.similarity('woman', 'man')

I got the following error:

    array.shape = shape
ValueError: cannot reshape array of size 108 into shape (1151090,1000)

Next, when I run "sudo sh prepare.sh en_US ~/data", the corpus file is empty.
Could the two be related? If not, how can I solve these two issues?
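
A ValueError of this form usually means the array file on disk holds far fewer values than its header declares, which would be consistent with an incomplete download or extraction of the model files. That diagnosis is an assumption; the thread does not confirm it. The sketch below estimates how large the array data should be for the shape in the error, assuming 4-byte float32 values (believed to be gensim's default), and compares it with a file's size on disk. Both function names are hypothetical.

```python
import os


def expected_payload_bytes(rows, cols, bytes_per_value=4):
    """Bytes of raw array data for a rows x cols matrix (float32 assumed)."""
    return rows * cols * bytes_per_value


def looks_truncated(path, rows, cols, bytes_per_value=4):
    """True if the file is smaller than the array data it should contain.

    A valid .npy file is slightly larger than the payload because it
    also carries a short header, so "smaller" is a safe warning sign.
    """
    return os.path.getsize(path) < expected_payload_bytes(rows, cols, bytes_per_value)


print(expected_payload_bytes(1151090, 1000))  # → 4604360000
```

Under these assumptions the array file should be about 4.6 GB. Running `looks_truncated` on the .npy files stored next to the model (an earlier comment shows one named model.w2c.syn1neg.npy) would show whether any of them is much smaller than that.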

keynmol added the fandango label Jul 27, 2016

Lugrin added the backlog label Oct 21, 2016

Lugrin added icebox and removed backlog labels Apr 10, 2017

mal removed the fandango label Jan 10, 2018
