Does the vocab_size match the actual size of vocab in word2vec.c? #6

GoogleCodeExporter · 2015-04-19T01:53:15Z



What steps will reproduce the problem?
1. Download attached text_simple train file
2. Compile word2vec.c as: gcc word2vec.c -o word2vec -lm -pthread
3. Run: ./word2vec -train text_simple -save-vocab vocab.txt

What is the expected output? What do you see instead?
Expect in saved vocab.txt file:
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
nine 9
===============
What is actually seen in the file:
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
===============

The last element "nine 5" was missing.

What version of the product are you using? On what operating system?
MacOS, gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 
2336.11.00)

Please provide any additional information below.

This is NOT really a bug report, because I do not fully understand the format 
of train_file or how the vocab is constructed from it.

Based on the source code of word2vec.c, when reading from train_file, it will

1. insert </s> as the first element in vocab

2. scan each word (or </s> for newline) in train_file, add it to vocab, and 
hash it in vocab_hash

So far the vocab_size = the number of words in vocab, INCLUDING </s> at the head

3. sort the words in vocab based on their counts, but keep </s> as the first of 
vocab

Now the vocab_size becomes the number of words in vocab, EXCLUDING the leading 
</s>. And if there is no newline character in train_file, </s> won't even be 
hashed in vocab_hash

So there is an inconsistency here between vocab_size and the actual size of 
vocab (including </s>). It could be a bug, because later code always iterates 
over the vocab from index 0 to vocab_size-1, as in SaveVocab(). As a result, the 
leading </s> is saved but the last element in vocab is skipped. At least that 
is what it looks like with the simple train file "text_simple" attached here.
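
To make the suspected off-by-one concrete, here is a minimal sketch. It is a toy model, not the code from word2vec.c: the struct, the helper save_entries() and the array toy_vocab are hypothetical names made up for illustration. It assumes only what is described above, namely that the array holds the sentinel </s> plus eight real words, while the loop bound counts one entry fewer than the array stores.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified vocabulary entry; NOT the struct from word2vec.c. */
struct entry {
    const char *word;
    long long count;
};

/* Toy vocabulary from the report: the </s> sentinel plus eight real words. */
static const struct entry toy_vocab[] = {
    {"</s>", 0}, {"and", 12}, {"the", 11}, {"four", 10}, {"in", 8},
    {"used", 5}, {"war", 5},  {"one", 5},  {"nine", 5},
};

/* Number of slots actually stored in the array (9, sentinel included). */
enum { STORED = sizeof(toy_vocab) / sizeof(toy_vocab[0]) };

/*
 * Writes one "word count" line per entry for indices 0 .. n-1 into out.
 * Returns how many entries were written. Stops early if out is too small.
 */
static int save_entries(const struct entry *vocab, int n,
                        char *out, size_t out_size) {
    assert(vocab != NULL && out != NULL && out_size > 0);
    size_t used = 0;
    int written = 0;
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        int len = snprintf(out + used, out_size - used, "%s %lld\n",
                           vocab[i].word, vocab[i].count);
        if (len < 0 || (size_t)len >= out_size - used) {
            out[used] = '\0';  /* discard the partially written line */
            break;
        }
        used += (size_t)len;
        written++;
    }
    return written;
}
```

Calling save_entries(toy_vocab, STORED - 1, ...) models a size that excludes </s> while the loop still starts at index 0: the sentinel is written and "nine 5" is not. Passing STORED writes all nine lines. Whether word2vec.c behaves this way for the same reason would need to be checked against the actual source.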

Original issue reported on code.google.com by [email protected] on 25 Aug 2013 at 2:38

Attachments:

text_simple


GoogleCodeExporter · 2015-04-19T01:53:15Z

The same confusion applies to the function CreateBinaryTree() in word2vec.c, 
where the array representation of the tree uses vocab_size*2+1 elements, which 
I understand is essentially (len(vocab)-1)*2+1. Does that make sense because 
only n-1 internal nodes are needed for a full binary tree with n leaf nodes?
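
The node count in that question can be checked with a small sketch. This is not CreateBinaryTree() from word2vec.c; count_internal_nodes() is a hypothetical helper that builds a Huffman-style tree with a naive O(n^2) search for the two smallest weights and only counts the merges. It illustrates that n leaves always need n-1 internal nodes, so the tree has 2n-1 nodes in total, and with n = len(vocab) that equals (len(vocab)-1)*2+1.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Repeatedly merges the two smallest weights until a single root remains
 * and returns how many internal (merged) nodes were created.
 * Returns -1 if memory allocation fails.
 * Hypothetical helper for illustration; not the method used in word2vec.c.
 */
static int count_internal_nodes(const long long *leaf_counts, int n_leaves) {
    assert(leaf_counts != NULL && n_leaves >= 1);
    long long *w = malloc((size_t)n_leaves * sizeof *w);
    if (w == NULL) return -1;
    for (int i = 0; i < n_leaves; i++) w[i] = leaf_counts[i];

    int live = n_leaves;   /* weights still waiting to be merged */
    int internal = 0;      /* internal nodes created so far */
    while (live > 1) {
        /* Find a (smallest) and b (second smallest) among w[0 .. live-1]. */
        int a = 0, b = 1;
        if (w[b] < w[a]) { a = 1; b = 0; }
        for (int i = 2; i < live; i++) {
            if (w[i] < w[a]) { b = a; a = i; }
            else if (w[i] < w[b]) { b = i; }
        }
        /* Merge them: keep the sum in the lower slot, and fill the
           higher slot with the last live weight. */
        long long merged = w[a] + w[b];
        int lo = a < b ? a : b;
        int hi = a < b ? b : a;
        w[lo] = merged;
        w[hi] = w[live - 1];
        live--;
        internal++;
    }
    free(w);
    return internal;
}
```

For the nine entries in the expected vocab.txt (</s> included) this gives 8 internal nodes, so 9 + 8 = 17 nodes overall, the same value as (9-1)*2+1.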

Many thanks if someone can clarify this a little bit.

Original comment by [email protected] on 25 Aug 2013 at 3:24

GoogleCodeExporter · 2015-04-19T01:53:15Z

Sorry the expected output of saved vocab.txt should be
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
nine 5
===============

It is a typo in the last line

Original comment by [email protected] on 26 Aug 2013 at 5:12

GoogleCodeExporter · 2015-04-19T01:53:15Z

I am also looking for the reason. Some words are missing from the trained 
model, even though they should be in the vocabulary since I set min_count = 
1. 
(I am working on CentOS.)

Original comment by [email protected] on 27 Jan 2014 at 3:05

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Apr 19, 2015
