What steps will reproduce the problem?
Using the Python word2vec module in IPython, I loaded the model from
GoogleNews-vectors-negative300.bin with:
model = word2vec.load("~/Downloads/GoogleNews-vectors-negative300.bin")
What is the expected output? What do you see instead?
The vocabulary of the model looks like it is made up of English words that have
been stripped of their first character. As a result, many common words are
missing, and the correctly spelled words that do appear in the vocabulary are
actually collisions created by removing the first character.
For instance:
model.cosine('out')
returns:
{'out': [('outs', 0.8092596703376076),
('eavyweight_bout', 0.65542583911176289),
('ightweight_bout', 0.64856198153561295),
('ndercard_bout', 0.62005739361720136),
('iddleweight_bout', 0.61811559624397572),
('assily_Jirov', 0.61172633394627596),
('atchweight_bout', 0.60739346001729411),
('uper_middleweight_bout', 0.60237084554945242),
('eatherweight_bout', 0.60183827323165029),
("KO'd", 0.60002383627451883)]}
The string 'out' actually represents the English word 'bout', which has been
correctly grouped with other boxing terms. Note that the similar terms are also
missing their first characters.
Another example:
model.cosine('aul')
returns:
{'aul': [('ohn', 0.82979825790046702),
('eter', 0.750119256790031),
('ark', 0.71162490811744983),
('ndrew', 0.66359523924163855),
('hris', 0.66228796043431837),
('ichard', 0.66142257169136376),
('hilip', 0.6576444040097873),
('ichael', 0.64312885937086905),
('on', 0.64042190735670823),
('avid', 0.63592487085268301)]}
This group of words is gibberish but represents a cluster of common male names.
The full names, such as 'john', are however not present in the vocabulary.
It looks like the vector representation is doing a very good job of capturing
the linguistic structure, but the missing first characters create many
unfortunate collisions.
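A quick way to confirm the off-by-one (a minimal check, assuming the loaded
model exposes its vocabulary as model.vocab, as in the comments further down;
the words are just the examples from above):

# hypothetical sanity check after model = word2vec.load(...)
print('bout' in model.vocab)   # expected True, observed False with the bug
print('out' in model.vocab)    # present, but it actually stands in for 'bout'
print('john' in model.vocab)   # expected True, observed False with the bug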
Original issue reported on code.google.com by [email protected] on 15 Dec 2014 at 7:54
I noticed this problem myself. In the file wordvectors.py, line 171 reads an
extra character after each vector, which silently swallows the first letter of
the following word. If you comment that line out it works, i.e.
171: #fin.read(1) # newline
The suggestion above worked for me. Note that you have to undo it if you read
binary files other than the Google News one: if you keep the newline read
commented out for those files, you will get corrupt vocabs like this:
vocab
-->
array([u'\nthe', u'\nof', u'\nand', u'\nto', u'\nin', u'\nor', u'\na',
u'\nfor', u'\nany', u'\nby', u'\nas', u'\nThe', u'\nbe', u'\nsuch',
u'\nshall', u'\nCompany', u'\nis', u'\non', u'\n._.'],
dtype='<U78')
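For what it's worth, a format-tolerant reader can sidestep the either/or:
read each word byte by byte, drop any leading newline left over from the
previous vector, and never consume a byte after the vector itself. The sketch
below is only an illustration of that idea as a standalone helper (it is not
part of the word2vec package), assuming the usual binary layout of a
"vocab_size vector_size" header followed by space-terminated words and
float32 vectors:

import numpy as np

def load_word2vec_bin(path):
    # Standalone sketch, not a patch to wordvectors.py.
    with open(path, 'rb') as fin:
        vocab_size, vector_size = map(int, fin.readline().split())
        vocab = []
        vectors = np.empty((vocab_size, vector_size), dtype=np.float32)
        binary_len = np.dtype(np.float32).itemsize * vector_size
        for i in range(vocab_size):
            # Read the word up to the separating space, skipping the optional
            # newline that some files write after the previous vector.
            chars = []
            while True:
                ch = fin.read(1)
                if ch == b' ':
                    break
                if ch != b'\n':
                    chars.append(ch)
            vocab.append(b''.join(chars).decode('utf-8', errors='replace'))
            vectors[i] = np.frombuffer(fin.read(binary_len), dtype=np.float32)
    return vocab, vectors

With this, both the Google News file (no newline after each vector) and files
that do write one load without stripped or newline-prefixed words.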