
How to embed word vectors in solr #7

Open
damachi49 opened this issue Jul 16, 2019 · 1 comment

Comments

@damachi49
@simonhughes22

Hello,

I was wondering what steps you took to embed the trained word embeddings in Solr. There seems to be no documentation on how to do it in Solr. Thanks a lot for your time and help.

Best regards

@simonhughes22
Contributor

simonhughes22 commented Jul 16, 2019

You have to make the outputs of word2vec work within an inverted index, which works using sparse tokens (i.e. words) and not dense vectors.

In a nutshell:

  1. Extract keywords and phrases. Phrases are very important: treating hadoop and developer as two separate keywords and averaging the vectors didn't work well, whereas treating hadoop_developer as a single word2vec token works very well. How you do this is an NLP question; look for research on identifying collocations. PMI is one way to do this; there are many others.
  2. Train a word2vec model that includes the words and phrases from step 1.
  3. Then either
    • Query word2vec for all words and phrases from step 1 and take the top n (say 10) terms, ranked by similarity; word2vec supports this sort of query. Then at query time, use these keywords for query expansion, using the cosine similarity to set the word or phrasal boosts. However, make sure the original query terms are still given the highest boost. Also make sure the query is not an AND query, as we want it to match any of the associated expansion terms or phrases.
    • Or cluster the embedding vectors using a clustering algorithm, e.g. k-means, and map each word -> vector -> cluster. Assign a unique id to each cluster. Then at index time, index words/phrases into a cluster field containing these cluster ids; at query time, look up the corresponding cluster ids and search on them.
      • (so q=<terms>^5 OR cluster_field:<cluster_ids>^1). Tune for appropriate query boosts, in place of 5 and 1.

All of the code in the repo will help you do the above, including phrasal identification, but it may not be 100% clear what is doing what. I strongly recommend reading the associated PowerPoint deck and watching the Lucene Revolution talk (linked to in GitHub) before going any further, if you haven't already.
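The clustering route (the second option in step 3) can be sketched as follows. This is a hypothetical sketch, not the repo's code: it uses scikit-learn's KMeans, the vectors are random stand-ins for trained word2vec embeddings, and the field name cluster_field and boosts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["hadoop", "hadoop_developer", "spark", "java", "python", "ruby"]
vectors = rng.normal(size=(len(vocab), 50))  # stand-in for model.wv vectors

# Cluster the embedding vectors, then map word -> vector -> cluster id.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
word_to_cluster = dict(zip(vocab, kmeans.labels_))

def index_clusters(doc_tokens):
    """Cluster ids to write into the document's cluster field at index time."""
    return sorted({int(word_to_cluster[t])
                   for t in doc_tokens if t in word_to_cluster})

def cluster_query(terms, term_boost=5, cluster_boost=1):
    """Boosted OR query over the original terms and their cluster ids."""
    clauses = [f"{t}^{term_boost}" for t in terms]
    clauses += [f"cluster_field:{word_to_cluster[t]}^{cluster_boost}"
                for t in terms if t in word_to_cluster]
    return " OR ".join(clauses)

print(cluster_query(["hadoop_developer"]))
```

The cluster id acts as a coarse sparse token standing in for the dense vector, which is what lets an inverted index match "semantically similar" documents without storing vectors at all.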

HTH
