
How to embed word vectors in solr #7

Open
damachi49 opened this issue Jul 16, 2019 · 1 comment

Comments

@damachi49
@simonhughes22

Hello,

I was wondering what steps you took to embed the trained word embeddings in Solr. There seems to be no documentation on how to do it in Solr. Thanks a lot for your time and help.

Best regards

@simonhughes22
Contributor

simonhughes22 commented Jul 16, 2019

You have to make the outputs of word2vec work within an inverted index, which works using sparse tokens (i.e. words) and not dense vectors.

In a nutshell:

  1. Extract keywords and phrases. Phrases are very important: treating hadoop and developer as two separate keywords and averaging the vectors didn't work well, whereas treating hadoop_developer as a single word2vec token works very well. How you do this is an NLP question; look for research on identifying collocations. PMI is one way to do this; there are many others.
  2. Train a word2vec model that includes the words and phrases from step 1.
  3. Then either
    • Query word2vec for all words and phrases from step 1 and take the top n (say 10) terms, ranked by similarity; word2vec supports this sort of query. Then at query time, use these keywords for query expansion, using the cosine similarity to set the word or phrasal boosts. However, make sure the original query terms are still given the highest boost. Also make sure the query is not an AND query, as we want it to match any of the associated expansion terms or phrases.
    • Or cluster the embedding vectors using a clustering algorithm, e.g. k-means, and map each word -> vector -> cluster. Assign a unique id to each cluster. Then at index time, index words/phrases into a cluster field containing these cluster ids; at query time, look up the corresponding cluster ids and search on them.
      • (so q=<terms>^5 OR cluster_field:<cluster_ids>^1). Tune for appropriate query boosts, in place of 5 and 1.

All of the code in the repo will help you do the above, including phrasal identification, but it may not be 100% clear what is doing what. I strongly recommend reading the associated PowerPoint deck and watching the Lucene Revolution talk (linked to in GitHub) before going any further, if you haven't already.
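The clustering route (the second option in step 3) can be sketched as follows. This is a hypothetical sketch, not the repo's code: it uses scikit-learn's KMeans, the vectors are random stand-ins for trained word2vec embeddings, and the field name cluster_field and boosts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["hadoop", "hadoop_developer", "spark", "java", "python", "ruby"]
vectors = rng.normal(size=(len(vocab), 50))  # stand-in for model.wv vectors

# Cluster the embedding vectors, then map word -> vector -> cluster id.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
word_to_cluster = dict(zip(vocab, kmeans.labels_))

def index_clusters(doc_tokens):
    """Cluster ids to write into the document's cluster field at index time."""
    return sorted({int(word_to_cluster[t])
                   for t in doc_tokens if t in word_to_cluster})

def cluster_query(terms, term_boost=5, cluster_boost=1):
    """Boosted OR query over the original terms and their cluster ids."""
    clauses = [f"{t}^{term_boost}" for t in terms]
    clauses += [f"cluster_field:{word_to_cluster[t]}^{cluster_boost}"
                for t in terms if t in word_to_cluster]
    return " OR ".join(clauses)

print(cluster_query(["hadoop_developer"]))
```

The cluster id acts as a coarse sparse token standing in for the dense vector, which is what lets an inverted index match "semantically similar" documents without storing vectors at all.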

HTH
