I was wondering what steps you took to embed the trained word embeddings in Solr. There seems to be no documentation on how to do this in Solr. Thanks a lot for your time and help.
Best regards
You have to make the outputs of word2vec work within an inverted index, which works using sparse tokens (i.e. words) rather than dense vectors.
In a nutshell:
1. Extract keywords and phrases (phrases are very important: treating "hadoop" and "developer" as two separate keywords and averaging their vectors didn't work well, whereas treating "hadoop_developer" as a single word2vec token works very well). How you do this is an NLP question; look for research on identifying collocations. PMI is one way to do this, and there are many others (see the first sketch after this list).
2. Train a word2vec model that includes the words and phrases from step 1 (see the second sketch after this list).
3. Then either:
(a) Query word2vec for each word and phrase from step 1 and take the top n (say 10) terms, ranked by similarity; word2vec supports this sort of query. Then at query time, use these terms for query expansion, using the cosine similarity to set the word or phrase boosts. However, make sure the original query terms are still given the highest boost. Also make sure the query is not an AND query, as you want it to match any of the associated expansion terms or phrases (sketched below).
(b) Or cluster the embedding vectors using a clustering algorithm, e.g. k-means, and map each word -> vector -> cluster. Assign a unique id to each cluster. Then at index time, index each document's words/phrases into a cluster field containing these cluster ids; at query time, look up the corresponding cluster ids and search on them (sketched below), so q=<query terms>^5 OR cluster_field:<cluster ids>^1. Tune for appropriate query boosts in place of 5 and 1.
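To make step 1 concrete, here is a minimal sketch of phrase detection using gensim's Phrases model with NPMI scoring. This is just one way to identify collocations and may well differ from what the repo actually does; the `load_tokenised_corpus` helper and the threshold value are illustrative assumptions.

```python
# Minimal sketch of collocation detection with gensim's Phrases model
# (one option among many; not necessarily what the repo uses).
from gensim.models.phrases import Phrases, Phraser

# `sentences` is assumed to be an iterable of tokenised documents, e.g.
# [["hadoop", "developer", "wanted"], ["java", "developer", "role"], ...]
sentences = load_tokenised_corpus()  # hypothetical helper

# Score candidate bigrams; "npmi" uses normalised pointwise mutual information,
# so the threshold lies in [-1, 1].
phrase_model = Phrases(sentences, min_count=5, threshold=0.5, scoring="npmi")
bigrams = Phraser(phrase_model)

# "hadoop developer" is now emitted as the single token "hadoop_developer".
phrased_sentences = [bigrams[sentence] for sentence in sentences]
```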
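For step 2, a minimal sketch of training word2vec on the phrased corpus from the previous sketch (gensim shown purely for illustration; the hyperparameters and file name are assumptions, not the repo's settings):

```python
# Minimal sketch of training word2vec over tokens that include phrases.
from gensim.models import Word2Vec

model = Word2Vec(
    phrased_sentences,   # tokens include phrases such as "hadoop_developer"
    vector_size=100,     # embedding dimensionality (named `size` in gensim < 4.0)
    window=5,
    min_count=5,
    workers=4,
)
model.save("embeddings.w2v")  # hypothetical file name
```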
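A minimal sketch of option (a), query expansion with similarity-weighted boosts. The field name `title` and the boost values are illustrative assumptions, not the repo's code:

```python
# Minimal sketch: expand the query with the top-n most similar terms,
# boosting expansions by cosine similarity and keeping the original
# terms on the highest boost. OR semantics, never AND.
from gensim.models import Word2Vec

model = Word2Vec.load("embeddings.w2v")

def expand_query(terms, topn=10, orig_boost=5.0, field="title"):
    clauses = [f"{field}:{t}^{orig_boost}" for t in terms]  # originals first, highest boost
    for term in terms:
        if term in model.wv:
            for similar, cosine in model.wv.most_similar(term, topn=topn):
                clauses.append(f"{field}:{similar}^{cosine:.2f}")
    return " OR ".join(clauses)

print(expand_query(["hadoop_developer"]))
```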
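And a minimal sketch of option (b), clustering the embeddings and querying on cluster ids. The field names, the number of clusters, and the boosts are illustrative assumptions:

```python
# Minimal sketch: k-means over the embedding vectors, word -> cluster id map,
# and a query that boosts the raw terms above their cluster ids.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

model = Word2Vec.load("embeddings.w2v")
words = model.wv.index_to_key    # vocabulary (words and phrases); `index2word` in gensim < 4.0
vectors = model.wv.vectors       # one dense vector per vocabulary entry

kmeans = KMeans(n_clusters=500, random_state=0).fit(vectors)
word_to_cluster = {w: int(c) for w, c in zip(words, kmeans.labels_)}

# At index time, write the cluster id of each indexed token into `cluster_field`.
# At query time, look up the cluster ids for the query terms:
def cluster_query(terms, text_boost=5, cluster_boost=1):
    term_clause = " OR ".join(f"text:{t}^{text_boost}" for t in terms)
    ids = {word_to_cluster[t] for t in terms if t in word_to_cluster}
    cluster_clause = " OR ".join(f"cluster_field:{c}^{cluster_boost}" for c in ids)
    return f"({term_clause}) OR ({cluster_clause})"

print(cluster_query(["hadoop_developer"]))
```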
All of the code in the repo will help you do the above, including the phrasal identification, although it may not be 100% clear what is doing what. I strongly recommend reading the associated PowerPoint deck and watching the Lucene Revolution talk (linked to in GitHub) before going any further, if you haven't already.