This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Adding support for FAISS #225

Open
raman-r-4978 opened this issue Sep 15, 2020 · 12 comments
Labels
Features New functionality added

Comments

@raman-r-4978

raman-r-4978 commented Sep 15, 2020

Do you have any plans to support faiss in addition to nmslib in the future?

A few issues I have encountered while using nmslib:

  • It doesn't support mmap, which leads to massive memory consumption while reading index graphs.
  • Nowadays, the most common vector sizes for NLP are 768 and 1024. At that size, adding ~1M vectors to an nmslib index takes far longer to build graphs than building a faiss IVFFlat-type index.
  • Speaking of time, graph merges add further complications on top of this.
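
To make the IVFFlat comparison concrete, here is a minimal pure-Python sketch of the inverted-file idea. It is not faiss code: `ToyIVFFlat`, `train_centroids`, and the parameter names are hypothetical and chosen only for illustration. After a one-off k-means training step, adding a vector is a single nearest-centroid lookup, and no graph has to be built or merged.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))

def train_centroids(sample, k, iters=10, seed=0):
    """Plain k-means: the one-off training step an inverted-file index needs."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(sample, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in sample:
            buckets[nearest(centroids, v)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

class ToyIVFFlat:
    """Hypothetical toy index: one flat list of raw vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]

    def add(self, vec_id, v):
        # Adding is one nearest-centroid lookup plus an append.
        self.lists[nearest(self.centroids, v)].append((vec_id, v))

    def search(self, q, k=1, nprobe=1):
        # Scan only the nprobe lists whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(self.centroids[i], q))
        candidates = [item for i in order[:nprobe] for item in self.lists[i]]
        candidates.sort(key=lambda item: sq_dist(item[1], q))
        return [vec_id for vec_id, _ in candidates[:k]]

# Usage: train once on a sample, then add and query.
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
index = ToyIVFFlat(train_centroids(data, k=4))
for i, v in enumerate(data):
    index.add(i, v)
hits = index.search(data[17], k=1, nprobe=1)
```

Raising `nprobe` trades query time for recall, since more lists are scanned per query.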

Since faiss has solid solutions for these issues, I would be happy to have both libraries integrated into this plugin.

Attaching an ES issue thread that you might be interested in.

I have also created Java bindings for faiss which can be found here.

Hope it helps

@vamshin
Member

vamshin commented Sep 16, 2020

Hi @ramanrajarathinam,

We plan to support FAISS, but nothing is concrete yet. We will keep this thread open as a feature request and prioritize it based on community feedback. If you end up on this thread looking for FAISS support, please +1 it.

@vamshin vamshin added the Features New functionality added label Sep 16, 2020
@Kavan72

Kavan72 commented Dec 15, 2020

+1

@YashalShakti

+1

2 similar comments
@hiro-v

hiro-v commented Jan 22, 2021

+1

@walker313504

+1

@greav

greav commented Mar 18, 2021

+1

@jmazanec15
Member

As an update, we are working to add faiss support to the plugin. We recently received a contribution that adds the library and its HNSW implementation. Because we do not see an improvement with faiss's HNSW over nmslib's, we have decided to incorporate other faiss methods before releasing. We will build on that contribution in the faiss-support branch. We are looking into adding functionality for inverted file (IVF) indices, product quantization, and composite indices. Because these methods require training, the implementation is a little more complex. In the coming weeks, we will publish an RFC. In the meantime, please feel free to "+1" or mention a specific faiss feature you would like to have supported.

@alwc

alwc commented Mar 21, 2021

+1

@luyuncheng

+1. As ml-supervised-workflow shows, maybe we can use a similar workflow for faiss training.

@jmazanec15
Member

@luyuncheng That is an Elastic commercial feature, so we cannot use that.

I am exploring a couple of approaches to training. The first adds a training step to the SaveIndex JNI function that takes a subset of the vectors that will be indexed and uses them for training. This approach has several flaws, including:

  1. With training, segment creation can be very costly, producing long indexing times. In my experiments, training a Product Quantizer was quite costly.
  2. For encoding-based methods, the raw vectors still need to be stored with Lucene, because I do not believe it is possible to merge the encodings of 2 faiss indices with separately trained encoders without losing a significant amount of information.
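
The second flaw can be illustrated with a small sketch. This is not faiss's Product Quantizer: it is a deliberately simplified per-dimension quantizer (in effect, sub-vectors of size 1), and `train`, `encode`, and `decode` are hypothetical helper names. The point it shows is that a code is only an index into the codebook that produced it, so codes from separately trained encoders cannot simply be mixed in one merged index.

```python
def train_codebook(values, k):
    """Pick k representative values (evenly spaced quantiles) for one dimension."""
    ordered = sorted(values)
    step = len(ordered) / k
    return [ordered[int(step * i + step / 2)] for i in range(k)]

def train(vectors, k=4):
    """Train one codebook per dimension from the vectors of a single segment."""
    return [train_codebook(col, k) for col in zip(*vectors)]

def encode(codebooks, v):
    """Replace each component with the index of its nearest codebook entry."""
    return [min(range(len(cb)), key=lambda c: abs(cb[c] - x))
            for cb, x in zip(codebooks, v)]

def decode(codebooks, code):
    """Look each index up again; only meaningful with the encoder's own codebooks."""
    return [cb[c] for cb, c in zip(codebooks, code)]

def error(a, b):
    """Squared reconstruction error between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two segments whose vectors live in different regions, each trained separately.
seg_a = [[i / 10, (9 - i) / 10] for i in range(10)]
seg_b = [[10 + i / 10, 20 + i / 10] for i in range(10)]
codebooks_a = train(seg_a)
codebooks_b = train(seg_b)

v = seg_a[3]
code = encode(codebooks_a, v)
same_model_error = error(v, decode(codebooks_a, code))   # small
cross_model_error = error(v, decode(codebooks_b, code))  # large: wrong codebooks
```

Without the raw vectors, a merge would have to decode with one model and re-encode with another, which compounds the quantization loss of both.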

I am working on the mapping interface to support faiss's composite indices, so I implemented this approach to be able to create trained faiss indices to test the interface.

As a second approach, I am going to explore adding a "train" API. A user would create an Elasticsearch faiss index and a separate Elasticsearch index containing the training data. Calling the "train" API would then create a faiss library index based on the configuration of the Elasticsearch faiss index, train it with data from the training index, and serialize it into an Elasticsearch system index.

Then, when a user starts to ingest data, segment creation would not create a new, untrained index from faiss's index factory. Instead, it would create a copy of the empty, trained index from the faiss library index stored in the Elasticsearch system index. This way, training would incur only a one-time cost when the train API is called, which speeds up segment creation.
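
A rough sketch of that flow, under loud assumptions: `ToyIndex` stands in for a faiss library index, the `SYSTEM_INDEX` dict stands in for the Elasticsearch system index, and `train_api` and `create_segment` are hypothetical names rather than plugin or faiss functions. The "training" here only learns per-dimension bounds, but the shape is the same: train once, serialize the empty trained index, and deserialize a fresh copy per segment.

```python
import json

TRAIN_CALLS = 0    # counts how often the expensive training step runs
SYSTEM_INDEX = {}  # hypothetical stand-in for the Elasticsearch system index

class ToyIndex:
    """Stand-in for a library index that must be trained before vectors are added."""

    def __init__(self, bounds=None):
        self.bounds = bounds  # learned per-dimension [min, max]; None means untrained
        self.codes = []       # 8-bit codes for the vectors of one segment

    def train(self, sample):
        global TRAIN_CALLS
        TRAIN_CALLS += 1
        self.bounds = [[min(col), max(col)] for col in zip(*sample)]

    def add(self, v):
        if self.bounds is None:
            raise RuntimeError("index must be trained before adding vectors")
        code = []
        for x, (lo, hi) in zip(v, self.bounds):
            clamped = min(max(x, lo), hi)
            code.append(round(255 * (clamped - lo) / ((hi - lo) or 1.0)))
        self.codes.append(code)

    def serialize(self):
        # Only the trained state is stored; the template holds no vectors.
        return json.dumps({"bounds": self.bounds})

    @classmethod
    def deserialize(cls, blob):
        return cls(bounds=json.loads(blob)["bounds"])

def train_api(model_id, training_vectors):
    """One-time cost: train an empty index and store it serialized."""
    template = ToyIndex()
    template.train(training_vectors)
    SYSTEM_INDEX[model_id] = template.serialize()

def create_segment(model_id, vectors):
    """Per-segment cost: copy the trained template, then only encode."""
    index = ToyIndex.deserialize(SYSTEM_INDEX[model_id])
    for v in vectors:
        index.add(v)
    return index

train_api("model-1", [[0.0, 10.0], [1.0, 20.0]])
seg1 = create_segment("model-1", [[0.5, 15.0]])
seg2 = create_segment("model-1", [[1.0, 10.0], [0.0, 20.0]])
```

However many segments are created, `TRAIN_CALLS` stays at 1, and every segment encodes against the same bounds.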

Additionally, if all segments use the same trained models, it would be easier to perform segment merges without relying on storing the raw vectors in Lucene. But I have not explored this in detail yet.
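
As a sketch of why a shared model could simplify merges (unexplored territory, as noted above): if every segment's codes come from the same trained model, a merge can in principle concatenate codes without touching raw vectors. `Segment` and `merge_segments` are hypothetical names for illustration, not plugin or faiss APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    model_id: str                              # trained model that produced the codes
    codes: list = field(default_factory=list)  # (doc_id, code) pairs

def merge_segments(segments):
    """Concatenate encoded vectors, allowed only when all segments share one model."""
    model_ids = {s.model_id for s in segments}
    if len(model_ids) != 1:
        # Codes from different models are not comparable; this case would need
        # re-encoding from the raw vectors instead.
        raise ValueError("segments were encoded with different trained models")
    merged = Segment(model_id=model_ids.pop())
    for s in segments:
        merged.codes.extend(s.codes)
    return merged

a = Segment("model-1", [(0, [3, 7]), (1, [2, 2])])
b = Segment("model-1", [(2, [9, 1])])
merged = merge_segments([a, b])
```

Deleted documents and per-segment ID mapping are ignored here; a real merge would also have to handle those.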

I would appreciate any feedback on either of these approaches and any other different approaches that might be worth considering.

@luyuncheng

all segments use the same trained models
without relying on storing the raw vectors in Lucene

LGTM. I am wondering whether the training data should be stored in the same index or split into 2 separate indices.

@jmazanec15
Member

@luyuncheng My thinking on having a separate index is that it will be easier to delete. In theory, you could use the same index with this approach. The train API will require an index and a field in order to gather the training data. The index could be the same as the one being trained, but the training data would live in a separate field.


10 participants