This is surprising. What is the initial data dimensionality? Would it be possible to share the vectors?
-
Hey there! Thanks for taking a look! The vectors have shape: Here's a file containing a small subset (100) of the vectors: 2020_12_12_subset.fvecs.zip. Here is a link to a file containing the entire set of vectors added to a single index (one index per month of data). The training set is generated by randomly sampling across all monthly files, until we have between Here is a single vector (before normalizing):
-
Hi there @mdouze! Just wondering if you might have had a chance to take another look at this? Cheers!
-
Have you had any success with this problem?
-
@leothomas How are you dealing with these datasets now?
-
Summary
Hey there!
I'm looking into switching from the L2 distance metric to the inner product distance metric. Our current index is built using PCA + IVFFlat (`PCA128,IVF{K},Flat`) with L2 distance, where the input vectors have dimension 512 and the number of IVF centroids is defined as: `4*sqrt(N) < K < 16*sqrt(N)` (N = number of vectors indexed).
Compared to a `Flat` index, this index reaches a kNN intersection measure of 0.96 @ rank 100.
However, when I build the same index with an inner product distance metric (vectors are normalized prior to training, adding to the index, and searching for both the L2 and IP distance metrics), I get a kNN intersection measure of 0.019 @ rank 100. Setting `nprobe` to the number of centroids (to mimic a Flat search) actually reduces the kNN intersection measure to 0.003 @ rank 100.
Without the PCA pre-processing, the IVF index with the inner product distance metric has a kNN intersection measure of 0.97 @ rank 100, which is ideal, but the index is simply much too big to hold in memory.
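For readers who want to reproduce the numbers above, here is a minimal sketch of the two quantities mentioned: the centroid range `4*sqrt(N) < K < 16*sqrt(N)` and a kNN intersection measure at a given rank. The function names `centroid_bounds` and `knn_intersection` are hypothetical, and the exact definition of the measure used in the post is an assumption (overlap between the top-`rank` ids of the `Flat` reference and of the tested index, averaged over queries).

```python
import math


def centroid_bounds(n_vectors):
    """Return the open interval (4*sqrt(N), 16*sqrt(N)) for the IVF centroid count K."""
    root = math.sqrt(n_vectors)
    return 4 * root, 16 * root


def knn_intersection(reference_ids, candidate_ids, rank):
    """Average fraction of the reference top-`rank` ids also found in the candidate top-`rank`.

    Both arguments are per-query sequences of neighbor ids, e.g. the id
    matrices returned by a search on a Flat index and on the tested index.
    """
    if len(reference_ids) != len(candidate_ids):
        raise ValueError("both result sets must cover the same queries")
    hits = 0
    for ref_row, cand_row in zip(reference_ids, candidate_ids):
        # Faiss pads missing results with -1, so drop that id before comparing.
        ref = set(ref_row[:rank]) - {-1}
        cand = set(cand_row[:rank]) - {-1}
        hits += len(ref & cand)
    return hits / (len(reference_ids) * rank)
```

With this definition, a value of 0.96 @ rank 100 would mean that on average 96 of the 100 exact neighbors are returned by the approximate index.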
Is there some sort of fundamental incompatibility between PCA pre-processing and Inner Product distance metric?
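One way to reason about this question is sketched below. It is a possible explanation, not a diagnosis confirmed anywhere in this thread. It assumes the PCA step subtracts the training mean $\mu$ before projecting, so that the transform is $T(x) = A(x - \mu)$ with projection matrix $A$; the symbols $T$, $A$ and $\mu$ are introduced here for illustration.

$$
\begin{aligned}
\|x - y\|^2 &= 2 - 2\langle x, y\rangle \quad \text{for } \|x\| = \|y\| = 1, \\
\|T(x) - T(y)\|^2 &= \|A(x - y)\|^2, \\
\langle T(x), T(y)\rangle &= \langle Ax, Ay\rangle - \langle Ax, A\mu\rangle - \langle A\mu, Ay\rangle + \|A\mu\|^2 .
\end{aligned}
$$

The first line is why L2 and inner product give the same ranking on normalized input vectors. The second line shows that the mean cancels for L2, so centering does not disturb L2 neighbors. In the third line, for a fixed query $x$ the terms $\langle Ax, A\mu\rangle$ and $\|A\mu\|^2$ are constants, but $\langle A\mu, Ay\rangle$ depends on the database vector $y$, so the inner product ranking after the transform can differ from the ranking by $\langle x, y\rangle$. The projected vectors are also no longer unit-norm, so the equivalence in the first line does not carry over to the 128-dimensional space. If this is the cause, re-normalizing after the projection or using a projection without centering might be worth testing.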
I was able to achieve excellent compression and a kNN intersection measure of ~0.70 @ rank 100 with the `OPQ{M}_{D},IVF{K},PQ{M}` and `OPQ{M}_{D},IVF{K}_HNSW32,PQ{M}` indexes with the inner product distance metric. Are there any other indexing recommendations for pre-processing, coarse or fine quantization, or even search-time parameters (`efSearch`, etc.) that might work better with the inner product distance metric?
Thanks again for taking a look at this!
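As a rough sketch of how the two OPQ-based indexes above might be set up with the inner product metric, the helpers below format the factory strings and pass them to `faiss.index_factory`. The helper names and any concrete values of M, D and K are hypothetical (the post does not state them), and the search-time parameter names `nprobe` and `quantizer_efSearch` are an assumption about which knobs are relevant here.

```python
def opq_ivf_pq_factory(m, d, k, hnsw_coarse=False):
    """Format `OPQ{M}_{D},IVF{K},PQ{M}`, optionally with an HNSW32 coarse quantizer."""
    if d % m != 0:
        raise ValueError("the OPQ output dimension D must be a multiple of M")
    coarse = f"IVF{k}_HNSW32" if hnsw_coarse else f"IVF{k}"
    return f"OPQ{m}_{d},{coarse},PQ{m}"


def build_inner_product_index(input_dim, factory):
    """Build an untrained index for `factory` that uses the inner product metric."""
    import faiss  # imported lazily so the string helper works without Faiss installed

    return faiss.index_factory(input_dim, factory, faiss.METRIC_INNER_PRODUCT)


def set_search_params(index, nprobe, ef_search=None):
    """Set search-time parameters by name; `ef_search` only applies to the HNSW variant."""
    import faiss

    params = faiss.ParameterSpace()
    params.set_index_parameter(index, "nprobe", nprobe)
    if ef_search is not None:
        params.set_index_parameter(index, "quantizer_efSearch", ef_search)
```

For example, `build_inner_product_index(512, opq_ivf_pq_factory(64, 256, 4096, hnsw_coarse=True))` would create the HNSW-quantized variant for 512-dimensional inputs; the index still needs to be trained on normalized vectors before adding data.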
Platform
OS: macOS 13.01
Faiss version: 1.7.3
Installed from: `pip install 'faiss-cpu==1.7.3'`
Faiss compilation options:
Running on:
Interface:
Reproduction instructions