Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:
DOC A: {"Dog": 0.43}
DOC B: {"Cat": 0.21}
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC D: {"Giant panda": 0.1}
DOC E: {"Dog": 0.33, "Cat": 0.66}
If I perform a query with the keyword ["Dog"], I want the following results (sorted by inner product distance):
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC A: {"Dog": 0.43}
DOC E: {"Dog": 0.33, "Cat": 0.66}
However, if I query with the keywords ["Dog", "Cat"], it is not acceptable to return documents with only a subset of these keywords because I want documents containing all the keywords (similar to the PostgreSQL @> operator). The sorting/ranking should then be done by inner product distance:
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC E: {"Dog": 0.33, "Cat": 0.66}
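The intended semantics can be sketched in Python (a toy model, not pgvector itself; `search` and `inner_product` are hypothetical names, and the query vector is assumed to carry weight 1.0 per keyword). Note that under that assumption DOC E's total (0.33 + 0.66 = 0.99) exceeds DOC C's (0.65 + 0.11 = 0.76), so a pure inner-product ranking of the AND query places E first; the ordering above would require different query weights.

```python
# Toy sketch of the desired semantics: filter documents that contain
# ALL query keywords (PostgreSQL @>-style), then rank by inner product.
# Keyword probabilities double as sparse-vector components.

def inner_product(doc, query_keywords):
    # Dot product against an implicit query vector with weight 1.0
    # for each query keyword and 0 everywhere else.
    return sum(doc.get(k, 0.0) for k in query_keywords)

def search(docs, query_keywords, require_all=True):
    if require_all:
        # @>-style containment: keep only docs holding every keyword.
        hits = {name: d for name, d in docs.items()
                if all(k in d for k in query_keywords)}
    else:
        # &&-style overlap: any shared keyword qualifies.
        hits = {name: d for name, d in docs.items()
                if any(k in d for k in query_keywords)}
    # Sort by descending inner product (pgvector's <#> is the negative
    # inner product, so ORDER BY ... <#> ascending is equivalent).
    return sorted(hits, key=lambda n: -inner_product(hits[n], query_keywords))

docs = {
    "A": {"Dog": 0.43},
    "B": {"Cat": 0.21},
    "C": {"Dog": 0.65, "Cat": 0.11},
    "D": {"Giant panda": 0.1},
    "E": {"Dog": 0.33, "Cat": 0.66},
}

print(search(docs, ["Dog"], require_all=False))  # ['C', 'A', 'E']
print(search(docs, ["Dog", "Cat"]))  # ['E', 'C'] under unit query weights
```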
Final Notes
The plain keywords search in PostgreSQL can be achieved with the following syntax and operators:
WHERE query_keywords && ARRAY['keyword1', 'keyword2', 'keyword3']; -- OR between keywords
WHERE query_keywords @> ARRAY['keyword1', 'keyword2', 'keyword3']; -- AND between keywords
This type of data can also be accelerated with an Inverted Index in PostgreSQL: GIN.
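Why a GIN-style inverted index helps can be sketched in Python: each keyword maps to a posting list of document ids, and an AND query intersects those lists instead of scanning every row (all names here are illustrative, not PostgreSQL internals):

```python
from collections import defaultdict

# Build a toy inverted index: keyword -> set of document ids,
# mirroring what a GIN index stores for array/keyword columns.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for kw in keywords:
            index[kw].add(doc_id)
    return index

docs = {
    "A": {"Dog": 0.43},
    "B": {"Cat": 0.21},
    "C": {"Dog": 0.65, "Cat": 0.11},
    "D": {"Giant panda": 0.1},
    "E": {"Dog": 0.33, "Cat": 0.66},
}
index = build_index(docs)

# @>-style AND: intersect posting lists instead of scanning every row.
def contains_all(index, keywords):
    lists = [index.get(k, set()) for k in keywords]
    return set.intersection(*lists) if lists else set()

print(sorted(contains_all(index, ["Dog", "Cat"])))  # ['C', 'E']
```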
My final question is whether we can achieve this Weighted-keywords search in pgvector; perhaps with new operators like && and @>.
Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors.
Thanks for your suggestion! We're aware of the sparse index here and have a prototype at #552. For the weighted keyword search, we have tried https://github.com/tensorchord/pg_bestmatch.rs. Does it work for you? We're also thinking about a better way to directly build a text index for BM25 search.
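For context, BM25 produces exactly the kind of per-token weights discussed in this thread. A minimal, self-contained scoring sketch (standard Okapi BM25 with k1 = 1.2, b = 0.75; this is not the pg_bestmatch implementation, just an illustration of where the sparse weights come from):

```python
import math

# Minimal Okapi BM25 (k1=1.2, b=0.75). Each document's token scores
# form a sparse vector, which is what would be stored and queried.
def bm25_weights(docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per token.
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        w = {}
        for t in set(d):
            tf = d.count(t)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            w[t] = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        weights.append(w)
    return weights

corpus = [
    ["what", "is", "bm", "25"],
    ["bm", "25", "vs", "splade"],
    ["splade", "is", "a", "sparse", "model"],
]
for w in bm25_weights(corpus):
    print({t: round(s, 3) for t, s in w.items()})
```

Rarer tokens (higher IDF) get larger weights, so the sparse vector naturally emphasizes the distinctive terms of each document.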
I think the storage part of sparse vectors is solved. However, I think the query part must be rethought.
I honestly think that the native to_tsquery API (link1, link2) is great and can be replicated for sparse vectors (maybe to_svector_query). The to_tsquery API offers great flexibility (boolean operators):
SELECT to_tsquery('english', 'The & Fat & Rats');
SELECT to_tsquery('english', 'Fat | Rats');
SELECT to_tsquery('simple', 'Fat | Rats');
I think this type of API is great and can solve the previous animal-keywords example. Also, instead of 'english', we could add custom sparse encoders (TFIDF, BM25, SPLADE, CUSTOM):
SELECT to_svector_query('tfidf', 'We begin, as always, with the text.');
SELECT to_svector_query('bm25', 'We begin, as always, with the text.');
SELECT to_svector_query('simple', 'Dog & Cat');
SELECT to_svector_query('simple', 'Dog | Cat');
And to compute the distances, instead of the tsvector @@ operator, we could use the existing sparse-vector operators like negative inner product <#>:
SELECT text FROM documents
ORDER BY sparse_keywords <#> to_svector_query('simple', 'Dog & Cat')
LIMIT 50;
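A hypothetical sketch of what a to_svector_query('simple', ...) parser could produce: a query vector plus AND/OR matching semantics. None of this is pgvector API; the function names are invented to illustrate the proposal.

```python
# Hypothetical to_svector_query('simple', ...) parser: splits on a single
# boolean operator and gives every keyword query weight 1.0.
def parse_simple_query(text):
    if "&" in text:
        terms, mode = [t.strip() for t in text.split("&")], "and"
    elif "|" in text:
        terms, mode = [t.strip() for t in text.split("|")], "or"
    else:
        terms, mode = [text.strip()], "and"
    return {t: 1.0 for t in terms}, mode

def matches(doc, query, mode):
    present = [t in doc for t in query]
    return all(present) if mode == "and" else any(present)

def neg_inner_product(doc, query):
    # pgvector's <#> returns the NEGATIVE inner product, so smaller is better.
    return -sum(w * doc.get(t, 0.0) for t, w in query.items())

docs = {
    "C": {"Dog": 0.65, "Cat": 0.11},
    "E": {"Dog": 0.33, "Cat": 0.66},
    "A": {"Dog": 0.43},
}
query, mode = parse_simple_query("Dog & Cat")
hits = sorted((n for n, d in docs.items() if matches(d, query, mode)),
              key=lambda n: neg_inner_product(docs[n], query))
print(hits)  # ascending <#>, i.e. best inner product first
```

The ORDER BY ... <#> ... LIMIT query above corresponds to the final `sorted(...)` call, with `matches` playing the role of the boolean part of the query.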
Update
PostgreSQL's tsvector can already attach a weight to a token/keyword. However, this is a discrete weight with only 4 options (A, B, C, or D).
Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default.
Weights are typically used to reflect document structure, for example by marking title words differently from body words. source
I think a good abstraction would be for sparse vectors to become the missing tsvector with decimal (floating-point) weights, while keeping all the tsquery capabilities.
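The contrast can be made concrete. ts_rank maps the four labels to fixed numbers (its documented defaults are D=0.1, C=0.2, B=0.4, A=1.0), whereas a sparse vector stores an arbitrary float per lexeme; a small Python sketch:

```python
# tsvector-style discrete weights: four labels mapped to fixed numbers.
# ts_rank's documented defaults are D=0.1, C=0.2, B=0.4, A=1.0.
TS_DEFAULT_WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}

tsvector_doc = {"dog": "A", "cat": "B"}   # labels only: 4 possible values
svector_doc = {"dog": 0.65, "cat": 0.11}  # arbitrary floating-point weights

# A tsvector document collapses to one of only four values per lexeme:
as_numbers = {t: TS_DEFAULT_WEIGHTS[l] for t, l in tsvector_doc.items()}
print(as_numbers)  # {'dog': 1.0, 'cat': 0.4}
```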
In my opinion, sparse vectors can solve two different problems: text search and weighted-keywords search.
In both cases, we can have a sparse vector representation:
{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
{"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}
However, the use cases are different:
Text Search
For text search, sparse vectors like BM25 are a good representation. For example, if I look for the query:
BM25 vs SPLADE
The tokens will be:
["BM", "25", "vs", "SPLADE"]
It is acceptable to return documents that contain a subset of these tokens, such as:
{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
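Under these OR semantics the score is simply the dot product of the two sparse vectors, so any token overlap yields a nonzero score. A sketch with illustrative query weights (the document weights are truncated from the example above):

```python
# Sparse dot product: only tokens present in BOTH vectors contribute,
# so a document matching a subset of the query tokens still gets a score.
def sparse_dot(a, b):
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return sum(w * large.get(t, 0.0) for t, w in small.items())

query = {"BM": 0.4, "25": 0.4, "vs": 0.1, "SPLADE": 0.4}  # illustrative weights
doc = {"What": 0.104, "is": 0.101, "BM": 0.264, "25": 0.338, "?": 0.052}

score = sparse_dot(query, doc)
print(round(score, 4))  # ≈ 0.2408: only "BM" and "25" overlap
```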
Weighted-Keywords Search
This is the animal-keywords scenario described at the top of this issue: each document is annotated with keywords and probabilities, a query such as ["Dog", "Cat"] must return only documents containing all of the keywords (like the PostgreSQL @> operator), and the results are ranked by inner product distance.