Sparse vectors: 1 representation, 2 use cases #587

javiabellan · 2024-09-03T10:56:28Z

In my opinion, sparse vectors can solve two different problems:

Text search (TFIDF, BM25, SPLADE, etc.)
Weighted-keywords search

In both cases, we can have a sparse vector representation:

Text: {'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
Weighted-keywords: {"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}

However, the use cases are different:

Text Search

For text search, sparse vectors such as those produced by BM25 are a good representation. For example, if I search for the query:

BM25 vs SPLADE

It is split into these tokens:

["BM", "25", "vs", "SPLADE"]

It is acceptable to return documents that contain a subset of these tokens, such as:

{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
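To illustrate why partial matches are acceptable here, this is a minimal sketch of sparse inner-product scoring. It assumes unit weights for the query tokens (the post gives no query weights), and `sparse_dot` is a hypothetical helper, not part of pgvector or any other library.

```python
def sparse_dot(query, doc):
    """Inner product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict and look tokens up in the larger one.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(weight * large[token] for token, weight in small.items() if token in large)


# Assumption: every query token gets weight 1.0.
query = {token: 1.0 for token in ["BM", "25", "vs", "SPLADE"]}

# Document vector taken from the example above.
doc = {"What": 0.10430284, "is": 0.10090457, "BM": 0.2635918, "25": 0.3382988, "?": 0.052101523}

# Only the shared tokens "BM" and "25" contribute to the score.
score = sparse_dot(query, doc)
```

The document lacks "vs" and "SPLADE" but still gets a positive score, which is the expected behavior for text search.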

Weighted-Keywords Search

Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:

DOC A: {"Dog": 0.43}
DOC B: {"Cat": 0.21}
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC D: {"Giant panda": 0.1}
DOC E: {"Dog": 0.33, "Cat": 0.66}

If I perform a query with the keyword ["Dog"], I want the following results (sorted by inner product distance):

DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC A: {"Dog": 0.43}
DOC E: {"Dog": 0.33, "Cat": 0.66}

However, if I query with the keywords ["Dog", "Cat"], it is not acceptable to return documents with only a subset of these keywords because I want documents containing all the keywords (similar to the PostgreSQL @> operator). The sorting/ranking should then be done by inner product distance:

DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC E: {"Dog": 0.33, "Cat": 0.66}
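The filtering and ranking described above can be sketched in plain Python. This is only an illustration, not pgvector behavior: the `search_all` helper is hypothetical, and it assumes every query keyword has weight 1.0, which the post does not specify. Under that assumption, DOC E (0.33 + 0.66) scores higher than DOC C (0.65 + 0.11) for the two-keyword query, so the final order depends on the query weights chosen.

```python
# Documents from the example above, as {keyword: probability} dicts.
docs = {
    "A": {"Dog": 0.43},
    "B": {"Cat": 0.21},
    "C": {"Dog": 0.65, "Cat": 0.11},
    "D": {"Giant panda": 0.1},
    "E": {"Dog": 0.33, "Cat": 0.66},
}


def search_all(docs, keywords):
    """Keep documents that contain every keyword, then rank by inner product.

    Assumption: each query keyword has weight 1.0, so the inner product
    is the sum of the document weights for the query keywords.
    """
    hits = []
    for name, vec in docs.items():
        if all(k in vec for k in keywords):  # AND filter, like @>
            hits.append((name, sum(vec[k] for k in keywords)))
    # Highest inner product first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


single = [name for name, _ in search_all(docs, ["Dog"])]
both = [name for name, _ in search_all(docs, ["Dog", "Cat"])]
```

With these inputs, `single` contains documents C, A, and E, while `both` contains only C and E; documents A and B are dropped from `both` because each has only one of the two keywords.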

Final Notes

The plain keywords search in PostgreSQL can be achieved with the following syntax and operators:

WHERE query_keywords && ARRAY['keyword1', 'keyword2', 'keyword3'];  ---- OR between keywords
WHERE query_keywords @> ARRAY['keyword1', 'keyword2', 'keyword3'];  ---- AND between keywords
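For readers less familiar with these array operators, their semantics can be mirrored with Python sets. This is a rough sketch for illustration only; `overlaps` and `contains_all` are hypothetical names, and the real operators work on PostgreSQL arrays.

```python
def overlaps(doc_keywords, query_keywords):
    """Like &&: true if the two keyword lists share at least one element (OR)."""
    return not set(doc_keywords).isdisjoint(query_keywords)


def contains_all(doc_keywords, query_keywords):
    """Like @>: true if the document has every query keyword (AND)."""
    return set(query_keywords).issubset(doc_keywords)


doc = ["Dog", "Cat"]
or_match = overlaps(doc, ["Dog", "Giant panda"])       # shares "Dog"
and_match = contains_all(doc, ["Dog", "Giant panda"])  # "Giant panda" is missing
```

Neither operator takes keyword weights into account, which is the gap this issue is about.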

This type of data can also be accelerated with an Inverted Index in PostgreSQL: GIN.

My final question is whether we can achieve this Weighted-keywords search in pgvector; perhaps with new operators like && and @>.

Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors:

Source


VoVAllen · 2024-09-05T08:08:56Z

Thanks for your suggestion! We're aware of the sparse index here and have a prototype at #552. For the weighted keyword search, we have tried https://github.com/tensorchord/pg_bestmatch.rs. Does it work for you? We're also thinking about a better way to build a text index directly for BM25 search.

javiabellan · 2024-09-05T12:24:57Z

Thanks for answering, I will take a look at pg_bestmatch.rs.

I think the storage side of sparse vectors is solved. However, I think the query side needs to be rethought.

I honestly think that the native to_tsquery API (link1 link2) is great and can be replicated for sparse vectors (maybe as to_svector_query). The to_tsquery API offers great flexibility through boolean operators:

SELECT to_tsquery('english', 'The & Fat & Rats');
SELECT to_tsquery('english', 'Fat | Rats');
SELECT to_tsquery('simple', 'Fat | Rats');

I think this type of API is great and can solve the previous animal keywords example. Also, instead of 'english', we could pass custom sparse encoders (TFIDF, BM25, SPLADE, CUSTOM):

SELECT to_svector_query('tfidf', 'We begin, as always, with the text.');
SELECT to_svector_query('bm25', 'We begin, as always, with the text.');
SELECT to_svector_query('simple', 'Dog & Cat');
SELECT to_svector_query('simple', 'Dog | Cat');

And to compute the distances, instead of the tsvector @@ operator, we could use the existing sparse vector operators, such as the negative inner product <#>:

SELECT text
FROM documents
ORDER BY sparse_keywords <#> to_svector_query('simple', 'Dog & Cat')
LIMIT 50;
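To make the proposed semantics concrete, here is a small Python sketch of how a 'simple' query string could be interpreted. Everything in it is an assumption for illustration: `to_svector_query` does not exist today, `parse_simple_query` and `search` are hypothetical helpers, only a single operator type per query is handled, and every keyword is given weight 1.0.

```python
# Documents from the animal example in the first comment.
docs = {
    "A": {"Dog": 0.43},
    "B": {"Cat": 0.21},
    "C": {"Dog": 0.65, "Cat": 0.11},
    "D": {"Giant panda": 0.1},
    "E": {"Dog": 0.33, "Cat": 0.66},
}


def parse_simple_query(text):
    """Parse 'Dog & Cat' or 'Dog | Cat' into (mode, keywords)."""
    if "&" in text and "|" in text:
        raise ValueError("mixed operators are not handled in this sketch")
    sep, mode = ("|", "any") if "|" in text else ("&", "all")
    keywords = [part.strip() for part in text.split(sep) if part.strip()]
    return mode, keywords


def search(docs, text):
    """Filter by the boolean query, then rank by inner product (unit weights)."""
    mode, keywords = parse_simple_query(text)
    combine = all if mode == "all" else any
    hits = [
        (name, sum(vec.get(k, 0.0) for k in keywords))
        for name, vec in docs.items()
        if combine(k in vec for k in keywords)
    ]
    # Highest inner product first, i.e. lowest negative inner product.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


and_names = [name for name, _ in search(docs, "Dog & Cat")]
or_names = [name for name, _ in search(docs, "Dog | Cat")]
```

The `&` query keeps only documents C and E, while the `|` query also returns A and B, ranked below the documents that match both keywords.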

Update

PostgreSQL's tsvector can already handle a token/keyword weight. However, this is a discrete weight with only 4 possible values (A, B, C, or D).

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default.
Weights are typically used to reflect document structure, for example by marking title words differently from body words.
source

I think it would be a good abstraction for sparse vectors to become the missing tsvector with decimal (floating-point) weights, while keeping all the tsquery capabilities.

jbohnslav · 2024-10-05T14:06:20Z

An implementation of bm25 would be huge for my application. What tasks remain to be done?

VoVAllen · 2024-10-06T16:21:58Z

@jbohnslav We don't have an ETA for this. The main reason is that the algorithm is too new, and we're not sure what its actual behavior is with different data like BM25, or how to set up the config parameters. For BM25, have you tried ParadeDB? They claim to have full support for BM25.

gaocegege added the type/question 🙋 Further information is requested label Sep 5, 2024
