update vss docs
Maxxen committed Sep 23, 2024
1 parent 4c1593e commit 8e68655
Showing 4 changed files with 108 additions and 9 deletions.
12 changes: 6 additions & 6 deletions _posts/2024-05-03-vector-similarity-search-vss.md
@@ -12,8 +12,8 @@ The initial motivation for adding this data type was to provide optimized operat

However, as the hype for __vector embeddings__ and __semantic similarity search__ was growing, we also snuck in a couple of distance metric functions for this new `ARRAY` type:
[`array_distance`]({% link docs/sql/functions/array.md %}#array_distancearray1-array2),
[`array_inner_product`]({% link docs/sql/functions/array.md %}#array_inner_productarray1-array2) and
[`array_cosine_similarity`]({% link docs/sql/functions/array.md %}#array_cosine_similarityarray1-array2)
[`array_negative_inner_product`]({% link docs/sql/functions/array.md %}#array_negative_inner_productarray1-array2) and
[`array_cosine_distance`]({% link docs/sql/functions/array.md %}#array_cosine_distancearray1-array2)

> If you're one of today's [lucky 10,000](https://xkcd.com/1053/) and haven't heard of word embeddings or vector search, the short version is that it's a technique used to represent documents, images, entities – _data_ as high-dimensional _vectors_ and then search for _similar_ vectors in a vector space, using some sort of mathematical "distance" expression to measure similarity. This is used in a wide range of applications, from natural language processing to recommendation systems and image recognition, and has recently seen a surge in popularity due to the advent of generative AI and availability of pre-trained models.
@@ -79,22 +79,22 @@ LIMIT 3;
└───────────────────────────┘
```

You can pass the `HNSW` index creation statement a `metric` parameter to decide what kind of distance metric to use. The supported metrics are `l2sq`, `cosine` and `inner_product`, matching the three built-in distance functions: `array_distance`, `array_cosine_similarity` and `array_inner_product`.
You can pass the `HNSW` index creation statement a `metric` parameter to decide what kind of distance metric to use. The supported metrics are `l2sq`, `cosine` and `inner_product`, matching the three built-in distance functions: `array_distance`, `array_cosine_distance` and `array_negative_inner_product`.
The default is `l2sq`, which uses Euclidean distance (`array_distance`):

```sql
CREATE INDEX l2sq_idx ON embeddings USING HNSW (vec)
WITH (metric = 'l2sq');
```

To use cosine distance (`array_cosine_similarity`):
To use cosine distance (`array_cosine_distance`):

```sql
CREATE INDEX cos_idx ON embeddings USING HNSW (vec)
WITH (metric = 'cosine');
```

To use inner product (`array_inner_product`):
To use inner product (`array_negative_inner_product`):

```sql
CREATE INDEX ip_idx ON embeddings USING HNSW (vec)
@@ -115,7 +115,7 @@ We're actively working on addressing this and other issues related to index pers

At runtime, however, much like the `ART` index, the `HNSW` index must be able to fit into RAM in its entirety, and the memory allocated by the `HNSW` index at runtime is allocated "outside" of the DuckDB memory management system, meaning that it won't respect DuckDB's `memory_limit` configuration parameter.

Another current limitation of the `HNSW` index is that it only supports the `FLOAT` (a 32-bit, single-precision floating point) type for the array elements, and only the distance metrics corresponding to the three built-in distance functions: `array_distance`, `array_inner_product` and `array_cosine_similarity`. But this is also something we're looking to expand upon in the near future, as it is much less of a technical limitation and more of a "we haven't gotten around to it yet" limitation.
Another current limitation of the `HNSW` index is that it only supports the `FLOAT` (a 32-bit, single-precision floating point) type for the array elements, and only the distance metrics corresponding to the three built-in distance functions: `array_distance`, `array_negative_inner_product` and `array_cosine_distance`. But this is also something we're looking to expand upon in the near future, as it is much less of a technical limitation and more of a "we haven't gotten around to it yet" limitation.

## Conclusion

56 changes: 54 additions & 2 deletions docs/extensions/vss.md
@@ -24,6 +24,13 @@ The index will then be used to accelerate queries that use an `ORDER BY` clause e
SELECT * FROM my_vector_table ORDER BY array_distance(vec, [1, 2, 3]::FLOAT[3]) LIMIT 3;
```

Additionally, the overloaded `min_by(col, arg, n)` function can also be accelerated with the `HNSW` index if the `arg` argument is a matching distance metric function. This can be used to do quick one-shot nearest-neighbor searches. For example, to get the top 3 rows with the closest vectors to `[1, 2, 3]`:
```sql
SELECT min_by(my_vector_table, array_distance(vec, [1,2,3]::FLOAT[3]), 3) AS result FROM my_vector_table;
---- [{'vec': [1.0, 2.0, 3.0]}, {'vec': [1.0, 2.0, 4.0]}, {'vec': [2.0, 2.0, 3.0]}]
```
Note how we pass the table name as the first argument to `min_by` to return a struct containing the entire matched row.

We can verify that the index is being used by checking the `EXPLAIN` output and looking for the `HNSW_INDEX_SCAN` node in the plan:

```sql
@@ -69,8 +76,8 @@ The following table shows the supported distance metrics and their corresponding
| Metric | Function | Description |
| -------- | ------------------------- | ------------------ |
| `l2sq` | `array_distance` | Euclidean distance |
| `cosine` | `array_cosine_similarity` | Cosine similarity |
| `ip` | `array_inner_product` | Inner product |
| `cosine` | `array_cosine_distance` | Cosine distance |
| `ip` | `array_negative_inner_product` | Negative inner product |

Note that while each `HNSW` index only applies to a single column, you can create multiple `HNSW` indexes on the same table, each individually indexing a different column. Additionally, you can also create multiple `HNSW` indexes on the same column, each supporting a different distance metric.

@@ -106,9 +113,54 @@ The HNSW index does support inserting, updating and deleting rows from the table

To remedy the last point, you can call the `PRAGMA hnsw_compact_index('⟨index name⟩')` pragma function to trigger a re-compaction of the index, pruning deleted items, or re-create the index after a significant number of updates.


## Bonus: Vector Similarity Search Joins

The `vss` extension also provides a couple of table macros to simplify matching multiple vectors against each other, so-called "fuzzy joins". These are:
* `vss_join(left_table, right_table, left_col, right_col, k, metric := 'l2sq')`
* `vss_match(right_table, left_col, right_col, k, metric := 'l2sq')`

These __do not__ currently make use of the `HNSW` index, but are provided as convenience utility functions for users who are OK with performing brute-force vector similarity searches without having to write out the join logic themselves. In the future, these might become targets for index-based optimizations as well.

These functions can be used as follows:

```sql
CREATE TABLE haystack (id int, vec FLOAT[3]);
CREATE TABLE needle(search_vec FLOAT[3]);

INSERT INTO haystack SELECT row_number() over (), array_value(a,b,c) FROM range(1,10) ra(a), range(1,10) rb(b), range(1,10) rc(c);

INSERT INTO needle VALUES ([5,5,5]), ([1,1,1]);

SELECT * FROM vss_join(needle, haystack, search_vec, vec, 3) as res;
┌───────┬─────────────────────────────────┬─────────────────────────────────────┐
│ score │ left_tbl │ right_tbl │
│ float │ struct(search_vec float[3]) │ struct(id integer, vec float[3]) │
├───────┼─────────────────────────────────┼─────────────────────────────────────┤
│   0.0 │ {'search_vec': [5.0, 5.0, 5.0]} │ {'id': 365, 'vec': [5.0, 5.0, 5.0]} │
│   1.0 │ {'search_vec': [5.0, 5.0, 5.0]} │ {'id': 364, 'vec': [5.0, 4.0, 5.0]} │
│   1.0 │ {'search_vec': [5.0, 5.0, 5.0]} │ {'id': 356, 'vec': [4.0, 5.0, 5.0]} │
│   0.0 │ {'search_vec': [1.0, 1.0, 1.0]} │ {'id': 1, 'vec': [1.0, 1.0, 1.0]}   │
│   1.0 │ {'search_vec': [1.0, 1.0, 1.0]} │ {'id': 10, 'vec': [2.0, 1.0, 1.0]}  │
│   1.0 │ {'search_vec': [1.0, 1.0, 1.0]} │ {'id': 2, 'vec': [1.0, 2.0, 1.0]}   │
└───────┴─────────────────────────────────┴─────────────────────────────────────┘

-- Alternatively, we can use the vss_match macro as a "lateral join" to get the matches already grouped by the left table.
-- Note that this requires us to specify the left table first, and then the vss_match macro which references the search column from the left table (in this case, `search_vec`).
SELECT * FROM needle, vss_match(haystack, search_vec, vec, 3) as res;
┌─────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ search_vec │ matches │
│ float[3] │ struct(score float, "row" struct(id integer, vec float[3]))[] │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [5.0, 5.0, 5.0] │ [{'score': 0.0, 'row': {'id': 365, 'vec': [5.0, 5.0, 5.0]}}, {'score': 1.0, 'row': {'id': 364, 'vec': [5.0, 4.0, 5.0]}}, {'score': 1.0, 'row': {'id': 356, 'vec': [4.0, 5.0, 5.0]}}] │
│ [1.0, 1.0, 1.0] │ [{'score': 0.0, 'row': {'id': 1, 'vec': [1.0, 1.0, 1.0]}}, {'score': 1.0, 'row': {'id': 10, 'vec': [2.0, 1.0, 1.0]}}, {'score': 1.0, 'row': {'id': 2, 'vec': [1.0, 2.0, 1.0]}}] │
└─────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

## Limitations

* Only vectors consisting of `FLOAT`s (32-bit, single precision) are supported at the moment.
* The index itself is not buffer-managed and must be able to fit into RAM.
* The size of the index in memory does not count towards DuckDB's `memory_limit` configuration parameter.
* `HNSW` indexes can only be created on tables in in-memory databases, unless the `SET hnsw_enable_experimental_persistence = ⟨bool⟩` configuration option is set to `true`; see [Persistence](#persistence) for more information.
* The vector join table macros (`vss_join` and `vss_match`) do not require or make use of the `HNSW` index.

28 changes: 28 additions & 0 deletions docs/sql/functions/array.md
@@ -14,9 +14,12 @@ All [`LIST` functions]({% link docs/sql/functions/list.md %}) work with the [`AR
| [`array_value(index)`](#array_valueindex) | Create an `ARRAY` containing the argument values. |
| [`array_cross_product(array1, array2)`](#array_cross_productarray1-array2) | Compute the cross product of two arrays of size 3. The array elements can not be `NULL`. |
| [`array_cosine_similarity(array1, array2)`](#array_cosine_similarityarray1-array2) | Compute the cosine similarity between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. |
| [`array_cosine_distance(array1, array2)`](#array_cosine_distancearray1-array2) | Compute the cosine distance between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `1.0 - array_cosine_similarity`. |
| [`array_distance(array1, array2)`](#array_distancearray1-array2) | Compute the distance between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. |
| [`array_inner_product(array1, array2)`](#array_inner_productarray1-array2) | Compute the inner product between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. |
| [`array_negative_inner_product(array1, array2)`](#array_negative_inner_productarray1-array2) | Compute the negative inner product between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `-array_inner_product`. |
| [`array_dot_product(array1, array2)`](#array_dot_productarray1-array2) | Alias for `array_inner_product(array1, array2)`. |
| [`array_negative_dot_product(array1, array2)`](#array_negative_dot_productarray1-array2) | Alias for `array_negative_inner_product(array1, array2)`. |

#### `array_value(index)`

@@ -42,6 +45,14 @@ All [`LIST` functions]({% link docs/sql/functions/list.md %}) work with the [`AR
| **Example** | `array_cosine_similarity(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `0.9925833` |

#### `array_cosine_distance(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Compute the cosine distance between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `1.0 - array_cosine_similarity`. |
| **Example** | `array_cosine_distance(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `0.007416606` |
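As a sanity check on the identity above, here is a short plain-Python sketch (illustrative helper names, not DuckDB's implementation) that reproduces the documented example values:

```python
import math

# Reference implementations of cosine similarity and cosine distance.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cosine_distance(a, b):
    # Equivalent to 1.0 - cosine_similarity(a, b)
    return 1.0 - cosine_similarity(a, b)

sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
dist = cosine_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(sim, dist)  # roughly 0.9925833 and 0.0074166
```

Note that the two values always sum to exactly `1.0`, matching the `array_cosine_similarity` and `array_cosine_distance` results shown in the tables above.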

#### `array_distance(array1, array2)`

<div class="nostroke_table"></div>
@@ -58,10 +69,27 @@ All [`LIST` functions]({% link docs/sql/functions/list.md %}) work with the [`AR
| **Example** | `array_inner_product(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `20.0` |

#### `array_negative_inner_product(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Compute the negative inner product between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `-array_inner_product`. |
| **Example** | `array_negative_inner_product(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `-20.0` |
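The sign relationship between the two functions can likewise be illustrated with a plain-Python sketch (illustrative helper names, not DuckDB's implementation):

```python
# Reference implementations of the inner product and its negation.
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_inner_product(a, b):
    # Equivalent to -inner_product(a, b)
    return -inner_product(a, b)

ip = inner_product([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
nip = negative_inner_product([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(ip, nip)  # 20.0 -20.0
```

Negating the inner product turns a "larger is more similar" score into a distance-like "smaller is closer" score, which is why the `HNSW` index's `ip` metric pairs with `array_negative_inner_product`.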

#### `array_dot_product(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Alias for `array_inner_product(array1, array2)`. |
| **Example** | `array_dot_product(l1, l2)` |
| **Result** | `20.0` |


#### `array_negative_dot_product(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Alias for `array_negative_inner_product(array1, array2)`. |
| **Example** | `array_negative_dot_product(l1, l2)` |
| **Result** | `-20.0` |
