update vss docs
Maxxen committed Sep 23, 2024
1 parent 4c1593e commit 8e68655
Showing 4 changed files with 108 additions and 9 deletions.
12 changes: 6 additions & 6 deletions _posts/2024-05-03-vector-similarity-search-vss.md
@@ -12,8 +12,8 @@ The initial motivation for adding this data type was to provide optimized operat

However, as the hype for __vector embeddings__ and __semantic similarity search__ was growing, we also snuck in a couple of distance metric functions for this new `ARRAY` type:
[`array_distance`]({% link docs/sql/functions/array.md %}#array_distancearray1-array2),
[`array_inner_product`]({% link docs/sql/functions/array.md %}#array_inner_productarray1-array2) and
[`array_cosine_similarity`]({% link docs/sql/functions/array.md %}#array_cosine_similarityarray1-array2)
[`array_negative_inner_product`]({% link docs/sql/functions/array.md %}#array_negative_inner_productarray1-array2) and
[`array_cosine_distance`]({% link docs/sql/functions/array.md %}#array_cosine_distancearray1-array2)

> If you're one of today's [lucky 10,000](https://xkcd.com/1053/) and haven't heard of word embeddings or vector search, the short version is that it's a technique used to represent documents, images, entities – _data_ as high-dimensional _vectors_ and then search for _similar_ vectors in a vector space, using some sort of mathematical "distance" expression to measure similarity. This is used in a wide range of applications, from natural language processing to recommendation systems and image recognition, and has recently seen a surge in popularity due to the advent of generative AI and availability of pre-trained models.
@@ -79,22 +79,22 @@ LIMIT 3;
└───────────────────────────┘
```

You can pass the `HNSW` index creation statement a `metric` parameter to decide what kind of distance metric to use. The supported metrics are `l2sq`, `cosine` and `inner_product`, matching the three built-in distance functions: `array_distance`, `array_cosine_similarity` and `array_inner_product`.
You can pass the `HNSW` index creation statement a `metric` parameter to decide what kind of distance metric to use. The supported metrics are `l2sq`, `cosine` and `inner_product`, matching the three built-in distance functions: `array_distance`, `array_cosine_distance` and `array_negative_inner_product`.
The default is `l2sq`, which uses Euclidean distance (`array_distance`):

```sql
CREATE INDEX l2sq_idx ON embeddings USING HNSW (vec)
WITH (metric = 'l2sq');
```

To use cosine distance (`array_cosine_similarity`):
To use cosine distance (`array_cosine_distance`):

```sql
CREATE INDEX cos_idx ON embeddings USING HNSW (vec)
WITH (metric = 'cosine');
```

To use inner product (`array_inner_product`):
To use inner product (`array_negative_inner_product`):

```sql
CREATE INDEX ip_idx ON embeddings USING HNSW (vec)
@@ -115,7 +115,7 @@ We're actively working on addressing this and other issues related to index pers

At runtime, however, much like the `ART` index, the `HNSW` index must be able to fit into RAM in its entirety, and the memory allocated by the `HNSW` index at runtime is allocated "outside" of the DuckDB memory management system, meaning that it won't respect DuckDB's `memory_limit` configuration parameter.

Another current limitation of the `HNSW` index is that it only supports the `FLOAT` (a 32-bit, single-precision floating point) type for the array elements, and only the distance metrics corresponding to the three built-in distance functions: `array_distance`, `array_inner_product` and `array_cosine_similarity`. But this is also something we're looking to expand upon in the near future, as it is much less of a technical limitation and more of a "we haven't gotten around to it yet" limitation.
Another current limitation of the `HNSW` index is that it only supports the `FLOAT` (a 32-bit, single-precision floating point) type for the array elements, and only the distance metrics corresponding to the three built-in distance functions: `array_distance`, `array_negative_inner_product` and `array_cosine_distance`. But this is also something we're looking to expand upon in the near future, as it is much less of a technical limitation and more of a "we haven't gotten around to it yet" limitation.

## Conclusion

56 changes: 54 additions & 2 deletions docs/extensions/vss.md
@@ -24,6 +24,13 @@ The index will then be used to accelerate queries that use an `ORDER BY` clause e
SELECT * FROM my_vector_table ORDER BY array_distance(vec, [1, 2, 3]::FLOAT[3]) LIMIT 3;
```

Additionally, the overloaded `min_by(col, arg, n)` function can also be accelerated with the `HNSW` index if the `arg` argument is a matching distance metric function. This can be used to do quick one-shot nearest-neighbor searches. For example, to get the top 3 rows with the closest vectors to `[1, 2, 3]`:
```sql
SELECT min_by(my_vector_table, array_distance(vec, [1,2,3]::FLOAT[3]), 3) AS result FROM my_vector_table;
---- [{'vec': [1.0, 2.0, 3.0]}, {'vec': [1.0, 2.0, 4.0]}, {'vec': [2.0, 2.0, 3.0]}]
```
Note how we pass the table name as the first argument to `min_by` to return a struct containing the entire matched row.

We can verify that the index is being used by checking the `EXPLAIN` output and looking for the `HNSW_INDEX_SCAN` node in the plan:

```sql
@@ -69,8 +76,8 @@ The following table shows the supported distance metrics and their corresponding
| Metric | Function | Description |
| -------- | ------------------------- | ------------------ |
| `l2sq` | `array_distance` | Euclidean distance |
| `cosine` | `array_cosine_similarity` | Cosine similarity |
| `ip` | `array_inner_product` | Inner product |
| `cosine` | `array_cosine_distance` | Cosine distance |
| `ip` | `array_negative_inner_product` | Negative inner product |

Note that while each `HNSW` index only applies to a single column, you can create multiple `HNSW` indexes on the same table, each individually indexing a different column. Additionally, you can also create multiple `HNSW` indexes on the same column, each supporting a different distance metric.

@@ -106,9 +113,54 @@ The HNSW index does support inserting, updating and deleting rows from the table

To remedy the last point, you can call the `PRAGMA hnsw_compact_index('⟨index name⟩')` pragma function to trigger a re-compaction of the index, pruning deleted items, or re-create the index after a significant number of updates.


## Bonus: Vector Similarity Search Joins

The `vss` extension also provides a couple of table macros to simplify matching multiple vectors against each other, so-called "fuzzy joins". These are:
* `vss_join(left_table, right_table, left_col, right_col, k, metric := 'l2sq')`
* `vss_match(right_table, left_col, right_col, k, metric := 'l2sq')`

These __do not__ currently make use of the `HNSW` index, but are provided as convenience utility functions for users who are OK with performing brute-force vector similarity searches without having to write out the join logic themselves. In the future, these might become targets for index-based optimizations as well.

These functions can be used as follows:

```sql
CREATE TABLE haystack (id int, vec FLOAT[3]);
CREATE TABLE needle(search_vec FLOAT[3]);

INSERT INTO haystack SELECT row_number() over (), array_value(a,b,c) FROM range(1,10) ra(a), range(1,10) rb(b), range(1,10) rc(c);

INSERT INTO needle VALUES ([5,5,5]), ([1,1,1]);

SELECT * FROM vss_join(needle, haystack, search_vec, vec, 3) as res;
┌───────┬─────────────────────────────────┬─────────────────────────────────────┐
│ score │ left_tbl │ right_tbl │
│ float │ struct(search_vec float[3]) │ struct(id integer, vec float[3]) │
├───────┼─────────────────────────────────┼─────────────────────────────────────┤
│   0.0 │ {'search_vec': [5.0, 5.0, 5.0]} │ {'id': 365, 'vec': [5.0, 5.0, 5.0]} │
│   1.0 │ {'search_vec': [5.0, 5.0, 5.0]} │ {'id': 364, 'vec': [5.0, 4.0, 5.0]} │
│   1.0 │ {'search_vec': [5.0, 5.0, 5.0]} │ {'id': 356, 'vec': [4.0, 5.0, 5.0]} │
│   0.0 │ {'search_vec': [1.0, 1.0, 1.0]} │ {'id': 1, 'vec': [1.0, 1.0, 1.0]}   │
│   1.0 │ {'search_vec': [1.0, 1.0, 1.0]} │ {'id': 10, 'vec': [2.0, 1.0, 1.0]}  │
│   1.0 │ {'search_vec': [1.0, 1.0, 1.0]} │ {'id': 2, 'vec': [1.0, 2.0, 1.0]}   │
└───────┴─────────────────────────────────┴─────────────────────────────────────┘

-- Alternatively, we can use the vss_match macro as a "lateral join" to get the matches already grouped by the left table.
-- Note that this requires us to specify the left table first, and then the vss_match macro which references the search column from the left table (in this case, `search_vec`).
SELECT * FROM needle, vss_match(haystack, search_vec, vec, 3) as res;
┌─────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ search_vec │ matches │
│ float[3] │ struct(score float, "row" struct(id integer, vec float[3]))[] │
├─────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [5.0, 5.0, 5.0] │ [{'score': 0.0, 'row': {'id': 365, 'vec': [5.0, 5.0, 5.0]}}, {'score': 1.0, 'row': {'id': 364, 'vec': [5.0, 4.0, 5.0]}}, {'score': 1.0, 'row': {'id': 356, 'vec': [4.0, 5.0, 5.0]}}] │
│ [1.0, 1.0, 1.0] │ [{'score': 0.0, 'row': {'id': 1, 'vec': [1.0, 1.0, 1.0]}}, {'score': 1.0, 'row': {'id': 10, 'vec': [2.0, 1.0, 1.0]}}, {'score': 1.0, 'row': {'id': 2, 'vec': [1.0, 2.0, 1.0]}}] │
└─────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

## Limitations

* Only vectors consisting of `FLOAT`s (32-bit, single precision) are supported at the moment.
* The index itself is not buffer-managed and must be able to fit into RAM.
* The size of the index in memory does not count towards DuckDB's `memory_limit` configuration parameter.
* `HNSW` indexes can only be created on tables in in-memory databases, unless the `SET hnsw_enable_experimental_persistence = ⟨bool⟩` configuration option is set to `true`; see [Persistence](#persistence) for more information.
* The vector join table macros (`vss_join` and `vss_match`) do not require or make use of the `HNSW` index.

28 changes: 28 additions & 0 deletions docs/sql/functions/array.md
@@ -14,9 +14,12 @@ All [`LIST` functions]({% link docs/sql/functions/list.md %}) work with the [`AR
| [`array_value(index)`](#array_valueindex) | Create an `ARRAY` containing the argument values. |
| [`array_cross_product(array1, array2)`](#array_cross_productarray1-array2) | Compute the cross product of two arrays of size 3. The array elements can not be `NULL`. |
| [`array_cosine_similarity(array1, array2)`](#array_cosine_similarityarray1-array2) | Compute the cosine similarity between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. |
| [`array_cosine_distance(array1, array2)`](#array_cosine_distancearray1-array2) | Compute the cosine distance between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `1.0 - array_cosine_similarity`. |
| [`array_distance(array1, array2)`](#array_distancearray1-array2) | Compute the distance between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. |
| [`array_inner_product(array1, array2)`](#array_inner_productarray1-array2) | Compute the inner product between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. |
| [`array_negative_inner_product(array1, array2)`](#array_negative_inner_productarray1-array2) | Compute the negative inner product between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `-array_inner_product`. |
| [`array_dot_product(array1, array2)`](#array_dot_productarray1-array2) | Alias for `array_inner_product(array1, array2)`. |
| [`array_negative_dot_product(array1, array2)`](#array_negative_dot_productarray1-array2) | Alias for `array_negative_inner_product(array1, array2)`. |

#### `array_value(index)`

@@ -42,6 +45,14 @@ All [`LIST` functions]({% link docs/sql/functions/list.md %}) work with the [`AR
| **Example** | `array_cosine_similarity(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `0.9925833` |

#### `array_cosine_distance(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Compute the cosine distance between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `1.0 - array_cosine_similarity`. |
| **Example** | `array_cosine_distance(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `0.007416606` |
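As a sanity check on the identity above, here is a short plain-Python sketch (illustrative helper names, not DuckDB's implementation) that reproduces the documented example values:

```python
import math

# Reference implementations of cosine similarity and cosine distance.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cosine_distance(a, b):
    # Equivalent to 1.0 - cosine_similarity(a, b)
    return 1.0 - cosine_similarity(a, b)

sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
dist = cosine_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(sim, dist)  # roughly 0.9925833 and 0.0074166
```

Note that the two values always sum to exactly `1.0`, matching the `array_cosine_similarity` and `array_cosine_distance` results shown in the tables above.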

#### `array_distance(array1, array2)`

<div class="nostroke_table"></div>
@@ -58,10 +69,27 @@ All [`LIST` functions]({% link docs/sql/functions/list.md %}) work with the [`AR
| **Example** | `array_inner_product(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `20.0` |

#### `array_negative_inner_product(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Compute the negative inner product between two arrays of the same size. The array elements can not be `NULL`. The arrays can have any size as long as the size is the same for both arguments. This is equivalent to `-array_inner_product`. |
| **Example** | `array_negative_inner_product(array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT), array_value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT))` |
| **Result** | `-20.0` |
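The sign relationship between the two functions can likewise be illustrated with a plain-Python sketch (illustrative helper names, not DuckDB's implementation):

```python
# Reference implementations of the inner product and its negation.
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_inner_product(a, b):
    # Equivalent to -inner_product(a, b)
    return -inner_product(a, b)

ip = inner_product([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
nip = negative_inner_product([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(ip, nip)  # 20.0 -20.0
```

Negating the inner product turns a "larger is more similar" score into a distance-like "smaller is closer" score, which is why the `HNSW` index's `ip` metric pairs with `array_negative_inner_product`.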

#### `array_dot_product(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Alias for `array_inner_product(array1, array2)`. |
| **Example** | `array_dot_product(l1, l2)` |
| **Result** | `20.0` |


#### `array_negative_dot_product(array1, array2)`

<div class="nostroke_table"></div>

| **Description** | Alias for `array_negative_inner_product(array1, array2)`. |
| **Example** | `array_negative_dot_product(l1, l2)` |
| **Result** | `-20.0` |
