Metadata filtering (#124)
* initial pass at PARTITION KEY support.

* Initial pass, allow auxiliary columns on vec0 virtual tables

* update TODO

* Initial pass at metadata filtering

* unit tests

* gha this PR branch

* fixup tests

* doc internal

* fix tests, KNN/rowids in

* define SQLITE_INDEX_CONSTRAINT_OFFSET

* whoops

* update tests, syrupy, use uv

* un ignore pyproject.toml

* dot

* tests/

* type error?

* win: .exe, update error name

* try fix macos python, paren around expr?

* win bash?

* dbg :(

* explicit error

* op

* dbg win

* win ./tests/.venv/Scripts/python.exe

* block UPDATEs on partition key values for now

* test this branch

* accidentally removed "partition key type mistmatch" block during merge

* typo ugh

* bruv

* start aux snapshots

* drop aux shadow table on destroy

* enforce column types

* block WHERE constraints on auxiliary columns in KNN queries

* support delete

* support UPDATE on auxiliary columns

* test this PR

* dont inline that

* test-metadata.py

* memzero text buffer

* stress test

* more snapshot tests

* rm double/int32, just float/int64

* finish type checking

* long text support

* DELETE support

* UPDATE support

* fix snapshot names

* drop not-used in eqp

* small fixes

* boolean comparison handling

* ensure error is raised when long string constraint

* new version string for beta builds

* typo whoops

* ann-filtering-benchmark directory

* test-case

* updates

* fix aux column error when using non-default rowid values, needs test

* refactor some text knn filtering

* rowids blob read only on text metadata filters

* refactor

* add failing test causes for non eq text knn

* text knn NE

* test cases diff

* GT

* text knn GT/GE fixes

* text knn LT/LE

* clean

* vtab_in handling

* unblock aux failures for now

* guard sqlite3_vtab_in

* else in guard?

* fixes and tests

* add broken shadow table test

* rename _metadata_chunksNN shadow table to _metadatachunksNN, for proper shadowName detection

* _metadata_text_NN shadow tables to _metadatatextNN

* SQLITE_VEC_VERSION_MAJOR SQLITE_VEC_VERSION_MINOR and SQLITE_VEC_VERSION_PATCH in sqlite-vec.h

* _info shadow table

* forgot to update aux snapshot?

* fix aux tests
asg017 authored Nov 20, 2024
1 parent 9bfeaa7 commit 352f953
Showing 21 changed files with 7,366 additions and 110 deletions.
1 change: 1 addition & 0 deletions .github/workflows/test.yaml
@@ -5,6 +5,7 @@ on:
- main
- partition-by
- auxiliary
- metadata-filtering
permissions:
contents: read
jobs:
5 changes: 5 additions & 0 deletions .gitignore
@@ -26,3 +26,8 @@ sqlite-vec.h
tmp/

poetry.lock

*.jsonl

memstat.c
memstat.*
82 changes: 75 additions & 7 deletions ARCHITECTURE.md
@@ -1,5 +1,51 @@
# `sqlite-vec` Architecture

Internal documentation for how `sqlite-vec` works under the hood. It is not
meant for users of the `sqlite-vec` project; for how-to guides, consult
[the official `sqlite-vec` documentation](https://alexgarcia.xyz/sqlite-vec).
Rather, it is for people interested in how `sqlite-vec` works, and it offers
some guidelines for future contributors.

Very much a WIP.

## `vec0`

### Shadow Tables

#### `xyz_chunks`

- `chunk_id INTEGER`
- `size INTEGER`
- `validity BLOB`
- `rowids BLOB`

#### `xyz_rowids`

- `rowid INTEGER`
- `id`
- `chunk_id INTEGER`
- `chunk_offset INTEGER`

#### `xyz_vector_chunksNN`

- `rowid INTEGER`
- `vector BLOB`

#### `xyz_auxiliary`

- `rowid INTEGER`
- `valueNN [type]`

#### `xyz_metadatachunksNN`

- `rowid INTEGER`
- `data BLOB`

#### `xyz_metadatatextNN`

- `rowid INTEGER`
- `data TEXT`
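
As a rough illustration of how the `xyz_rowids` columns `chunk_id` and
`chunk_offset` could be used to locate one row inside a chunk, here is a
minimal C sketch. The blob layouts are assumptions that the column lists above
do not state: `validity` is taken to be one bit per slot, `rowids` a packed
array of 64-bit integers in native byte order, and `vector` a run of `float`
vectors stored back to back. All three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: `validity` holds one bit per chunk slot, lowest bit first. */
static int slot_is_valid(const uint8_t *validity, size_t chunk_size,
                         size_t chunk_offset) {
  assert(chunk_offset < chunk_size);
  return (validity[chunk_offset / 8] >> (chunk_offset % 8)) & 1;
}

/* Assumption: `rowids` is a packed array of int64 values, one per slot. */
static int64_t slot_rowid(const void *rowids, size_t chunk_size,
                          size_t chunk_offset) {
  int64_t rowid;
  assert(chunk_offset < chunk_size);
  /* memcpy avoids unaligned reads from the blob buffer. */
  memcpy(&rowid, (const uint8_t *)rowids + chunk_offset * sizeof(int64_t),
         sizeof(rowid));
  return rowid;
}

/* Assumption: float vectors of `dimensions` elements are stored contiguously. */
static size_t vector_byte_offset(size_t chunk_offset, size_t dimensions) {
  return chunk_offset * dimensions * sizeof(float);
}
```

Under these assumptions, a point lookup would read `chunk_id` and
`chunk_offset` from `xyz_rowids`, then index into the matching chunk blobs
without scanning them.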

### idxStr

The `vec0` idxStr is a string composed of a single "header" character and 0 or
@@ -14,8 +60,11 @@ The "header" character denotes the type of query plan, as determined by the
| `VEC0_QUERY_PLAN_POINT` | `'2'` | Perform a single-lookup point query for the provided rowid |
| `VEC0_QUERY_PLAN_KNN` | `'3'` | Perform a KNN-style query on the provided query vector and parameters. |

Each 4-character "block" is associated with a corresponding value in `argv[]`. For example, the 1st block at byte offset `1-4` (inclusive) is the 1st block and is associated with `argv[1]`. The 2nd block at byte offset `5-8` (inclusive) is associated with `argv[2]` and so on. Each block describes what kind of value or filter the given `argv[i]` value is.

Each 4-character "block" is associated with a corresponding value in `argv[]`.
For example, the 1st block, at byte offsets `1-4` (inclusive), is associated
with `argv[1]`. The 2nd block, at byte offsets `5-8` (inclusive), is associated
with `argv[2]`, and so on. Each block describes what kind of value or filter the
given `argv[i]` value is.
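
The layout described above (one header character followed by whole
4-character blocks) can be walked with a few lines of C. This is only a sketch:
`BLOCK_SIZE`, `idxstr_block_count()` and `idxstr_block()` are hypothetical
names, not functions from `sqlite-vec.c`, and blocks are numbered from 1 to
match the "1st block at byte offset `1-4`" wording.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4 /* each block is 4 characters wide */

/* Number of complete blocks that follow the header character. */
static size_t idxstr_block_count(const char *idxStr) {
  size_t len = strlen(idxStr);
  assert(len >= 1);                    /* the header character is required */
  assert((len - 1) % BLOCK_SIZE == 0); /* the rest must be whole blocks */
  return (len - 1) / BLOCK_SIZE;
}

/* Pointer to the n-th block, with n starting at 1. */
static const char *idxstr_block(const char *idxStr, size_t n) {
  assert(n >= 1 && n <= idxstr_block_count(idxStr));
  return idxStr + 1 + (n - 1) * BLOCK_SIZE;
}
```

For the string `"3{___[___"`, the header is `'3'` (a KNN plan), block 1 starts
at byte offset 1 with kind `'{'`, and block 2 starts at byte offset 5 with kind
`'['`.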

#### `VEC0_IDXSTR_KIND_KNN_MATCH` (`'{'`)

@@ -31,24 +80,43 @@ The remaining 3 characters of the block are `_` fillers.

#### `VEC0_IDXSTR_KIND_KNN_ROWID_IN` (`'['`)

`argv[i]` is the optional `rowid in (...)` value, and must be handled with [`sqlite3_vtab_in_first()` /
`sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).
`argv[i]` is the optional `rowid in (...)` value, and must be handled with
[`sqlite3_vtab_in_first()` / `sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).

The remaining 3 characters of the block are `_` fillers.

#### `VEC0_IDXSTR_KIND_KNN_PARTITON_CONSTRAINT` (`']'`)

`argv[i]` is a "constraint" on a specific partition key.

The second character of the block denotes which partition key to filter on, using `A` to denote the first partition key column, `B` for the second, etc. It is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.
The second character of the block denotes which partition key to filter on,
using `A` to denote the first partition key column, `B` for the second, etc. It
is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.

The third character of the block denotes which operator is used in the constraint. It will be one of the values of `enum vec0_partition_operator`, as only a subset of operations are supported on partition keys.
The third character of the block denotes which operator is used in the
constraint. It will be one of the values of `enum vec0_partition_operator`, as
only a subset of operations are supported on partition keys.

The fourth character of the block is a `_` filler.
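
A small C sketch of the partition-constraint block encoding described above.
The helper names are hypothetical, and the operator is passed as a raw `char`
because the members of `enum vec0_partition_operator` are not listed here. The
26-column bound is only a guard for this sketch, not a documented limit.

```c
#include <assert.h>

/* Write one 4-character partition-constraint block into out[0..3]. */
static void encode_partition_block(char out[4], int partition_idx, char op) {
  assert(partition_idx >= 0 && partition_idx < 26); /* sketch-only bound */
  out[0] = ']';                         /* block kind */
  out[1] = (char)('A' + partition_idx); /* which partition key column */
  out[2] = op;                          /* operator character */
  out[3] = '_';                         /* filler */
}

/* Recover the partition key index from a block written as above. */
static int decode_partition_idx(const char block[4]) {
  assert(block[0] == ']');
  return block[1] - 'A';
}
```

With a placeholder operator character `'a'`, encoding a constraint on the
second partition key column yields the block `]Ba_`, and decoding it returns
index `1`.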


#### `VEC0_IDXSTR_KIND_POINT_ID` (`'!'`)

`argv[i]` is the value of the rowid or id to match against for the point query.

The remaining 3 characters of the block are `_` fillers.

#### `VEC0_IDXSTR_KIND_METADATA_CONSTRAINT` (`'&'`)

`argv[i]` is the value of the `WHERE` constraint for a metadata column in a KNN
query.

The second character of the block denotes which metadata column the constraint
belongs to, using `A` to denote the first metadata column, `B` for the
second, etc. It is encoded with `'A' + metadata_idx` and can be decoded with
`c - 'A'`.

The third character of the block is the constraint operator. It will be one of
`enum vec0_metadata_operator`, as only a subset of operators are supported on
metadata column KNN filters.

The fourth character of the block is a `_` filler.
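
The reverse direction, reading a metadata-constraint block back into its parts,
might look like the following C sketch. `struct metadata_constraint` and
`parse_metadata_block()` are hypothetical names used for illustration. The
operator is kept as a raw character, since the members of
`enum vec0_metadata_operator` are not listed in this section.

```c
#include <assert.h>
#include <stddef.h>

struct metadata_constraint {
  int metadata_idx; /* 0 for the first metadata column */
  char op;          /* raw operator character from the block */
};

/* Returns 0 on success, -1 if `block` is not a well-formed metadata block. */
static int parse_metadata_block(const char *block,
                                struct metadata_constraint *out) {
  assert(block != NULL && out != NULL);
  if (block[0] != '&') return -1;  /* wrong block kind */
  if (block[1] < 'A') return -1;   /* also rejects an early terminator */
  if (block[2] == '\0') return -1; /* block is truncated */
  if (block[3] != '_') return -1;  /* fourth character must be the filler */
  out->metadata_idx = block[1] - 'A';
  out->op = block[2];
  return 0;
}
```

Returning an error code rather than asserting lets a caller reject a malformed
idxStr without aborting the process.
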
3 changes: 3 additions & 0 deletions Makefile
@@ -153,6 +153,9 @@ sqlite-vec.h: sqlite-vec.h.tmpl VERSION
VERSION=$(shell cat VERSION) \
DATE=$(shell date -r VERSION +'%FT%TZ%z') \
SOURCE=$(shell git log -n 1 --pretty=format:%H -- VERSION) \
VERSION_MAJOR=$$(echo $$VERSION | cut -d. -f1) \
VERSION_MINOR=$$(echo $$VERSION | cut -d. -f2) \
VERSION_PATCH=$$(echo $$VERSION | cut -d. -f3 | cut -d- -f1) \
envsubst < $< > $@

clean:
28 changes: 16 additions & 12 deletions TODO
@@ -1,13 +1,17 @@
# partition
- [ ] add `xyz_info` shadow table with version etc.

- [ ] UPDATE on partition key values
- remove previous row from chunk, insert into new one?
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling

# auxiliary columns

- later:
- NOT NULL?
- perf: INSERT stmt should be cached on vec0_vtab
- perf: LEFT JOIN aux table to rowids query in vec0_cursor for rowid/point
stmts, to avoid N lookup queries
- later
- [ ] partition: UPDATE support
- [ ] skip invalid validity entries in knn filter?
- [ ] nulls in metadata
- [ ] partition `x in (...)` handling
- [ ] blobs/date/datetime
- [ ] uuid/ulid perf
- [ ] Aux columns: `NOT NULL` constraint
- [ ] Metadata columns: `NOT NULL` constraint
- [ ] Partition key: `NOT NULL` constraint
- [ ] dictionary encoding?
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
- [ ] perf
- [ ] aux: cache INSERT
- [ ] aux: LEFT JOIN on `_rowids` queries to avoid N lookup queries