| 1 | +--- |
| 2 | +title: "Build a Vector Extension for Postgres - Introduction" |
| 3 | +meta_title: "Build a Vector Extension for Postgres - Introduction" |
| 4 | +description: "" |
| 5 | +date: 2024-12-18T08:00:00Z |
| 6 | +image: "/images/posts/2024/build_a_vector_extension_for_postgres_introduction/bg.png" |
| 7 | +categories: ["vector database", "Postgres"] |
| 8 | +author: "SteveLauC" |
| 9 | +tags: ["vector database", "Postgres"] |
| 10 | +draft: false |
| 11 | +--- |
| 12 | + |
| 13 | +## Why and What |
| 14 | + |
| 15 | +Vector databases are a hot topic these days. I have always been curious about what they are and how they work under the hood, so let's build one ourselves. Building a whole new database from scratch is not practical; we need some building blocks, or better, an existing database system to build on. Postgres has a long-standing reputation for extensibility, which makes it a perfect fit for our needs, and projects like [pgvector][pgvector] have already demonstrated that adding vector support to Postgres as an extension is viable. |
| 16 | + |
| 17 | +We are going to implement vector support for Postgres, but which features exactly do we need to implement? This is not a hard question: the Wikipedia definition of a [vector database][vector_db_wikipedia] points us in the right direction: |
| 18 | + |
| 19 | +> A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor algorithms so that one can search the database with a query vector to retrieve the closest matching database records |
| 20 | + |
| 21 | +Alright, so we need to enable Postgres to store vectors and to perform Top-K queries, i.e., for a given input vector, Postgres should return the K vectors that are most similar (or closest) to it. Expressed in SQL, it would look like this: |
| 22 | + |
| 23 | +```sql |
| 24 | +-- Create a table with a column of type `vector(3)`, where 3 is the dimension of the vector |
| 25 | +CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3)); |
| 26 | + |
| 27 | +-- Insert vectors, Postgres should store them! |
| 28 | +INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]'); |
| 29 | + |
| 30 | +-- Now, Postgres should return the Top-5 vectors that are most similar to |
| 31 | +-- [3, 1, 2] |
| 32 | +SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5; |
| 33 | +``` |
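
To make the Top-K query above concrete, here is a minimal sketch in plain Rust (no Postgres or `pgrx` involved) of what it conceptually asks for: score every stored vector against the query, sort by that score, and keep the first K rows. It assumes `<=>` means cosine distance, as it does in pgvector; the operator in this series may end up using a different metric. The names `cosine_distance`, `top_k`, and `rows` are hypothetical and exist only for illustration.

```rust
/// Cosine distance between two equal-length vectors: 1 - cos(angle).
/// Assumption: this mirrors what `<=>` means in pgvector.
/// A zero vector yields NaN, which this sketch does not special-case.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Exact (brute-force) Top-K: score every row, sort ascending by distance,
/// and keep the first `k`. `rows` is a hypothetical stand-in for the
/// `items` table, holding (id, embedding) pairs.
fn top_k(rows: &[(i64, Vec<f32>)], query: &[f32], k: usize) -> Vec<(i64, f32)> {
    let mut scored: Vec<(i64, f32)> = rows
        .iter()
        .map(|(id, v)| (*id, cosine_distance(v, query)))
        .collect();
    // `total_cmp` gives a total order over f32, so sorting never panics.
    scored.sort_by(|l, r| l.1.total_cmp(&r.1));
    scored.truncate(k);
    scored
}
```

Under this cosine-distance assumption, the two rows from the SQL example and the query `[3,1,2]` rank `[4,5,6]` ahead of `[1,2,3]`. A real vector database usually avoids this full scan by using an approximate nearest neighbor index, as the Wikipedia definition mentions.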
| 34 | + |
| 35 | +The example SQL makes things clear. To summarize, we need to: |
| 36 | + |
| 37 | +1. Implement a `vector` type for Postgres that accepts a dimension parameter |
| 38 | +2. Implement the `<=>` binary operator, which takes two vectors and returns the distance between them (a smaller value means more similar, so the `ORDER BY` above puts the closest vectors first) |
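
As a rough illustration of the first item, the sketch below parses a text literal such as `'[1,2,3]'` and checks it against a declared dimension, the `3` in `vector(3)`. It is plain Rust with no `pgrx` involved; the `Vector` struct and its `parse` function are hypothetical names, not the API that the extension will end up with.

```rust
/// A hypothetical in-memory stand-in for a `vector(n)` value.
#[derive(Debug, PartialEq)]
struct Vector {
    values: Vec<f32>,
}

impl Vector {
    /// Parse a literal such as "[1,2,3]" and check it against the declared
    /// dimension (the `3` in `vector(3)`).
    fn parse(text: &str, dimension: usize) -> Result<Vector, String> {
        // The literal must be wrapped in square brackets.
        let inner = text
            .trim()
            .strip_prefix('[')
            .and_then(|rest| rest.strip_suffix(']'))
            .ok_or_else(|| format!("expected a literal like [1,2,3], got {text:?}"))?;

        // Every comma-separated element must be a valid float.
        let values = inner
            .split(',')
            .map(|item| {
                item.trim()
                    .parse::<f32>()
                    .map_err(|e| format!("invalid element {item:?}: {e}"))
            })
            .collect::<Result<Vec<f32>, String>>()?;

        // Reject values whose length does not match the column's dimension.
        if values.len() != dimension {
            return Err(format!(
                "expected {dimension} dimensions, got {}",
                values.len()
            ));
        }
        Ok(Vector { values })
    }
}
```

Calling `Vector::parse("[1,2,3]", 3)` succeeds, while `Vector::parse("[1,2]", 3)` returns an error, which is the behavior we would want from inserting a two-element vector into a `vector(3)` column.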
| 39 | + |
| 40 | +## Set up the environment |
| 41 | + |
| 42 | +I will use Rust and a library called [`pgrx`][pgrx]. To install Rust, follow the instructions [here][install_rust], then run this command to install `cargo-pgrx`, a cargo sub-command that manages everything related to `pgrx`: |
| 43 | + |
| 44 | +```sh |
| 45 | +$ cargo install --locked cargo-pgrx |
| 46 | +$ cargo pgrx --version # to verify that it gets installed |
| 47 | +``` |
| 48 | + |
| 49 | +Now we need a Postgres server to run and test our project. To make things easier, I will let `pgrx` install a brand-new Postgres for me. At the time of writing, [Postgres 17 is the latest version][pg17_release], so I will use it. |
| 50 | + |
| 51 | +`pgrx` builds Postgres from source, so you need to ensure these [requirements][build_pg_requirements] are satisfied. `pgrx` also has a page about its [system requirements][pgrx_system_requiremens], but the Postgres documentation is thorough and deserves a read. Once you have everything set up, run: |
| 52 | + |
| 53 | +```sh |
| 54 | +$ cargo pgrx init --pg17 download |
| 55 | +``` |
| 56 | + |
| 57 | +## The initial commit |
| 58 | + |
| 59 | +Now let's write some code. `cargo pgrx`, just like `cargo`, provides a `new` sub-command to create new projects. Say we call our project `pg_vector_ext`; run: |
| 60 | + |
| 61 | +```sh |
| 62 | +$ cargo pgrx new pg_vector_ext |
| 63 | +``` |
| 64 | + |
| 65 | +```sh |
| 66 | +$ cd pg_vector_ext |
| 67 | +$ tree . |
| 68 | +pg_vector_ext/ |
| 69 | +├── Cargo.toml |
| 70 | +├── pg_vector_ext.control |
| 71 | +├── sql |
| 72 | +└── src |
| 73 | + ├── bin |
| 74 | + │ └── pgrx_embed.rs |
| 75 | + └── lib.rs |
| 76 | + |
| 77 | +4 directories, 4 files |
| 78 | +``` |
| 79 | + |
| 80 | +As we can see, `pgrx` creates some template files for us. For now, we only care about `src/lib.rs`. |
| 81 | + |
| 82 | +```sh |
| 83 | +$ bat src/lib.rs |
| 84 | +───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 85 | + │ File: src/lib.rs |
| 86 | +───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 87 | + 1 │ use pgrx::prelude::*; |
| 88 | + 2 │ |
| 89 | + 3 │ ::pgrx::pg_module_magic!(); |
| 90 | + 4 │ |
| 91 | + 5 │ #[pg_extern] |
| 92 | + 6 │ fn hello_pg_vector_ext() -> &'static str { |
| 93 | + 7 │ "Hello, pg_vector_ext" |
| 94 | + 8 │ } |
| 95 | + 9 │ |
| 96 | + 10 │ #[cfg(any(test, feature = "pg_test"))] |
| 97 | + 11 │ #[pg_schema] |
| 98 | + 12 │ mod tests { |
| 99 | + 13 │ use pgrx::prelude::*; |
| 100 | + 14 │ |
| 101 | + 15 │ #[pg_test] |
| 102 | + 16 │ fn test_hello_pg_vector_ext() { |
| 103 | + 17 │ assert_eq!("Hello, pg_vector_ext", crate::hello_pg_vector_ext()); |
| 104 | + 18 │ } |
| 105 | + 19 │ |
| 106 | + 20 │ } |
| 107 | + 21 │ |
| 108 | + 22 │ /// This module is required by `cargo pgrx test` invocations. |
| 109 | + 23 │ /// It must be visible at the root of your extension crate. |
| 110 | + 24 │ #[cfg(test)] |
| 111 | + 25 │ pub mod pg_test { |
| 112 | + 26 │ pub fn setup(_options: Vec<&str>) { |
| 113 | + 27 │ // perform one-off initialization when the pg_test framework starts |
| 114 | + 28 │ } |
| 115 | + 29 │ |
| 116 | + 30 │ #[must_use] |
| 117 | + 31 │ pub fn postgresql_conf_options() -> Vec<&'static str> { |
| 118 | + 32 │ // return any postgresql.conf settings that are required for your tests |
| 119 | + 33 │ vec![] |
| 120 | + 34 │ } |
| 121 | + 35 │ } |
| 122 | +───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 123 | +``` |
| 124 | + |
| 125 | +Ignoring the `tests` module (it exists only for testing), we can see that `pgrx` creates a function `hello_pg_vector_ext()`. The `#[pg_extern]` attribute exposes it to Postgres, so it is callable from SQL once we run the project via: |
| 126 | + |
| 127 | +> Before running it, you need to edit the `Cargo.toml` file: in the `features` section, change the default feature to `pg17`. Optionally, you can remove the `pg*` features other than `pg17`, as they won't be used: |
| 128 | +> |
| 129 | +> ```toml |
| 130 | +> [features] |
| 131 | +> default = ["pg17"] |
| 132 | +> pg17 = ["pgrx/pg17", "pgrx-tests/pg17" ] |
| 133 | +> pg_test = [] |
| 134 | +> ``` |
| 135 | + |
| 136 | +```sh |
| 137 | +$ cargo pgrx run |
| 138 | +``` |
| 139 | + |
| 140 | +This starts the Postgres 17 instance and connects to it via `psql`. Now we can install our extension and run the function: |
| 141 | + |
| 142 | +```sql |
| 143 | +pg_vector_ext=# CREATE EXTENSION pg_vector_ext; |
| 144 | +CREATE EXTENSION |
| 145 | +pg_vector_ext=# SELECT hello_pg_vector_ext(); |
| 146 | + hello_pg_vector_ext |
| 147 | +---------------------- |
| 148 | + Hello, pg_vector_ext |
| 149 | +(1 row) |
| 150 | +``` |
| 151 | + |
| 152 | +This is our first attempt at `pgrx` and also our first commit to the project. In the next post, I will implement the `vector` type so that Postgres can store vectors. |
| 153 | + |
| 154 | +[pgvector]: https://github.com/pgvector/pgvector |
| 155 | +[vector_db_wikipedia]: https://en.wikipedia.org/wiki/Vector_database |
| 156 | +[pgrx]: https://github.com/pgcentralfoundation/pgrx |
| 157 | +[install_rust]: https://www.rust-lang.org/tools/install |
| 158 | +[pg17_release]: https://www.postgresql.org/about/news/postgresql-17-released-2936/ |
| 159 | +[build_pg_requirements]: https://www.postgresql.org/docs/current/install-requirements.html |
| 160 | +[pgrx_system_requiremens]: https://github.com/pgcentralfoundation/pgrx/?tab=readme-ov-file#system-requirements |