Machine learning use case / proposal #47

vade · 2022-08-22T20:00:35Z

Hi friends.

Firstly, SurrealDB looks really interesting, well put together and well documented. It's clear a lot of effort has gone into it.

I wanted to see if some machine learning features made sense within SurrealDB, as its existing suite of features makes it nearly perfect for ML semantic search, especially considering geo-coordinates are included / planned.

Some features that would be really interesting:

- High-dimensional vectors as a native type. Typically embeddings can be anywhere from 128-dimensional float vectors to vectors of 10k+ length. Being able to have multiple vector columns would be very helpful for enterprise products, as many DBs assume one vector per table, which is troublesome.
- Approximate nearest neighbor search. Using an ANN algorithm like HNSW would allow fast indexes to be built for approximate nearest neighbor recall of features 'near' an exemplar, using some metric like Euclidean distance or cosine similarity. This is useful for semantic search, for image matching / similar image finding, and much more.
- Pre-filtering / hybrid search. The ability to pre-filter semantic / vector search using more traditional WHERE clauses. This is where things get interesting, allowing for powerful queries that leverage tabular data and vector semantic sorting. The leaders in hybrid vector DBs like Weaviate / Milvus do pre-filtering before dispatching to the ANN engine to guarantee relevant results (post-filtering is tricky and has lots of pitfalls).
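
A minimal sketch of the pre-filter-then-rank idea, assuming plain Python lists stand in for the table and a brute-force scan stands in for the ANN engine. The names `hybrid_search`, `rows`, and the `city` column are hypothetical and chosen only for illustration; none of this is SurrealDB's, Weaviate's, or Milvus's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hybrid_search(rows, predicate, query, k):
    """Apply the tabular filter first, then rank only the survivors by similarity."""
    candidates = [r for r in rows if predicate(r)]
    candidates.sort(
        key=lambda r: cosine_similarity(r["embedding"], query),
        reverse=True,
    )
    return candidates[:k]

# Hypothetical table: the `city` column plays the role of a WHERE clause target.
rows = [
    {"id": "a", "city": "Paris", "embedding": [1.0, 0.0]},
    {"id": "b", "city": "Paris", "embedding": [0.6, 0.8]},
    {"id": "c", "city": "Rome", "embedding": [0.9, 0.1]},
]

top = hybrid_search(rows, lambda r: r["city"] == "Paris", [1.0, 0.1], k=2)
print([r["id"] for r in top])
```

Because the filter runs before the ranking, row `c` can never take one of the `k` slots, which is the property that post-filtering struggles to guarantee.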

I'm aware this is a BROAD discussion and possibly out of scope, but I wanted to introduce the idea. Also note, I'm brand new to SurrealDB and have no clue how realistic the above requests are, but I will say this is where things are going!

Thank you.

vade · 2022-08-22T20:02:20Z

Author

I also wanted to note that many of the existing hybrid DB solutions don't support ACID, and have limited scope for migrations, backups, transactions and rollbacks. That makes deploying those solutions tricky, and leaves end users wanting more 'RDB'-like features in hybrid DBs.

0 replies

yxdunc · 2022-12-24T09:43:59Z

I could consider making a pull request to add vector support with an ANN algorithm.
In the contribution guidelines I saw we should go through the RFC process but the link gives me a 404 (maybe a private repo?).

How could we validate this feature @tobiemh before considering any actual implementation?

11 replies

vade Jan 7, 2023
Author

Also sorry to butt in, hopefully the above is helpful, not intending to be patronizing! <3

tobiemh Jan 7, 2023
Maintainer

@vade could you give some examples of how this would be used/queried from within SurrealQL (in an SQL query itself)? As in, could you give non-working made-up examples of how this might be done/queried in SQL?

vade Jan 7, 2023
Author

I can't believe this worked:

SELECT image_path, image_tile, image_UUID, embeddings
FROM images
WHERE image_UUID = 'your-made-up-UUID'
LIMIT 1

INTO query_embedding;

SELECT image_path, image_tile, image_UUID, embeddings
FROM images
WHERE similarTo(embeddings, query_embedding) > 0.8
LIMIT 20;

tobiemh Jan 8, 2023
Maintainer

Very helpful @vade. Would it be possible to provide suggestions / examples for data ingestion?

vade Jan 8, 2023
Author

Maybe just use an insert into the embeddings column? Like

insert into images (embeddings)
values ('0.01, 0.00, 0.15, 0.0243')

or some such?

Jacse · 2023-02-13T12:59:39Z

Hi @tobiemh @vade, I've begun an initial implementation. I can't find the RFC process described anywhere. Should I open an issue describing the design and implementation details/questions that we should settle on?

My suggestion is to first implement a working vector type with appropriate functions (distance functions, math, normalization etc.) and then afterwards focus on providing approximate indexing.

8 replies

tobiemh Feb 14, 2023
Maintainer

I'm easy with either! What changes have you made so far, and what were the implementation details that you were thinking of?

This is what I had started on:

Storing of vectors

Arrays (vectors) can already be stored in SurrealDB. These could already store floating point numbers (embeddings), and could store 1536+ f64 values.

Defining an embedding

We would want to improve the method for defining a field which stores an embedding. Currently you could do...

DEFINE FIELD embedding ON TABLE comment TYPE array;
DEFINE FIELD embedding.* ON TABLE comment TYPE float;

However, this does not guarantee that an embedding is always a certain length (1536 floats long). So perhaps we could improve this by enabling something like the following...

DEFINE FIELD embedding ON TABLE comment TYPE array[float, 1536];
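
To make the intent of a fixed-length float array concrete, here is a rough sketch of the check such a field definition would imply on write. The function name `validate_embedding` is hypothetical, and this is only an illustration of the constraint, not how SurrealDB implements field types.

```python
def validate_embedding(value, length):
    """Return the embedding as a list of floats, or raise ValueError.

    Mirrors the two guarantees discussed above: every element is a
    number, and the array has exactly `length` elements.
    """
    if not isinstance(value, (list, tuple)):
        raise ValueError("embedding must be an array")
    if len(value) != length:
        raise ValueError(f"expected {length} values, got {len(value)}")
    out = []
    for x in value:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise ValueError("embedding elements must be numbers")
        out.append(float(x))
    return out

print(validate_embedding([1, 2.5, 3], 3))
```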

Comparing embeddings

When comparing embeddings, there are two methods. The first is brute force, reading every record in the table, and comparing the embeddings - which obviously won't scale. The second is creating an INDEX for the embeddings. However it's important to use comparison algorithms which can be utilised on top of the underlying key-value storage layer so that this functionality can work in both single-node and distributed scenarios. As a result, we would need to use Z-Order Indexing for similarity searching of embeddings in the INDEX.
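
For readers unfamiliar with Z-order indexing, the sketch below shows the core trick under some assumptions: each float is first quantized to a small unsigned integer, and the bits of all dimensions are then interleaved into a single sortable key (a Morton code) that could serve as a key in a key-value store. The helper names `quantize` and `morton_key`, the value range, and the bit width are illustrative choices, not part of any SurrealDB design.

```python
def quantize(vector, lo, hi, bits):
    """Map floats in [lo, hi] to integers in [0, 2**bits - 1], clamping outliers."""
    top = (1 << bits) - 1
    return [min(top, max(0, int((x - lo) / (hi - lo) * top))) for x in vector]

def morton_key(coords, bits):
    """Interleave the low `bits` bits of each non-negative integer coordinate.

    Bit `b` of dimension `d` lands at position `b * len(coords) + d`,
    so points close together in space tend to share high-order key bits.
    """
    key = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * dims + d)
    return key

cells = quantize([0.0, 1.0, 0.5], lo=0.0, hi=1.0, bits=4)
print(cells, morton_key(cells, bits=4))
```

One thing the sketch makes visible is key size: with `d` dimensions and `b` bits per dimension the key is `d * b` bits long, so a 1536-dimensional embedding at 8 bits per dimension yields a 12,288-bit key.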

Summary

With this as a starting point, we could use 3rd party embedding providers, and store the embeddings in SurrealDB, allowing us to analyse and compare (and therefore search) those records which match the specified embedding.

Would be great to get your thoughts on this!

yxdunc Feb 14, 2023

Comparing embeddings

I think z-order indexing will work well for KNN search if we use the L1 or L2 norm. However, I think it will not be trivial to make it work for cosine similarity. I'm guessing one solution could be to construct the index with a different space-filling curve that works with ranges on angles.

Of course, we would create a Z-Index running through all the dimensions of the embeddings.

Once we have the Z-index and associated search functions we can efficiently search the nearest neighbours by defining a maximum search distance to the query embedding. This maximum search distance lets us define a hypercube (a set of ranges on all the dimensions) around our query embedding.

With the Z-Index we efficiently get the number of neighbouring embeddings within the defined hypercube. Then if the number of results is bigger than the expected number of nearest neighbours (K) we can do a binary search by varying the size of the hypercube (maximum search distance) until we get the expected number (K) of nearest neighbours.
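
The binary search over the hypercube size could look roughly like the sketch below. It assumes a linear scan (`in_hypercube` over every point) as a stand-in for the Z-index range count, which is the part a real index would make fast; the function names and the `steps` parameter are hypothetical.

```python
def in_hypercube(point, center, half_width):
    """True if the point lies within `half_width` of the center on every axis."""
    return all(abs(p - c) <= half_width for p, c in zip(point, center))

def knn_by_shrinking_cube(points, query, k, max_radius, steps=30):
    """Binary-search the cube half-width until it holds roughly k points.

    `hi` always keeps a half-width known to contain at least k points
    (provided `max_radius` does); `lo` keeps one known to contain fewer.
    """
    lo, hi = 0.0, max_radius
    for _ in range(steps):
        mid = (lo + hi) / 2
        count = sum(in_hypercube(p, query, mid) for p in points)
        if count >= k:
            hi = mid
        else:
            lo = mid
    inside = [p for p in points if in_hypercube(p, query, hi)]
    # Rank the candidates by squared L2 distance and keep the k closest.
    inside.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
    return inside[:k]

points = [(0, 0), (1, 1), (2, 2), (5, 5), (-1, 0)]
print(knn_by_shrinking_cube(points, (0, 0), k=3, max_radius=10))
```

Note that a hypercube is a ball under the maximum (L-infinity) norm, so the final candidate set is only an approximation of the true L2 neighbours: a point in a cube corner can be farther away in L2 terms than a point just outside a face.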

I would be happy to implement this part in Rust and then let you integrate it into SurrealDB, as I'm not yet familiar with your codebase.

Jacse Feb 14, 2023

> I'm easy with either! What changes have you made so far, and what were the implementation details that you were thinking of?

I actually made a new type, but I think your solution with reusing arrays is better. I'll write comments here and include my other thoughts in the end.

Storing of vectors

Requirements for embeddings/vectors as I see it:

Math operations:
- Addition, subtraction, multiplication with scalar and other vectors
- Normalization of vectors (make them unit length)
Distance calculations between two vectors (L1, L2, cosine distance, and dot product)
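
As a plain reference for the operations listed above, here is a small sketch using Python lists. The function names are illustrative only, and "multiplication with other vectors" is assumed here to mean element-wise multiplication, which the original list does not specify.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def scale(a, s):
    """Multiply a vector by a scalar."""
    return [x * s for x in a]

def multiply(a, b):
    """Element-wise product (an assumption about what the list item means)."""
    return [x * y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, so 0 means same direction and larger means less similar."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(l2_distance([0, 0], [3, 4]), cosine_distance([1, 0], [0, 1]))
```

Since a distance shrinks as vectors become more alike, a "find similar items" filter on a cosine distance would normally use an upper bound, whereas a filter on cosine similarity would use a lower bound.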

In my initial implementation I actually created a new type and then built these functions on it (we can optimize the operations with SIMD instructions/BLAS). You query similar movies like so:

SELECT * FROM movies WHERE vector::dist::cosine((SELECT embedding FROM movies WHERE id='lotr3'), embedding) > 0.8

We need to be able to distinguish between a normal array and a vector, but maybe we can do that by just looking at the type/field definition.

DEFINE FIELD embedding ON TABLE comment TYPE array[float, 1536];

Often in semantic search we use cosine distance as the distance measure. This can be computed much more efficiently if all vectors are normalized (unit length), because cosine similarity is then simply the dot product between the vectors. Some vector databases have an option to define a vector as unit length and enforce this on insert. Given the laissez-faire nature of current field type definitions, maybe this is not a path we want to go down now, and we should rather leave it to the application/user to ensure their vectors are correct.
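
A small sketch of the "enforce unit length on insert" option, assuming a toy in-memory store. `UnitVectorStore` and its methods are hypothetical names; the point is only that once every stored vector has length 1, ranking by cosine similarity needs nothing more than a dot product.

```python
import math

def normalize(v):
    """Scale a vector to unit length, rejecting the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class UnitVectorStore:
    """Toy store that normalizes every vector once, at insert time."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, vector):
        self.rows[key] = normalize(vector)

    def most_similar(self, query, k):
        # Both sides are unit length, so the dot product is the cosine similarity.
        q = normalize(query)
        ranked = sorted(self.rows, key=lambda key: dot(self.rows[key], q), reverse=True)
        return ranked[:k]

store = UnitVectorStore()
store.insert("x", [3.0, 4.0])
store.insert("y", [4.0, 3.0])
store.insert("z", [-1.0, 0.0])
print(store.most_similar([10.0, 0.0], k=2))
```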

Comparing embeddings

> However it's important to use comparison algorithms which can be utilised on top of the underlying key-value storage layer so that this functionality can work in both single-node and distributed scenarios. As a result, we would need to use Z-Order Indexing for similarity searching of embeddings in the INDEX.

I haven't heard of anyone using z-order indexing for ANN, but I'll look into it. I looked at the current best-performing algorithms myself and thought we could implement HNSW. If you're not familiar, it's essentially a graph on multiple levels with increasing density: the first level is sparse, the second denser, and so on.

The advantages of this index are:

- Really good performance (one of the best, used by most if not all vector databases)
- Relatively simple to implement and understand
- The index doesn't need to be built periodically or after everything has been inserted, but can be constructed and maintained as data arrives
- It is based on graphs, which SurrealDB already supports

It might be a silly idea but my thought was that we would simply create "hidden" graphs between vectors to create this index. Like this:

Vector A <- hnsw_lvl1 ->                                                   Vector F <- hnsw_lvl1 -> Vector Z

Vector A <- hnsw_lvl2 -> Vector B <- hnsw_lvl2 -> Vector D <- hnsw_lvl2 -> Vector F <- hnsw_lvl2 -> Vector Z

When searching (with any distance function) for a query vector q, we first traverse the top level, moving from an entry vector (e.g. Vector A) until we find a local minimum (no links from the current vertex lead closer to the query vector). Suppose this is Vector F. We then move down a level and repeat, traversing level 2 in the same fashion starting from Vector F. Rinse and repeat.
In the end we take the nearest vectors, calculate their distance with brute force and return the scores.
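
The level-by-level traversal described above can be sketched as follows, using the same vertex names as the diagram and made-up one-dimensional vectors. This is a simplification: it tracks a single best vertex per level, whereas full HNSW implementations keep a list of candidates. The names `greedy_search` and `layered_search` are hypothetical.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(vectors, neighbours, entry, query):
    """Walk one level until no linked vertex is closer to the query."""
    current = entry
    while True:
        best = min(
            neighbours.get(current, []),
            key=lambda n: l2(vectors[n], query),
            default=current,
        )
        if l2(vectors[best], query) < l2(vectors[current], query):
            current = best
        else:
            return current  # local minimum on this level

def layered_search(vectors, layers, entry, query):
    """`layers[0]` is the sparsest level; each result seeds the next, denser level."""
    current = entry
    for neighbours in layers:
        current = greedy_search(vectors, neighbours, current, query)
    return current

# Made-up 1-D positions for the vertices in the diagram.
vectors = {"A": [0.0], "B": [1.0], "D": [3.0], "F": [5.0], "Z": [9.0]}
level1 = {"A": ["F"], "F": ["A", "Z"], "Z": ["F"]}
level2 = {"A": ["B"], "B": ["A", "D"], "D": ["B", "F"], "F": ["D", "Z"], "Z": ["F"]}

print(layered_search(vectors, [level1, level2], "A", [3.2]))
```

With these values the sparse level jumps from A straight to F, and the dense level then refines F to D, mirroring the walk in the description.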

To do this efficiently on top of a KV store, we'd need to ensure that graph links are stored next to each other, but I think SurrealDB already ensures this, right?

To construct this type of index (and other index types down the road) we'd need some new way to define other index types and their parameters.
Maybe something like

DEFINE INDEX @name ON [TABLE] @table FIELDS @fields [TYPE] @type

Example use:

DEFINE INDEX hnsw ON TABLE movies FIELDS embedding TYPE hnsw{ M = 64 }

We also need a way to define parameters. In your suggestion for array parameters you've used position-based parameters, and that definitely works for the array type, but I'm wondering if we should use named parameters instead to make it more general and usable e.g. in defining parameters for indexes like here. Alternatively we just make all index parameters positional as well, like:

DEFINE INDEX hnsw ON TABLE movies FIELDS embedding TYPE hnsw[64]

Jacse Feb 20, 2023

@tobiemh tagging you here in case you didn't get a notification

mysticaltech Apr 10, 2023

@tobiemh Pinging you as this has become super important with the massive rise in embeddings used with LLMs. See the LangChain vector store integrations; I would really love to see SurrealDB listed there.

timinou · 2023-02-14T12:39:31Z

If you haven't come across it, this database is focused around vector stores, maybe there's room for inspiration? https://weaviate.com/ :)

1 reply

vade Feb 14, 2023
Author

FWIW, I use Weaviate, and it's fucking awesome. Highly suggest it.

mysticaltech · 2023-04-25T19:57:17Z

Folks, it's me again. Vector embeddings and similarity search (via SVM) seem simple enough to pull off. You are missing out on the LLM craze and the massive adoption of AI apps, so please think about implementing this so that we can have all our data live in one place. Supabase has pgvector, and you do not offer anything along these lines! It's very important.

10 replies

timinou Apr 25, 2023

As someone who's using SurrealDB and Weaviate, I don't think it's just about having embeddings and similarity search, but a whole approach to treating vector data and communication with generative pipelines.

SurrealDB isn't just a database built on Rust, but a whole approach to dealing with databases. What would SurrealQL look like with vectors? When I go into the details of the thought experiment, I'd understand why SurrealDB developers would prefer making their vision for SurrealQL happen before they integrate vector abilities.

Weaviate would suck at user data management, or database triggers; similarly, SurrealDB would suck at interfacing with LLM APIs, even with a vector store. They work really well together when they're bound to the domain space they're meant for.

Only my two cents 🤓

mysticaltech Apr 25, 2023

@timinou You bring up interesting points. However, look at how Supabase did it for Postgres: it's super simple, it works wonders, and it's probably as widely used as Weaviate, or more so. Including a short video below where this is shown. I would love your thoughts on this!

supabase-vector.mp4

vade Apr 25, 2023
Author

I'd ask you to thoroughly benchmark PGVector vs Weaviate. In my tests, Weaviate is 10x - 40x faster at scale, using non faceted, or faceted search on indexed vectors.

Just saying. PGVector does get ease of use / install with existing tooling correct and is def a great tool in the toolbox.

mysticaltech Apr 25, 2023

Basically, it's just a normal table with a vector type and similarity search, and it's super well integrated into the ecosystem and works wonders. What more do you need? In an ideal world, I would want this for SurrealDB.

mysticaltech Apr 25, 2023

> I'd ask you to thoroughly benchmark PGVector vs Weaviate. In my tests, Weaviate is 10x - 40x faster at scale, using non faceted, or faceted search on indexed vectors.
>
> Just saying. PGVector does get ease of use / install with existing tooling correct and is def a great tool in the toolbox.

Interesting! Maybe SurrealDB can fix this, as it's already way faster than Postgres, let alone Postgres with an extension.

Korolen · 2023-07-27T07:03:05Z

Brilliant work on the database with the ultra-comprehensive capability, m'friends! :)

I'm with @mysticaltech on this one -- AI is popping off in a big way and won't stop increasing exponentially. For that reason, supporting AI use-cases I think is potentially a huge value-prop and dramatic business win going into the future.

So many app developers will want some basic similarity-search in their apps, for example; as a developer/architect, being able to throw your embeddings in alongside everything else, in the same transactions, etc -- that's very nice. Being able to do it all in one integrated system that can run anywhere, even if it doesn't have advanced ML features or high levels of convenience wrappers, is very attractive.

Just nailing natively the core functionality of high-performance indexed vector searching -- I think that's probably most of the way there, since the community can add convenience layers on top of that for you.

Not to mention, SurrealDB supports many use-cases such as running fully embedded that are literally impossible in almost any other DB, let alone the small subset of those DBs that support vector searching, so such support would make it stand out even further. It's a killer feature for me, for instance, since the embeddability is essential for offline-first and/or cross-platform apps/PWAs, and there's nothing else that offers this combination.

So, my take is that it would be quite a missed opportunity and thus sad if this type of functionality wasn't addressed, given that SurrealDB is otherwise such a complete sweep of a solution.

4 replies

mysticaltech Jul 30, 2023

Well said @Korolen, exactly! It's such a shame really, as it was almost perfect. Now, we have to use many data stores, manage and synchronize them all, unless using Supabase or Postgres, but then again no graph there, for that you either go with ArangoDB, Nebula, or Neo4j. So yeah, too bad.

tobiemh Jul 30, 2023
Maintainer

Hi @Korolen and @mysticaltech apologies for missing your comments. We are already working on a number of features related to this discussion, rest assured 😀!

mysticaltech Jul 30, 2023

Very good to hear that @tobiemh 🥳🚀

Korolen Jul 31, 2023

@tobiemh Fantastic, that's wonderful to hear! My goodness, am I beyond bullish on SurrealDB haha ;) And, soo grateful to y'all to get to be building with it 😁⚡️!✨

Sending my appreciation to you+Jamie and everyone involved. 😌🙏

Thanks for the reply, and best of luck with this and all other challenges.

naisofly · 2023-10-05T15:30:03Z

Wondering what the folks in this thread think of the new features in SurrealDB? 💭

3 replies

mysticaltech Oct 6, 2023

Just seen SurrealML, looks hot 🔥

maxwellflitton Oct 6, 2023
Maintainer

I'm glad you're excited. I'm just working on getting the build right so it's still easy to deploy the database with machine learning inference ability. Do you have any plans to use machine learning?

mysticaltech Oct 7, 2023

Very happy to hear. At this point in time, no plan yet for that. But it's good to know that SurrealDB is moving in that very important direction. Keep up the good work!
