Lance(DB) without database connection - just static file(s) #3447

do-me · 2025-02-13T17:13:41Z

do-me
Feb 13, 2025

Hi folks,
I'm a big fan of Lance and of LanceDB's separation-of-concerns approach in general. Did I understand correctly that LanceDB always requires a proper DB connection in the remote setup?

I'm exploring ways of using a static index directory or file that you can dump anywhere and query in under 1s via range requests. I was wondering whether you have any good ideas on whether this is possible with the Lance data format and, e.g., DuckDB or similar.
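To make the range-request idea concrete, here is a minimal sketch in Python using only the standard library. The function names and the example URL in the usage note are hypothetical and only for illustration; the one thing taken from HTTP itself is that byte ranges in the `Range` header are inclusive on both ends and that a server honoring the header answers with status 206.

```python
import urllib.request


def build_range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Build a GET request for `length` bytes starting at byte offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url: str, start: int, length: int) -> bytes:
    """Fetch one byte range; a server that honors Range answers 206 Partial Content."""
    with urllib.request.urlopen(build_range_request(url, start, length)) as resp:
        if resp.status != 206:
            raise RuntimeError(f"server ignored the Range header (status {resp.status})")
        return resp.read()
```

For example, `build_range_request("https://example.com/index.bin", 1024, 4096)` asks for bytes 1024 through 5119 of a (made-up) static file. A browser client would do the same thing with `fetch` and a `Range` header.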

Hosting static files is cheap and often free (GitHub, Hugging Face). My idea is that you could have a super lean, e.g. JS-only, frontend retrieving data from a massive static index.

I wrote about this idea and a hacky but somewhat working research demo here: https://github.com/do-me/flatgeobuf-vectordb

I guess if one could somehow optimize ANN or HNSW for the columnar nature of the data format and range requests, there might be some kind of way.

Would love to hear your ideas and thoughts!

wjones127 · 2025-02-13T19:29:59Z

wjones127
Feb 13, 2025
Maintainer

> My idea is that you could have a super lean, e.g. JS-only frontend retrieving data from a massive static index.

That would be pretty cool. I think for LanceDB, you would basically need to write a pure-JS implementation of Lance to read the Lance files and the table format. Right now, all of that is implemented in a Rust library. It wouldn't be a small lift to port that to JS, but I don't think it would be impossible.


westonpace · 2025-02-14T01:43:59Z

westonpace
Feb 14, 2025
Maintainer

You might be able to compile a Rust core down to Wasm. The Lance file format would be overkill. We have lots of code for statistics, compression, legacy versions of the format, etc., which would just bloat the Wasm artifact. Arrow IPC would be a lighter-weight option, and there is already support for that in JS.

So I think the main thing you'd want to extract into a wasm core would be the lance-index (lance-linalg, etc.) code to actually do the search.

Less than 1s will probably be impossible for very large datasets (100s of millions of rows+). Your client is going to start with nothing cached in memory at all. This means you will have to load portions of the large index file into memory. For IVF/PQ this means a few rather large loads. For graph-based algorithms this would mean many small loads.

More realistic would probably be a 30-60 second loading time and then very fast searches after that.
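A back-of-the-envelope way to reason about these loads is to split the cost into round trips and transfer time. The sketch below is only that: the function name is hypothetical, and the latency, bandwidth, request counts, and byte counts are made-up inputs, not measurements of Lance or of any real index.

```python
def estimate_load_seconds(
    sequential_rounds: int,
    total_bytes: int,
    rtt_seconds: float,
    bytes_per_second: float,
) -> float:
    """Rough cold-start cost: one round trip per dependent request, plus transfer time.

    Requests issued in parallel count as a single round; only requests that
    must wait for an earlier response add another round trip.
    """
    return sequential_rounds * rtt_seconds + total_bytes / bytes_per_second


# Hypothetical inputs: 50 ms round trip, 100 Mbit/s (12.5 MB/s) connection.
RTT = 0.05
BANDWIDTH = 12_500_000

# "A few rather large loads": few rounds, many bytes (numbers are invented).
few_large = estimate_load_seconds(3, 30_000_000, RTT, BANDWIDTH)

# "Many small loads": many dependent rounds, few bytes (numbers are invented).
many_small = estimate_load_seconds(200, 1_000_000, RTT, BANDWIDTH)
```

With inputs like these, the first pattern is dominated by bandwidth and the second by latency, so which one is cheaper depends entirely on the connection and on how many requests can be issued in parallel.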

2 replies

do-me Feb 14, 2025
Author

Compiling the fundamentals to wasm looks like the most promising idea. So at the moment there is nothing like LanceDB-JS/wasm or any third-party lance plugins for other DBs that already work in the browser like DuckDB-wasm?

That makes sense: the client needs some information about where exactly to look or, in range-request terms, which bytes to retrieve.
Do you have an idea of how large such an index would be, or a diagram of how it grows in size and loses accuracy as the dataset grows? Datasets over 100M are rare; I was thinking of smaller but still large ones, like 1-10M, considering that you can fit 100k full embeddings with 5 decimals in a 66 MB .json.gz anyway.

I think, for an MVP, it would be absolutely OK to have to load a big index just once. It could be conveniently cached in the browser, like embedding models with transformers.js. The index is probably always smaller than the embedding models anyway (24-500 MB).
In the long run, that would probably lead to the question whether such an index can be updated easily (accepting a potential little loss in accuracy), without having to recreate it from scratch, but that's not important at this stage.

Coming from geospatial, just as a side note and analogy, there are two approaches to retrieving small chunks of large datasets:

- The protomaps way, with range requests against a single file, or
- The openfreemap way, with millions of directories

Interesting discussion about what's faster here and here. Personally I'd favor range requests as it's just more elegant, easier to deal with and widely supported.

westonpace Feb 14, 2025
Maintainer

> Do you have some kind of idea how large such an index would be, or some kind of diagram how it grows in size / loses accuracy when growing? Datasets over 100M are rare, I was thinking more of smaller but still large ones like 1-10M, considering that you can fit

It's all very tunable, and we have more detailed sizing guides on the site. It depends on the model and how aggressively you want to compress your data (you can think of a vector index as a lossy compressed version of the embeddings). Picking some standard models and numbers, I came up with anywhere from 100-800 MB at 10M rows (I could easily be wrong here). The index size is more or less linear in the number of vectors.
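To see where numbers in that range can come from, here is a rough sizing sketch based on textbook product quantization, where each vector is stored as one small code per sub-vector. It is an assumption-laden estimate, not LanceDB's actual on-disk layout: the function names, the 16 dimensions per sub-vector, and the 8 bits per code are illustrative choices, and codebooks, partition centroids, and row ids are ignored.

```python
def raw_float32_bytes(num_vectors: int, dim: int) -> int:
    """Size of the uncompressed embeddings stored as float32."""
    return num_vectors * dim * 4


def pq_code_bytes(
    num_vectors: int,
    dim: int,
    dims_per_subvector: int = 16,  # assumed, tunable
    bits_per_code: int = 8,        # assumed: 256 centroids per sub-vector
) -> int:
    """Size of the PQ codes alone; grows linearly with the number of vectors."""
    if dim % dims_per_subvector != 0:
        raise ValueError("dim must be divisible by dims_per_subvector")
    codes_per_vector = dim // dims_per_subvector
    return num_vectors * codes_per_vector * bits_per_code // 8


# 10M vectors of a 768-dimensional model (an arbitrary example):
raw = raw_float32_bytes(10_000_000, 768)  # 30_720_000_000 bytes, about 30.7 GB
codes = pq_code_bytes(10_000_000, 768)    # 480_000_000 bytes, about 480 MB
```

Under these assumptions a 384-dimensional model would need half as much and a 1536-dimensional model twice as much, which is the kind of spread you get from "picking some standard models".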

You don't need all of that data for a single search. A single search can probably get away with a few MB of data or less. However, if you do enough searches, you will eventually end up loading all of it.

One problem is that most of these lossy compression techniques are pretty rough. If you care about good recall you normally need to do a second "refine" stage. So, for example, if you want the 10 best results you search the index to find the 50 closest. Then you grab the uncompressed version of those 50 vectors. This is 50 HTTP requests in your model. You can run them in parallel and each one returns about 4KB or less of data (the per request overhead will likely be more significant than the size of the data unless it's a slow connection). I have no good model for requests/s or bandwidth for "over the internet" but something to think about.
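The refine stage described above can be sketched as follows. This is a minimal illustration, not Lance's implementation: it assumes a hypothetical flat file of float32 vectors stored back to back (so a row's byte range can be computed from its id), uses Euclidean distance, and takes the already-fetched candidate vectors as a plain dictionary.

```python
import math
from typing import Dict, List, Sequence, Tuple


def vector_byte_range(row_id: int, dim: int, bytes_per_value: int = 4) -> Tuple[int, int]:
    """Inclusive byte range of one vector in an assumed flat float32 file."""
    start = row_id * dim * bytes_per_value
    return start, start + dim * bytes_per_value - 1


def refine(
    query: Sequence[float],
    candidates: Dict[int, Sequence[float]],
    k: int,
) -> List[int]:
    """Re-rank candidates from the lossy index by exact distance and keep the best k.

    `candidates` maps row id -> uncompressed vector, e.g. the 50 vectors
    fetched after the index search when only the 10 best are wanted.
    """
    ranked = sorted(candidates.items(), key=lambda item: math.dist(query, item[1]))
    return [row_id for row_id, _ in ranked[:k]]


# Tiny example with 2-dimensional vectors:
top = refine([0.0, 0.0], {7: [3.0, 4.0], 2: [1.0, 0.0], 9: [0.0, 2.0]}, k=2)
# top == [2, 9]
```

With this layout a 1024-dimensional float32 vector occupies 4096 bytes, so `vector_byte_range(3, 1024)` gives `(12288, 16383)`; one such range per candidate is what turns into the 50 parallel requests.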

Also, you'd then want to take your top 10 results and go fetch them (e.g. fetch the images, captions, whatever other metadata you have).

> Personally I'd favor range requests as it's just more elegant, easier to deal with and widely supported.

Oh yeah, definitely prefer range requests. It's what we use for cloud storage.
