Ideas for a query language

This is an idea for a level module, but am posting it here because it's too early in the morning/late at night to implement and this will notify exactly the right people.

level is pretty useful as a straight key value store, but sooner or later you need to query something else,
so you add an index. A lot of leveldb modules are about indexes. For example, [level-seach](https://github.com/dominictarr/level-search) which just indexes everything.

The trouble with all these indexes, is to manage them all - which index do you need for which query and when do you add a new index? there is a design trade off here.

It would be so much easier if you could just not think about indexes at all, just have filters,
just `forEach` (scan) over _all the data_ and return only what you want. totally functional and elegant,
just not performant.

But what if you could use that interface, and it would transparently use the right indexes when possible? (or even better, generate them when necessary)

Lets say we have the npm registry, and we want to retrieve all the modules written by @substack,
even though substack has written many modules, it is still less than 1% of the registry, so a full scan
would be 99% wasted effort. Lets say that the average registry document is 5k, that is `95000*0.99*5000=470mb` of unnecessary reads. `475mb` total.

But lets say that we indexed all the maintainers? for simplicity lets say that the average length of both module names and user names is 8 characters. Say we create an index like `{username}:{modulename}` and index this for every module, `95000*(8+8+1)` (1 for the separator) `=   1.6mb` of index data. but now we just need to read through 1% of the index (since they are ordered by username we can just jump straight to substack) `950*17 = 15200` that is only 15k of unnecessary data.
So, now we only read `15200 + 950*5000 = 4.7mb`

Using this index we increase storage requirements by 1.6 mb but increase read perf by 100x.

But what about a more complex query? suppose we want to find modules written by substack that depend (directly) on async? now, a simple way to do this would be scan through the modules written by substack and check whether `async` is in the deps. Lets say there was a time when substack used async, in 9 modules. 1% of the time. the scan costs us 4.7 mb, like before, but we discarded 99% of it.

Okay, so lets create a filter for this. `udm:{username}:{dep}:{module}` where `module` depends on `dep`
this index is more expensive because it has needs another 9 characters, plus we need a tag to distinguish it from the other indexes. `3+1+3*(8+1) = 29` so this index would need `2.75mb`.
how much data would it save us? now we could directly read out the 9 substack modules that use async, `29*9+ 9*5000 = 45k` this is 1000 times less! but here is the problem: how do we know we want to perform this query? what if we want to know all the users that depend on `optimist`? we'd also need a  `dum:{dep}:{username}:{module}` index - and so on, we want our database to be really really fast,
so lets just index every pair of properties twice!

Lets say there are 20 properties on average, how many combinations of 2 properties?
(`20!/(2!*(20-2)!)` which cancels down to `(20*19)/2 = 275`), times 2 because we need 2 indexes,
gives us: `550*29 = 15950`, that is 3 times the size of the document. if there where 30 properties it would be `40*39*29 = 45240`. doubling the size of the document increases the size of the index 3 times,
now 9 times the size of the original document.

Now, at some point, it's better to do a scan than have another index, because the indexes take up too much space, and optimize queries that you never use. Clearly this is dependant on what sort of queries you do in practice, _possibly_ you could have a system that automagically creates indexes when you start to use a query more. _maybe_ even could it know when you have an expensive query?

But is there a nice way that you can gracefully transition between those?
decouple the query interface from the indexes, so you can add and remove indexes
without changing the queries at all. And measuring the efficiency of the queries,
so that you know when new indexes are needed.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Ideas for a query language #75

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Ideas for a query language #75

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions