Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate possibility of replacing SQLite with RocksDB #306

Closed
hendrikvanantwerpen opened this issue Aug 2, 2023 · 7 comments
Closed

Comments

@hendrikvanantwerpen
Copy link
Collaborator

The current file based storage implementation is based on rusqlite. It seems to be rather slow in practice, and we don't really depend on the SQL features. @bluskript suggested to replace it with something else like RocksDB #302. I think this is worth investigating and making the change if it improves performance.

Similar to the binary encoding we use, the database itself should be mature, have a stable storage format, and perform well. These are explicit goals of the RocksDB project, making it a good candidate.

Changes

I imagine the following changes to be part of this transition:

  • One of the main challenges will be maintaining the index of file paths from the root node. The paths coming from root are indexed by partial symbol stacks, and originate from multiple files. Therefore, it should be possible to add paths belonging to a particular file to this shared index, but also to remove the paths from a particular files from the index, while keeping the rest.

    • Perhaps the problem is slightly simplified if we add an intermediate step. Per file an index of partial symbol stacks for root paths, and one global index for partial symbol stacks pointing to files contributing paths with that stack.
  • See if we can exploit RocksDB's range based indexing to efficiently load/remove data from the database (e.g., all paths belong to a single file). This will require careful key design.

  • Drop the support for having an in-memory database. RocksDB doesn't support this, and once Allow path stitcher to work on SQLiteReader #291 is merged, switching between a database and in-memory data structures should have little impact on code.

  • Find more neutral naming for the storage classes. Currently, they explicitly mention SQL, but we should use something to suggest disk, file-based, or persistent storage, without exposing the implementation.

@bluskript
Copy link
Contributor

bluskript commented Aug 4, 2023

Splitting up the indexing into two places sounds like it would be a good idea. Only concern might be consistency i.e. only one of the indexes is populated but not the other. We should also use column families so that they can be written to independently better and partitioned.

Another concern I have is that with this approach, we can't easily find "all paths that belong to a single file" because theres one mapping between file and symbol stacks, and the other is symbol stack to file path, so it would need more than one query.

With this proposal, the following mapping would make sense I believe:

f:${file_path} -> <serialized bincode array containing symbol stacks>
s:${symbol_stack} -> string path

Are symbol stacks guaranteed to be unique? If so, how does it deal with duplicate files?

With regards to dropping in-memory and neutral naming, key-value being simpler gives us the benefit of being able to create abstractions more easily. I propose we design everything with RocksDB first, then once the implementation has settled, we create a DB trait with a small set of functions (ideally 4-5) that provide the same functionality and use that instead.

@hendrikvanantwerpen
Copy link
Collaborator Author

Collecting the current database schema and queries here so we have a compact overview of the access patterns we'd need to support (leaving out strict and not null requirements for now).

Schema

TABLE graphs(
  file TEXT PRIMARY KEY, tag TEXT, error TEXT, value BLOB
);
TABLE file_paths(
  file TEXT, local_id INTEGER, value BLOB,
  FOREIGN KEY(file) REFERENCES graphs(file)
);
TABLE root_paths(
  file TEXT, symbol_stack TEXT, value BLOB NOT NULL,
  FOREIGN KEY(file) REFERENCES graphs(file)
);

Notes:

  • graphs is the only table with a unique key per row: file
  • file_paths does not have a unique key per row, but in blob storage we might key on (file, local_id) and store a list of paths.
  • root_paths is the crux, because of the access patterns below. It contains paths from the same file with different symbol keys, and paths with the same symbol key from different files.

Queries

  • Delete all data for a file:

    DELETE FROM file_paths WHERE file=?
    DELETE FROM root_paths WHERE file=?
    DELETE FROM graphs WHERE file=?
    

    This is called once per file that is indexed.

  • Delete all data for files that are inside a directory:

    DELETE FROM file_paths WHERE path_descendant_of(file, ?)
    DELETE FROM root_paths WHERE path_descendant_of(file, ?)
    DELETE FROM graphs WHERE path_descendant_of(file, ?)
    

    This is called once per clean commands.

  • Store data for a file:

    INSERT INTO graphs (file, tag, error, value) VALUES (?, ?, ?, ?)
    INSERT INTO file_paths (file, local_id, value) VALUES (?, ?, ?)
    INSERT INTO root_paths (file, symbol_stack, value) VALUES (?, ?, ?)
    

    Called once per indexed file.

  • List files that are inside a directory:

    SELECT file, tag, error FROM graphs WHERE path_descendant_of(file, ?)
    

    Called once per status command.

  • Load graph data for a file

    SELECT value FROM graphs WHERE file = ?
    

    Called many times during path stitching

  • Load paths for a file node

    SELECT file,value from file_paths WHERE file = ? AND local_id = ?
    

    Called many times during path stitching.

  • Load paths for root node

    SELECT file, value from root_paths WHERE symbol_stack = ?
    

    Called many times during path stitching.

So, in terms of performance, the last three queries are the once to focus on.

@wmanley
Copy link
Contributor

wmanley commented Aug 7, 2023

From having a look at your schema:

const SCHEMA: &str = r#"
CREATE TABLE metadata (
version INTEGER NOT NULL
) STRICT;
CREATE TABLE graphs (
file TEXT PRIMARY KEY,
tag TEXT NOT NULL,
error TEXT,
value BLOB NOT NULL
) STRICT;
CREATE TABLE file_paths (
file TEXT NOT NULL,
local_id INTEGER NOT NULL,
value BLOB NOT NULL,
FOREIGN KEY(file) REFERENCES graphs(file)
) STRICT;
CREATE TABLE root_paths (
file TEXT NOT NULL,
symbol_stack TEXT NOT NULL,
value BLOB NOT NULL,
FOREIGN KEY(file) REFERENCES graphs(file)
) STRICT;
"#;

It seems like you don't have indexes on the things you're SELECTing, with the exception of graphs.file, which has an index because it's a primary key. Based on the statements you listed as important you could add:

CREATE INDEX file_paths_file_local_id ON file_paths(file, local_id);
CREATE INDEX root_paths_symbol_stack ON root_paths(symbol_stack);

If the data that you're fetching is in the index then SQLite won't even bother reading the row proper, so instead of the above some even better indexes would be:

CREATE INDEX graph_file_value ON graph(file, value);
CREATE INDEX file_paths_file_local_id_value ON file_paths(file, local_id, value);
CREATE INDEX root_paths_symbol_stack_file_value ON root_paths(symbol_stack, file, value);

If you use CREATE INDEX IF NOT EXISTS then the whole thing is idempotent, so could be run every time you open the database.

wmanley added a commit to wmanley/stack-graphs that referenced this issue Aug 7, 2023
In github#306 it's reported that SQLite is slow.  This adds indexes so lookups will be
faster.  The intention is to speed up the bottom 3 queries from
github#306 (comment) as
they are "the ones to focus on".

Specifically this should speed up:

> * Load graph data for a file
>
>         SELECT value FROM graphs WHERE file = ?
>
>   Called many times during path stitching
>
> * Load paths for a file node
>
>         SELECT file,value from file_paths WHERE file = ? AND local_id = ?
>
>   Called many times during path stitching.
>
> * Load paths for root node
>
>         SELECT file, value from root_paths WHERE symbol_stack = ?
>
>   Called many times during path stitching.

If the data that you're fetching is in the index then SQLite won't even bother
reading the row proper, just pulling the data streight from the index (good for
data locality).  As such I've included not just the columns we're using in the
`WHERE` clause, but also the ones from `SELECT` too.

I've not actually tested the performance impact as I'm not familiar with this
project (this is more of a drive-by) and don't know what benchmarks to use.
@wmanley
Copy link
Contributor

wmanley commented Aug 7, 2023

Raised a PR here: #308

wmanley added a commit to wmanley/stack-graphs that referenced this issue Aug 7, 2023
In github#306 it's reported that SQLite is slow.  This adds indexes so lookups will be
faster.  The intention is to speed up the bottom 3 queries from
github#306 (comment) as
they are "the ones to focus on".

Specifically this should speed up:

> * Load graph data for a file
>
>         SELECT value FROM graphs WHERE file = ?
>
>   Called many times during path stitching
>
> * Load paths for a file node
>
>         SELECT file,value from file_paths WHERE file = ? AND local_id = ?
>
>   Called many times during path stitching.
>
> * Load paths for root node
>
>         SELECT file, value from root_paths WHERE symbol_stack = ?
>
>   Called many times during path stitching.

If the data that you're fetching is in the index then SQLite won't even bother
reading the row proper, just pulling the data streight from the index (good for
data locality).  As such I've included not just the columns we're using in the
`WHERE` clause, but also the ones from `SELECT` too.

I've not actually tested the performance impact as I'm not familiar with this
project (this is more of a drive-by) and don't know what benchmarks to use.
@bluskript
Copy link
Contributor

awesome, i noticed the same thing when refactoring the database. wonder what the performance will be like after this gets merged.

wmanley added a commit to wmanley/stack-graphs that referenced this issue Aug 7, 2023
In github#306 it's reported that SQLite is slow.  This adds indexes so lookups will be
faster.  The intention is to speed up the bottom 3 queries from
github#306 (comment) as
they are "the ones to focus on".

Specifically this should speed up:

> * Load graph data for a file
>
>         SELECT value FROM graphs WHERE file = ?
>
>   Called many times during path stitching
>
> * Load paths for a file node
>
>         SELECT file,value from file_paths WHERE file = ? AND local_id = ?
>
>   Called many times during path stitching.
>
> * Load paths for root node
>
>         SELECT file, value from root_paths WHERE symbol_stack = ?
>
>   Called many times during path stitching.

If the data that you're fetching is in the index then SQLite won't even bother
reading the row proper, just pulling the data streight from the index (good for
data locality).  As such I've included not just the columns we're using in the
`WHERE` clause, but also the ones from `SELECT` too.

I've not actually tested the performance impact as I'm not familiar with this
project (this is more of a drive-by) and don't know what benchmarks to use.
@hendrikvanantwerpen
Copy link
Collaborator Author

@bluskript I just merged #308, would love to hear if this improves anything for you.

@hendrikvanantwerpen
Copy link
Collaborator Author

Closing this now that SQLite performance has been improved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants