-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate possibility of replacing SQLite with RocksDB #306
Comments
Splitting up the indexing into two places sounds like it would be a good idea. Only concern might be consistency i.e. only one of the indexes is populated but not the other. We should also use column families so that they can be written to independently better and partitioned. Another concern I have is that with this approach, we can't easily find "all paths that belong to a single file" because theres one mapping between file and symbol stacks, and the other is symbol stack to file path, so it would need more than one query. With this proposal, the following mapping would make sense I believe:
Are symbol stacks guaranteed to be unique? If so, how does it deal with duplicate files? With regards to dropping in-memory and neutral naming, key-value being simpler gives us the benefit of being able to create abstractions more easily. I propose we design everything with RocksDB first, then once the implementation has settled, we create a |
Collecting the current database schema and queries here so we have a compact overview of the access patterns we'd need to support (leaving out strict and not null requirements for now). Schema
Notes:
Queries
So, in terms of performance, the last three queries are the once to focus on. |
From having a look at your schema: stack-graphs/stack-graphs/src/storage.rs Lines 37 to 59 in ee490e9
It seems like you don't have indexes on the things you're SELECTing, with the exception of CREATE INDEX file_paths_file_local_id ON file_paths(file, local_id);
CREATE INDEX root_paths_symbol_stack ON root_paths(symbol_stack); If the data that you're fetching is in the index then SQLite won't even bother reading the row proper, so instead of the above some even better indexes would be: CREATE INDEX graph_file_value ON graph(file, value);
CREATE INDEX file_paths_file_local_id_value ON file_paths(file, local_id, value);
CREATE INDEX root_paths_symbol_stack_file_value ON root_paths(symbol_stack, file, value); If you use |
In github#306 it's reported that SQLite is slow. This adds indexes so lookups will be faster. The intention is to speed up the bottom 3 queries from github#306 (comment) as they are "the ones to focus on". Specifically this should speed up: > * Load graph data for a file > > SELECT value FROM graphs WHERE file = ? > > Called many times during path stitching > > * Load paths for a file node > > SELECT file,value from file_paths WHERE file = ? AND local_id = ? > > Called many times during path stitching. > > * Load paths for root node > > SELECT file, value from root_paths WHERE symbol_stack = ? > > Called many times during path stitching. If the data that you're fetching is in the index then SQLite won't even bother reading the row proper, just pulling the data streight from the index (good for data locality). As such I've included not just the columns we're using in the `WHERE` clause, but also the ones from `SELECT` too. I've not actually tested the performance impact as I'm not familiar with this project (this is more of a drive-by) and don't know what benchmarks to use.
Raised a PR here: #308 |
In github#306 it's reported that SQLite is slow. This adds indexes so lookups will be faster. The intention is to speed up the bottom 3 queries from github#306 (comment) as they are "the ones to focus on". Specifically this should speed up: > * Load graph data for a file > > SELECT value FROM graphs WHERE file = ? > > Called many times during path stitching > > * Load paths for a file node > > SELECT file,value from file_paths WHERE file = ? AND local_id = ? > > Called many times during path stitching. > > * Load paths for root node > > SELECT file, value from root_paths WHERE symbol_stack = ? > > Called many times during path stitching. If the data that you're fetching is in the index then SQLite won't even bother reading the row proper, just pulling the data streight from the index (good for data locality). As such I've included not just the columns we're using in the `WHERE` clause, but also the ones from `SELECT` too. I've not actually tested the performance impact as I'm not familiar with this project (this is more of a drive-by) and don't know what benchmarks to use.
awesome, i noticed the same thing when refactoring the database. wonder what the performance will be like after this gets merged. |
In github#306 it's reported that SQLite is slow. This adds indexes so lookups will be faster. The intention is to speed up the bottom 3 queries from github#306 (comment) as they are "the ones to focus on". Specifically this should speed up: > * Load graph data for a file > > SELECT value FROM graphs WHERE file = ? > > Called many times during path stitching > > * Load paths for a file node > > SELECT file,value from file_paths WHERE file = ? AND local_id = ? > > Called many times during path stitching. > > * Load paths for root node > > SELECT file, value from root_paths WHERE symbol_stack = ? > > Called many times during path stitching. If the data that you're fetching is in the index then SQLite won't even bother reading the row proper, just pulling the data streight from the index (good for data locality). As such I've included not just the columns we're using in the `WHERE` clause, but also the ones from `SELECT` too. I've not actually tested the performance impact as I'm not familiar with this project (this is more of a drive-by) and don't know what benchmarks to use.
@bluskript I just merged #308, would love to hear if this improves anything for you. |
Closing this now that SQLite performance has been improved. |
The current file based storage implementation is based on
rusqlite
. It seems to be rather slow in practice, and we don't really depend on the SQL features. @bluskript suggested to replace it with something else like RocksDB #302. I think this is worth investigating and making the change if it improves performance.Similar to the binary encoding we use, the database itself should be mature, have a stable storage format, and perform well. These are explicit goals of the RocksDB project, making it a good candidate.
Changes
I imagine the following changes to be part of this transition:
One of the main challenges will be maintaining the index of file paths from the root node. The paths coming from root are indexed by partial symbol stacks, and originate from multiple files. Therefore, it should be possible to add paths belonging to a particular file to this shared index, but also to remove the paths from a particular files from the index, while keeping the rest.
See if we can exploit RocksDB's range based indexing to efficiently load/remove data from the database (e.g., all paths belong to a single file). This will require careful key design.
Drop the support for having an in-memory database. RocksDB doesn't support this, and once Allow path stitcher to work on
SQLiteReader
#291 is merged, switching between a database and in-memory data structures should have little impact on code.Find more neutral naming for the storage classes. Currently, they explicitly mention
SQL
, but we should use something to suggest disk, file-based, or persistent storage, without exposing the implementation.The text was updated successfully, but these errors were encountered: