Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move DB to bincode #302

Merged
merged 9 commits into from
Aug 2, 2023

Conversation

bluskript
Copy link
Contributor

@bluskript bluskript commented Jul 27, 2023

I have verified this works with load_graphs_for_file_or_directory now, solving the issue I was facing.

In terms of compatibility it's best to increment the database version whenever a change to the data structures being serialized is made, otherwise it will likely fail to deserialize.

Side note: there's likely a lot of room for optimization if this project were to use RocksDB instead of Sqlite. Currently, there doesn't seem to be much usage of SQL's relational features, so it might be worth considering to reduce the database overhead as much as possible.

Also added a Nix flake, this allows any developer to easily spin up a reproducible development environment by running nix develop in the folder, as well as exposing tree-sitter-stack-graphs as a Nix package (note: for this to work Cargo.lock needs to be committed to ensure dependencies are downloaded as the exact versions they are defined as).

@bluskript bluskript requested a review from a team as a code owner July 27, 2023 22:13
@hendrikvanantwerpen
Copy link
Collaborator

I think you are right that it is time to consider other serialization options than those based on serde. Before I commit to one, I want to outline the rough requirements that I have for the serialization:

  1. Mature project. We want to make sure the project is maintained, security vulnerabilities get fixed etc.
  2. Stable wire format. Databases should remain compatible when we update the serialization crate, when another/newer Rust version is used, etc. Ideally, it would only change if the data structures we're serializing change.
  3. Reasonable performance. With the previous constraints, we can see what performs well. Since we're reading from disk, size might weight heavier than pure encoding/decoding speed.

I looked a bit through some of the options, guided by the benchmark list.

  • Something like bincode (especially with varint encoding) seems to fit the bill quite well.
  • Perhaps bitcode is a candidate, although they are less strict with maintaining their wire format.
  • For rkyv, which you propose, I worry that the wire format could easily change with library and/or Rust updates, because it is so tied to the in-memory layout of the data. I couldn't find many details on it, but the fact that this is not an explicitly stated goal worries me.

What are your thoughts, @bluskript?

@hendrikvanantwerpen
Copy link
Collaborator

Regarding the choice for SQLite, that is definitely something that could change. I'm not very happy with its performance, but lack of time and experience with other databases have prevented me from trying something else. There would be similar requirements for a database alternative as for the serialization crate.

@hendrikvanantwerpen
Copy link
Collaborator

Finally, regarding the Nix flake. I'm a little reluctant to add things that we are not necessarily able to maintain, and I think these flakes could be hosted outside of this repo as well. But more importantly, committing the Cargo.lock file is not going to happen for a library like this. This follows the recommendation from the Rust book:

If you’re building a non-end product, such as a rust library that other rust packages will depend on, put Cargo.lock in your .gitignore.

@bluskript
Copy link
Contributor Author

thanks for recommending bincode, i agree its probably a better fit than rkyv for this. I will adjust PR accordingly.

I wanted to migrate the codebase to RocksDB earlier, but the fact that there was an SQliteWriter and SQLiteReader made me concerned about making too many breaking changes to the library :(

To address the same concern as serialization, RocksDB is very stable and maintained by Facebook so it isn't going anywhere, its a simple KV binary-to-binary database which would mean every operation is as fine-grained and lightweight as it gets.

About the Nix flake and Cargo.lock, the primary reason Cargo.lock is recommended to be put in gitignore is for CIs to test with the latest versions instead of being pinned to the versions the project specifies, which is something that can be fixed with 2 lines in CI. imo at least its not a very good rust standard practice, and it unfortunately conflicts with Nix's goal of perfect reproducibility. The main reason I wanted to add it to the project itself and not an outside repo is because it enables people with nix to be able to get a development environment with all of the necessary dependencies by simply running nix develop, removing the need to worry about having a sqlite to link to and for having to pick the right rust toolchain. I understand if you don't think it belongs in this repo though.

@hendrikvanantwerpen
Copy link
Collaborator

thanks for recommending bincode, i agree its probably a better fit than rkyv for this. I will adjust PR accordingly.

Great! In terms of technicalities, I'm thinking to use conditional attributes to derive the various traits depending on whether serde or storage features are enabled. Something like #[cfg_attr(feature = "storage", derive(bincode::Decode, bincode::Encode)] and #[cfg_attr(feature = "storage", derive(serde::Decoder, serde::Encoder)]. It's a bit verbose perhaps, but that way users keep control over the dependencies that are being pulled in. Perhaps the serde module itself can be unconditional then, since it's just some data types. The bincode dependency itself would be optional, and required by the storage feature.

How does that sound?

I wanted to migrate the codebase to RocksDB earlier, but the fact that there was an SQliteWriter and SQLiteReader made me concerned about making too many breaking changes to the library :( To address the same concern as serialization, RocksDB is very stable and maintained by Facebook so it isn't going anywhere, its a simple KV binary-to-binary database which would mean every operation is as fine-grained and lightweight as it gets.

Let's not do everything in one PR :) Thanks for adding some context on RocksDB. I'll read up on it early next week to learn a bit more about it and see if it could be useful.

I understand if you don't think it belongs in this repo though.

Re Cargo.lock, I know opinion is divided on this. I'll ask my colleagues what their feeling about it is, but we might stay with the status quo for now.

@hendrikvanantwerpen
Copy link
Collaborator

Note that I just merged #303, which adds a CI test that triggers the deserialization bug. Once this branch is up-to-date with main and you think it's ready to try a CI run, we should have some feedback on how it works.

@bluskript
Copy link
Contributor Author

Just pushed a new change that moved from rkyv to bincode, as you described its completely conditional and only enables when the storage feature is enabled. I tested it manually and it seems to be working.

@bluskript bluskript force-pushed the work/blusk/rkyv-storage branch 4 times, most recently from a4cc797 to 8ee3a6a Compare July 28, 2023 19:09
@bluskript bluskript changed the title move DB to rkyv move DB to bincode Jul 28, 2023
Copy link
Collaborator

@hendrikvanantwerpen hendrikvanantwerpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. I only hope we can depend on a stable version and move to v2 once that's properly released.

I'm waiting on review for #304 before the performance test CI job works properly. If you want, you can already merge in the changes from that branch, and maybe just use it as the base branch for this PR?

lsp-positions/Cargo.toml Show resolved Hide resolved
stack-graphs/src/storage.rs Outdated Show resolved Hide resolved
@hendrikvanantwerpen
Copy link
Collaborator

I started reviewing the code and found out that my suggestion to depend on bincode 1.3.3 is probably not going to work. This version depends on serde and will have the problem with skipped attributes like postcard had. Although it's not ideal, I'm thinking that we should use the latest RC of v2 after all. Then we can use the bincode specific derives, which will be independent of any serde directives.

@hendrikvanantwerpen
Copy link
Collaborator

@bluskript I did some minor cleanup, which I hoped to push directly to this branch, but I couldn't. So I opened bluskript#1 so you can merge them. I think with those, this will be good to merge.

Make storage feature independent of serde
@hendrikvanantwerpen hendrikvanantwerpen merged commit ee490e9 into github:main Aug 2, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants