Deserialization/Storage of StackGraphs #114

maxwnewcomer · 2022-09-14T17:42:05Z


I really like the serialization implementation for StackGraphs. It enables the visualization implementation, which is extremely useful. However, for storage, the serialization isn't really useful without a deserialization implementation. Yet the blog post says:

With a stack graph available to us, we can implement “jump to definition:”

The user clicks on a reference.

We load in the stack graphs for each file in the commit, and merge them
together.

We perform a path-finding search starting from the reference node
corresponding to the symbol that the user clicked on, considering
symbol stacks and precedences to ensure that we don’t create any invalid
paths.

Any valid paths that we find represent the definitions that the reference
refers to. We display those in a hover card.

Since StackGraphs are stored for each file, is the point of path stitching to connect precomputed file-based paths with external references?

Is that the reasoning behind only storing StackGraphs for files, and then merging them into another StackGraph? Is it not possible to modify select files within a StackGraph and then, in reference to the previous StackGraph, efficiently recompute paths?

I guess storing paths has a large worst case space complexity, but in practice methods/functions are often only referred to once or twice with a small minority being used widely throughout the repository.

I'm a little confused about how those StackGraphs are stored and loaded, and I would love to do the same.

For example, suppose I wanted to compute affected paths when comparing a PR on a branch by 1) loading the previous (repo-wide) stack graph for that branch, 2) checking affected files (or even lines) from a git diff on the PR, 3) recomputing paths and nodes for only the affected files/lines, and then 4) storing a new StackGraph (or even just storing affected paths in relation to a higher-level branch). Would that be possible?

Answered by hendrikvanantwerpen

hendrikvanantwerpen · 2022-09-15T12:31:08Z

Maintainer

Good questions @maxwnewcomer!

What you are asking about is the incrementality of the system: if one file changes, how much work has to be redone? At the minimum the stack graph for that file has to be recomputed, as you correctly said. But what about those paths? If complete (ref to def) paths were computed in the full graph (all files combined), it is a difficult problem to determine which paths were invalidated, which failed attempts (in the previous graph) may now succeed, etc. Instead of doing that, the path finding algorithm can be split into two phases. The first precomputes partial paths per file, and if the file doesn't change, neither do these. The second phase is stitching these partial paths together to form complete paths, which is done based on all partial paths from all files. But because the partial paths are limited to useful paths (starting at a reference, starting at a definition), this is more efficient than searching for complete paths in the full graph from scratch.
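To make the two phases concrete, here is a minimal sketch in plain Rust. It is a toy model and not the stack-graphs API: `ToyFile`, `Endpoint`, `PartialPath`, `partial_paths_for_file` and `stitch` are all hypothetical names, and symbol stacks and precedences are ignored. It only shows that phase one depends on a single file (so its result can be cached per file), while phase two works on the partial paths of all files.

```rust
use std::collections::HashMap;

/// Where a partial path can start or end in this toy model.
#[derive(Clone, Debug, PartialEq)]
enum Endpoint {
    Reference { file: String, symbol: String },
    /// Shared root node: the symbol still has to be resolved in another file.
    Root { symbol: String },
    Definition { file: String, symbol: String },
}

#[derive(Clone, Debug, PartialEq)]
struct PartialPath {
    start: Endpoint,
    end: Endpoint,
}

/// Hypothetical stand-in for a parsed source file.
struct ToyFile {
    definitions: Vec<String>,
    references: Vec<String>,
}

/// Phase 1: depends on one file only, so the result can be cached per file.
fn partial_paths_for_file(name: &str, file: &ToyFile) -> Vec<PartialPath> {
    let mut paths = Vec::new();
    for symbol in &file.definitions {
        paths.push(PartialPath {
            start: Endpoint::Root { symbol: symbol.clone() },
            end: Endpoint::Definition { file: name.to_string(), symbol: symbol.clone() },
        });
    }
    for symbol in &file.references {
        let start = Endpoint::Reference { file: name.to_string(), symbol: symbol.clone() };
        // A local definition completes the path inside the file; otherwise go via the root.
        let end = if file.definitions.contains(symbol) {
            Endpoint::Definition { file: name.to_string(), symbol: symbol.clone() }
        } else {
            Endpoint::Root { symbol: symbol.clone() }
        };
        paths.push(PartialPath { start, end });
    }
    paths
}

/// Phase 2: join partial paths from all files into (reference, definition) pairs.
fn stitch(cache: &HashMap<String, Vec<PartialPath>>) -> Vec<(Endpoint, Endpoint)> {
    let all: Vec<&PartialPath> = cache.values().flatten().collect();
    let mut complete = Vec::new();
    for path in all.iter().filter(|p| matches!(p.start, Endpoint::Reference { .. })) {
        match &path.end {
            Endpoint::Definition { .. } => complete.push((path.start.clone(), path.end.clone())),
            root @ Endpoint::Root { .. } => {
                for next in all.iter().filter(|p| &p.start == root) {
                    complete.push((path.start.clone(), next.end.clone()));
                }
            }
            Endpoint::Reference { .. } => {}
        }
    }
    complete
}

/// Usage: only the new file is processed in phase 1, then everything is stitched again.
fn example() -> (usize, usize) {
    let mut cache = HashMap::new();
    let a = ToyFile { definitions: vec!["foo".into()], references: vec![] };
    let b = ToyFile { definitions: vec![], references: vec!["foo".into()] };
    cache.insert("a".to_string(), partial_paths_for_file("a", &a));
    cache.insert("b".to_string(), partial_paths_for_file("b", &b));
    let before = stitch(&cache).len();
    // A new file "c" also defines `foo`; the cached entries for "a" and "b" are reused.
    let c = ToyFile { definitions: vec!["foo".into()], references: vec![] };
    cache.insert("c".to_string(), partial_paths_for_file("c", &c));
    (before, stitch(&cache).len())
}
```

In this sketch the reference in `b` resolves to one definition before `c` is added and to two afterwards, without phase one ever being rerun for `a` or `b`. The real library stitches much richer partial paths, so treat this only as an illustration of where the caching boundary sits.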

Your next question is about storing and loading this information. The library currently only serializes the graph and complete paths, because these are necessary for visualization. Being able to load a stack graph would be a first step. Being able to store and load partial paths would be a next step. These partial paths would need to be loaded in with Database::add_partial_path and then resolution is done with Database::find_all_complete_partial_paths.

This could be something to contribute to the library. If you want to pick this up, I would suggest starting with deserialization of stack graphs, as the serialization part is already there. As a follow-up you could do serialization & deserialization for partial paths.

3 replies

maxwnewcomer Sep 15, 2022
Author

@hendrikvanantwerpen I would love to give the deserialization a shot! I know I would personally find it useful. The only thing I'm confused about is that the blog post implies that the stack graphs are "loaded", presumably from memory, and then merged. How are those files stored to be eventually loaded? Or are they all recomputed on request and then cached for later use, with no ability for a longer-term storage solution?

As for the computation of paths, everything you said makes a lot of sense. The curious part of me wonders if there is a way to efficiently recompute paths. From my understanding there are a couple scenarios:

- Reference from unchanged file to definition in changed file
  - A reference can take the same path to a file, but point to a new definition node (quick compute)
  - A reference path goes to a file, but the definition node has moved (implies that there will be a failed import in the running program)
- Reference from changed file to definition in unchanged file
  - Reference previously existed in file and can piggyback on previous path (quick)
  - Reference wasn't in previous file, paths need to be recomputed (normal recompute of path)
- Reference from changed file to definition in changed file
  - Paths fully recomputed

The majority of new references point to existing definitions, or in the case of adding a feature, a new method/function is defined and used either internally or in another file.

I only bring this up because I've recently been a little inspired by some Linus Torvalds talks about the construction of git, where unchanged files point to old files and new files are stored as a changed file on a branch. I get that the paths are really a whole new beast in terms of complexity but it's fun to think that there could be a similar solution for stackgraphs.

Either way, I'll give deserialization a shot! Still a little new to serde, but I'm sure if I bang my head against it for a while something will happen. I really appreciate your thoughtful response.

hendrikvanantwerpen Sep 15, 2022
Maintainer

@maxwnewcomer The storing and loading does happen to a persistent store, but in a way that is specific to our production systems, so that is not part of this crate. As you pointed out, having a way to do so as part of the crate would be very useful.

There are probably ways to do so, and there is an upcoming paper at OOPSLA that does something more along those lines for scope graphs, a related formalism which inspired stack graphs. But there are more scenarios than the ones you've listed. For example, a path may start and end in unchanged files but an intermediary file is changed, or the introduction of a new file may change the resolution of a reference (through shadowing or by adding more results), for example in an unchanged intermediary file. Verifying which paths are still valid and which are not becomes very subtle. Precomputing as much as possible and recomputing what's left works well in practice for us 🤷🏻. The work in that paper is doing all the complicated stuff because their system does not easily allow precomputing so much information.
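The intermediary-file case can be illustrated with a small toy model in plain Rust. Everything here is an assumption made for illustration (`Module`, `Repo`, `resolve`, and the rule that a local definition shadows re-exports); it is not how stack graphs or the paper's system represent things. It shows a cached result going stale even though neither the file containing the reference nor the file containing the old definition changed.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical toy module: local definitions plus the modules it re-exports from.
#[derive(Clone)]
struct Module {
    definitions: HashSet<String>,
    reexports: Vec<String>,
}

type Repo = HashMap<String, Module>;

fn module(defs: &[&str], reexports: &[&str]) -> Module {
    Module {
        definitions: defs.iter().map(|s| s.to_string()).collect(),
        reexports: reexports.iter().map(|s| s.to_string()).collect(),
    }
}

/// Resolve `symbol` as seen from `start`; a local definition shadows re-exports.
/// Returns the defining module and every module the search looked at.
fn resolve(repo: &Repo, start: &str, symbol: &str) -> Option<(String, Vec<String>)> {
    let mut visited: Vec<String> = Vec::new();
    let mut stack = vec![start.to_string()];
    while let Some(current) = stack.pop() {
        if visited.contains(&current) {
            continue;
        }
        visited.push(current.clone());
        if let Some(m) = repo.get(&current) {
            if m.definitions.contains(symbol) {
                return Some((current, visited));
            }
            // Reverse so that the first re-export is searched first.
            stack.extend(m.reexports.iter().rev().cloned());
        }
    }
    None
}

fn example() -> (String, String, bool, bool) {
    let mut repo: Repo = HashMap::new();
    repo.insert("app".into(), module(&[], &["mid"]));
    repo.insert("mid".into(), module(&[], &["lib"]));
    repo.insert("lib".into(), module(&["foo"], &[]));
    let (cached_def, visited) = resolve(&repo, "app", "foo").unwrap();

    // Only the intermediary module changes: it now defines `foo` itself.
    let changed = "mid";
    repo.insert(changed.into(), module(&["foo"], &["lib"]));
    let (fresh_def, _) = resolve(&repo, "app", "foo").unwrap();

    // Naive rule: a cached path is stale only if its start or end file changed.
    let naive_says_stale = changed == "app" || changed == cached_def;
    // Safer rule: stale if any module the search looked at changed.
    let visited_says_stale = visited.iter().any(|m| m == changed);
    (cached_def, fresh_def, naive_says_stale, visited_says_stale)
}
```

Here the cached answer is `lib` while a fresh search finds `mid`, and the naive start-or-end rule fails to notice. Tracking every visited module catches this case, but a newly added file that contributes extra results would need yet another rule, which is the kind of subtlety described above.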

If you start working on it, feel free to open a draft PR. That will make it easier for us to follow along and you can use it to ask for feedback as well, even when still in progress.

maxwnewcomer Sep 15, 2022
Author

@hendrikvanantwerpen I'll have to check out that paper! And yes totally fair about the current storage situation. With the path stuff, I was more just trying to theory craft the most efficient system (and honestly it still was probably way off), no need to possibly break something in the process of fixing something that doesn't need to be fixed.

Once I get some actual progress on the deserialization I'll be sure to create the PR. Trying to get some of this stuff done in-between classes/work/free-time so it might take me a little longer than expected.

oppiliappan · 2023-01-25T10:22:29Z


I see good value in being able to serialize and deserialize a stack graph. @hendrikvanantwerpen, am I right in understanding that the existing JsonStackGraph structure is a lossy serialization? The Position struct, for example, drops two fields: containing_line and trimmed_line. Or is there sufficient data to produce a StackGraph from a JsonStackGraph?

I would like to give this a go by writing a Serialize and Deserialize implementation for StackGraph. Could you give me a leg up with that on a PR?

2 replies

hendrikvanantwerpen Jan 30, 2023
Maintainer

Hi @nerdypepper! Thanks for your interest and offer to contribute.

You're correct that there is only a serialization atm, and being able to deserialize would be very useful. Regarding the absence of source lines in the serialization: I think what is present would be enough to reconstruct a graph that supports all resolution. The actual source text would be lost, but this is often no problem. I would suggest to keep that as is and see if it gives any problems.
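As a rough sketch of what such a lossy round trip looks like, consider the simplified types below. They are hypothetical stand-ins, not the crate's actual `Position` or JSON types: the field names `containing_line` and `trimmed_line` come from the question above, while `StoredPosition`, `store`, `load` and the empty `0..0` fallback are assumptions made for illustration.

```rust
use std::ops::Range;

/// Simplified, hypothetical in-memory position (not the crate's actual type).
#[derive(Clone, Debug, PartialEq)]
struct Position {
    line: usize,
    column: usize,
    /// Byte range of the whole source line.
    containing_line: Range<usize>,
    /// Byte range of the line without surrounding whitespace.
    trimmed_line: Range<usize>,
}

/// What the serialized form keeps in this sketch: the location only.
#[derive(Clone, Debug, PartialEq)]
struct StoredPosition {
    line: usize,
    column: usize,
}

/// Serialize: the two source-line ranges are dropped.
fn store(p: &Position) -> StoredPosition {
    StoredPosition { line: p.line, column: p.column }
}

/// Deserialize: the dropped ranges cannot be recovered, so use empty ranges.
fn load(s: &StoredPosition) -> Position {
    Position { line: s.line, column: s.column, containing_line: 0..0, trimmed_line: 0..0 }
}

/// Round trip: the location survives, the source-line ranges do not.
fn example() -> (bool, bool) {
    let original = Position { line: 3, column: 4, containing_line: 40..58, trimmed_line: 44..58 };
    let restored = load(&store(&original));
    let location_kept = (restored.line, restored.column) == (original.line, original.column);
    let ranges_kept = restored.containing_line == original.containing_line
        && restored.trimmed_line == original.trimmed_line;
    (location_kept, ranges_kept)
}
```

If resolution only depends on the graph structure, as suggested above, then losing the ranges should only affect features that need the source text itself, such as showing the surrounding line.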

I'm happy to assist you in this. Let me know if you have questions or would like a review of your code. Might I suggest that you open a (draft) PR early? That will make it easier to discuss things.

I think json.rs will be the first place to start on this. Note that there's an open PR that adds serialization for more data types, but this should not interfere with working on what's already there.

oppiliappan Feb 1, 2023

@hendrikvanantwerpen thanks for the reply! I'll open a PR, and we can continue the discussion there.

ghost · 2023-06-22T14:56:21Z


Note that deserialization of stack graphs is now supported and built in, so implementing deserialization yourself is no longer needed.

This requires using the serde feature if I am not mistaken.

0 replies
