Reduce stored similar paths and checked cycles #346
Conversation
Performance Summary: comparing base 1acebc7 with head de82b3c on the typescript_benchmark benchmark. For details see the workflow artifacts. Note that performance is tested on the last commits with changes.

Performance Summary: comparing base 1acebc7 with head c6badc5 on the typescript_benchmark benchmark. For details see the workflow artifacts. Note that performance is tested on the last commits with changes.
Force-pushed from c6badc5 to 52d7bd5.
Performance Summary: comparing base 1acebc7 with head 52d7bd5 on the typescript_benchmark benchmark. For details see the workflow artifacts. Note that performance is tested on the last commits with changes.

Performance Summary: comparing base 1acebc7 with head d7bb4ad on the typescript_benchmark benchmark. For details see the workflow artifacts. Note that performance is tested on the last commits with changes.

Performance Summary: comparing base 1acebc7 with head e535492 on the typescript_benchmark benchmark. For details see the workflow artifacts. Note that performance is tested on the last commits with changes.
As follow-on work I think we can relax this restriction by tracking the in-degree of each node and path at construction time, and storing that information in our storage layer. We would then propagate that information through to the paths that we load into a database. Does that sound right?
```diff
-    next_iteration: (VecDeque<PartialPath>, VecDeque<AppendingCycleDetector<H>>),
+    next_iteration: (
+        VecDeque<PartialPath>,
+        VecDeque<AppendingCycleDetector<H>>,
+        VecDeque<bool>,
+    ),
```
In the future, if this grows another field, we'll probably want to turn this into a named struct with named fields instead of a tuple
Probably a good idea, yes :) It's only internal, so it doesn't pollute the API, but it's becoming a bit unwieldy like this :P
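For illustration, a minimal sketch of what such a named struct might look like; the struct and field names are hypothetical, not from the actual codebase, and the imports assume the crate's module layout:

```rust
use std::collections::VecDeque;

use stack_graphs::cycles::AppendingCycleDetector;
use stack_graphs::partial::PartialPath;

// Hypothetical replacement for the growing tuple: the same three
// queues, but with self-documenting field names.
struct NextIteration<H> {
    partial_paths: VecDeque<PartialPath>,
    cycle_detectors: VecDeque<AppendingCycleDetector<H>>,
    was_split: VecDeque<bool>,
}
```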
Yes, I think that is correct. I even considered this, but backed off once I realized I'd have to go through the storage layer as well to get this to work. Still, I think this might be quite a big win, because the initial partial path finding during indexing will build paths for every clickable node in the graph. At query time (where we don't yet have the optimization available) we only look at one or a few references. Writing this, I'm wondering if serializing the partial path finding (i.e., doing every initial path separately) might be a good idea as well? We don't share any work between initial paths, so it might only make a difference in the maximum number of in-flight paths (and thus the memory profile), but not in run time.
I hate to heap more work on you when you're right at the finish line, so I'll describe my feedback merely as suggestions—take 'em or leave 'em, and merge when ready. Either way, good work!
stack-graphs/src/cycles.rs
Outdated
```rust
while idx < possibly_similar_paths.len() {
    match cmp(arena, path, &possibly_similar_paths[idx]) {
        Some(Ordering::Less) => {
            // the new path is betetr, remove the old one
```
Suggested change:

```diff
-            // the new path is betetr, remove the old one
+            // the new path is better, remove the old one
```
```rust
Some(Ordering::Less) => {
    // the new path is betetr, remove the old one
    possibly_similar_paths.remove(idx);
    // keep `idx` which now points to the next element
```
I don't love mutating a collection while iterating through it, but if we're going to do so, this seems like the reasonable way to do so. That said, it could probably use a comment above the loop noting what's going on here (and why).
Good suggestion! In 07ca85d.
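For reference, a minimal self-contained sketch of the pattern under discussion, using a plain integer comparison as a stand-in for the real `cmp` over partial paths:

```rust
use std::cmp::Ordering;

// We mutate the vector while scanning it: `Vec::remove` shifts the
// remaining elements left, so after a removal `idx` already points at
// the next candidate and must only be incremented when we keep one.
fn drop_worse_than(values: &mut Vec<u32>, new: u32) {
    let mut idx = 0;
    while idx < values.len() {
        match new.cmp(&values[idx]) {
            Ordering::Less => {
                // the new value is better, remove the old one
                values.remove(idx);
                // keep `idx`, which now points to the next element
            }
            _ => idx += 1,
        }
    }
}

fn main() {
    let mut v = vec![3, 1, 4, 1, 5];
    drop_worse_than(&mut v, 2);
    assert_eq!(v, vec![1, 1]);
}
```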
stack-graphs/src/graph.rs
Outdated
```diff
@@ -1442,9 +1449,10 @@ pub struct StackGraph {
     pub(crate) nodes: Arena<Node>,
     pub(crate) source_info: SupplementalArena<Node, SourceInfo>,
     node_id_handles: NodeIDHandles,
-    outgoing_edges: SupplementalArena<Node, SmallVec<[OutgoingEdge; 8]>>,
+    outgoing_edges: SupplementalArena<Node, SmallVec<[OutgoingEdge; 4]>>,
+    incoming_edges: SupplementalArena<Node, u32>,
```
Given the other discussion about relaxing the stitching restriction, you might consider not tracking the actual in-degree of each node, but instead tracking something like
```rust
#[repr(u8)]
enum Degree {
    Zero,
    One,
    Multiple,
}
```
That would cut down 4× on the storage space of this new arena and would eventually present a cleaner API if we decide to track this info through the storage layer
That's a nice idea! In cd58abb.
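As a sketch of why three values suffice, an in-degree counter over this enum only needs a saturating increment (the helper name is hypothetical):

```rust
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
enum Degree {
    Zero,
    One,
    Multiple,
}

impl Degree {
    // Hypothetical helper: one more incoming edge saturates at
    // `Multiple`, because the optimization only cares about
    // "zero, one, or more than one".
    fn increment(self) -> Degree {
        match self {
            Degree::Zero => Degree::One,
            Degree::One | Degree::Multiple => Degree::Multiple,
        }
    }
}
```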
I don't have a clear intuition for how that would play out. You're right that it would reduce the maximum memory used. It might increase the run time a bit, depending on how many more times that causes us to cross the Go/Rust FFI boundary. (Though it also might not, if we're doing enough work on the Rust side to drown out the Go FFI overhead.) There's also the cost of parsing each partial path from the storage layer and loading those into the Rust-side data structures. Maybe a middle ground where we serialize the processing of each partial path with a new …
I'll create a follow-up PR to discuss, then.
Performance Summary: comparing base 1acebc7 with head cd58abb on the typescript_benchmark benchmark. For details see the workflow artifacts. Note that performance is tested on the last commits with changes.
#321
This PR makes a couple of changes to similar path detection to reduce memory usage and run time:
Observing that similar but different paths can only arise when the path first splits and then later joins, we only store paths for similar path detection when the following two conditions hold:

- The path was split before, that is, it was extended in two different ways. This is tracked in the path stitcher.
- The path currently ends in a possible join, that is, a node with multiple incoming edges. This could possibly be refined to only consider actual candidates in the database.
Similarly, we can limit how often we check for cycles. A path can only end in a cycle if one of the following holds:

- The start and end node are the same, that is, the path loops around.
- The end node has at least two incoming edges in the graph (or partial paths in the database): one in the path from the start node, and the other coming out of the loop.
Note that both conditions require us to know the number of incoming edges or paths of a node. This makes the optimization incompatible with stitching that dynamically adds paths to the database from one phase to the next. Therefore, the optimization must be explicitly enabled using a flag.
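A minimal sketch of the two gating conditions described above; all names are illustrative, not the crate's actual API (`Degree` is the saturating in-degree from the review discussion):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Degree {
    Zero,
    One,
    Multiple,
}

// Stand-in for a node identifier.
type NodeHandle = u32;

// Store a path for similar path detection only if it was split earlier
// and currently ends at a possible join (in-degree greater than one).
fn should_store_for_similar_path_detection(path_was_split: bool, end_in_degree: Degree) -> bool {
    path_was_split && end_in_degree == Degree::Multiple
}

// Check for cycles only if the path loops back to its start, or the end
// node has a second incoming edge that could come out of a loop.
fn must_check_for_cycles(start: NodeHandle, end: NodeHandle, end_in_degree: Degree) -> bool {
    start == end || end_in_degree == Degree::Multiple
}
```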
Use smaller pre-allocated buckets. Some manual experiments using Collect stitching and database stats #326 suggest that most buckets in the similar path detector are small, and a few are much larger. By reducing the preallocated size of buckets from 8 to 3, we waste less space.
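To illustrate the inline-capacity mechanics (assuming the `smallvec` crate that the diff above already uses):

```rust
use smallvec::SmallVec;

fn main() {
    // A bucket with inline capacity 3: the first three elements live in
    // place; pushing a fourth spills the contents to the heap.
    let mut bucket: SmallVec<[u32; 3]> = SmallVec::new();
    bucket.extend([1u32, 2, 3]);
    assert!(!bucket.spilled()); // still inline
    bucket.push(4);
    assert!(bucket.spilled()); // now heap-allocated
}
```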