added closure table blog post 📰 (#196)

shivasurya · Feb 11, 2025 · 833da9a
1 parent 17b5623
commit 833da9a
Showing 2 changed files with 165 additions and 0 deletions.
diff --git a/docs/public/assets/cpf-blog-3.png b/docs/public/assets/cpf-blog-3.png
diff --git a/docs/src/content/docs/blog/code-pathfinder-closure-table-hierarchical-queries.mdx b/docs/src/content/docs/blog/code-pathfinder-closure-table-hierarchical-queries.mdx
@@ -0,0 +1,165 @@
+---
+title: Closure Tables - Deconstructing Code Hierarchies
+description: "This blog post explores how closure tables simplify hierarchical queries in source code parsing with practical examples and pseudocode."
+template: splash
+author: "@sshivasurya"
+pubDate: "2025-02-10"
+---
+
+import PostHogLayout from '../../../layouts/PostHogLayout.astro';
+import { Card } from '@astrojs/starlight/components';
+
+<PostHogLayout>
+</PostHogLayout>
+
+
+  <Card title="">
+    <div style=" margin: 2rem auto; padding: 0 1.5rem; max-width: 800px;">
+    ![Code-Pathfinder - Closure Table Concept](/assets/cpf-blog-3.png)
+
+    ### Intro 
+    When working with complex hierarchical data, such as the Abstract Syntax Trees (ASTs) generated by source code parsers, choosing the right data model can significantly impact the efficiency of your queries. Over the past few months, I've explored several solutions for representing hierarchies, each with its own advantages and drawbacks.
+
+    ### Exploring Hierarchical Solutions
+
+    - **Adjacency List:**  
+    The most straightforward approach, where each node stores a reference to its immediate parent. It's simple and easy to maintain, but it falls short when you need to retrieve an entire branch of the hierarchy without writing complex recursive queries.
+
+    - **Nested Sets:**  
+    This model represents hierarchies by storing left and right boundaries for each node, which can make querying entire subtrees very fast. However, nested sets become cumbersome when inserting or deleting nodes, because each change often requires recalculating the boundaries of many other nodes.
+
+    - **Materialized Path:**  
+    Here, each node stores the full path from the root to itself. This simplifies certain queries and can be efficient for read-heavy operations. On the downside, it can lead to redundancy and may require extra work to update paths when the hierarchy changes.
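To make the trade-offs above concrete, here is a minimal Python sketch that encodes one small tree in all three models and fetches the descendants of a node in each. The four-node tree, the dictionary layouts, and the function names are assumptions made for illustration; none of them come from Code-PathFinder.

```python
# Hypothetical tree: 1 is the root, 2 and 3 are its children, 4 is a child of 2.
parent = {1: None, 2: 1, 3: 1, 4: 2}                    # adjacency list
bounds = {1: (1, 8), 2: (2, 5), 4: (3, 4), 3: (6, 7)}   # nested sets: (left, right)
paths = {1: "1/", 2: "1/2/", 3: "1/3/", 4: "1/2/4/"}    # materialized paths


def descendants_adjacency(node_id):
    """Follow parent links level by level; in SQL this needs a recursive query."""
    found, frontier = set(), [node_id]
    while frontier:
        current = frontier.pop()
        children = [n for n, p in parent.items() if p == current]
        found.update(children)
        frontier.extend(children)
    return found


def descendants_nested_sets(node_id):
    """A descendant's interval lies strictly inside its ancestor's interval."""
    left, right = bounds[node_id]
    return {n for n, (l, r) in bounds.items() if left < l and r < right}


def descendants_materialized_path(node_id):
    """A descendant's path starts with its ancestor's path."""
    prefix = paths[node_id]
    return {n for n, p in paths.items() if p != prefix and p.startswith(prefix)}
```

All three functions return the same set for a given node, but only the adjacency-list version has to loop until it runs out of children. That loop is the step a relational database would express as a recursive query.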
+
+    After evaluating these alternatives, I discovered the **closure table** approach: a solution that precomputes and stores every ancestor–descendant relationship. Reading the hierarchy no longer requires recursive queries, which offers a simple, powerful way to navigate complex hierarchies. The trade-off is extra storage and extra writes whenever the tree changes.
+
+    ### The Example: A Java-like Class AST
+
+    Consider the following Java-like code:
+
+    ```java
+    public class Calculator {
+        public int compute(int x, int y) {
+            if (x > y) {
+                return add(x, y);
+            } else {
+                return multiply(x, y);
+            }
+        }
+
+        private int add(int a, int b) {
+            return a + b;
+        }
+
+        private int multiply(int a, int b) {
+            return a * b;
+        }
+    }
+    ```
+
+    This code can be represented by an AST with nodes for:
+    ```
+    •	Class Declaration (e.g., Calculator)
+        •	Method Declarations (e.g., compute, add, multiply)
+            •	Parameters for each method
+            •	Block Statements and Control Structures (e.g., the if-else statement)
+            •	Method Invocations (e.g., add(x, y) and multiply(x, y))
+            •	Expressions (e.g., binary expressions for comparisons and arithmetic)
+    ```
+
+    Each AST node is assigned a unique identifier. For example:
+
+    ```
+    •	Node 1: ClassDeclaration: Calculator
+        •	Node 2: MethodDeclaration: compute
+            •	Node 3: Parameter: int x
+            •	Node 4: Parameter: int y
+            •	Node 5: Block
+                •	Node 6: IfStatement
+                    •	Node 7: Condition: (x > y)
+                        •	(Possibly further nested nodes)
+                    •	Node 8: ThenBlock
+                        •	Node 9: ReturnStatement
+                            •	Node 10: MethodInvocation: add(x, y)
+                                •	Node 11: Argument: x
+                                •	Node 12: Argument: y
+                    •	Node 13: ElseBlock
+                        •	Node 14: ReturnStatement
+                            •	Node 15: MethodInvocation: multiply(x, y)
+                                •	Node 16: Argument: x
+                                •	Node 17: Argument: y
+        •	Node 18: MethodDeclaration: add
+            •	Node 19: Parameter: int a
+            •	Node 20: Parameter: int b
+            •	Node 21: Block
+                •	Node 22: ReturnStatement
+                    •	Node 23: BinaryExpression: a + b
+        •	Node 24: MethodDeclaration: multiply
+            •	Node 25: Parameter: int a
+            •	Node 26: Parameter: int b
+            •	Node 27: Block
+                •	Node 28: ReturnStatement
+                    •	Node 29: BinaryExpression: a * b
+    ```
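The numbering in the listing is consistent with a pre-order walk, where a node receives its ID before any of its children. Assuming that is how the IDs are handed out, the following sketch numbers a small slice of the tree (Nodes 1 to 5). The nested-tuple AST format and the `assign_ids` helper are hypothetical and only serve the illustration.

```python
from itertools import count

# Hypothetical AST encoding: each node is a (label, [children]) tuple.
ast = ("ClassDeclaration: Calculator", [
    ("MethodDeclaration: compute", [
        ("Parameter: int x", []),
        ("Parameter: int y", []),
        ("Block", []),
    ]),
])


def assign_ids(tree):
    """Return (id, parent_id, label) rows, numbering nodes in pre-order from 1."""
    next_id = count(1)
    rows = []

    def visit(node, parent_id):
        label, children = node
        node_id = next(next_id)          # the node gets its ID before its children
        rows.append((node_id, parent_id, label))
        for child in children:
            visit(child, node_id)

    visit(tree, None)
    return rows


rows = assign_ids(ast)
```

The resulting rows form a plain adjacency list (each node knows only its direct parent), which is the raw material a closure table is built from.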
+
+    ### The Closure Table Concept
+
+    A closure table stores every possible ancestor–descendant pair from the AST along with a depth value indicating how many levels separate the nodes. Here’s what that means in simple terms:
+
+    - **Self-Relationship**: Every node is its own ancestor at depth 0.
+    - **Direct and Indirect Relationships**: For instance, the Calculator class (Node 1) is not only the direct parent of the compute method (Node 2) but also an indirect ancestor of every node inside that method. Similarly, the compute method is an ancestor of its block (Node 5) and all nodes nested within that block (e.g., Node 10 for the method invocation).
+
+    A simplified excerpt of a closure table might look like this. Node 10 sits six levels below Calculator (compute → Block → IfStatement → ThenBlock → ReturnStatement → MethodInvocation), so its depth is 6 from Node 1 and 5 from Node 2:
+
+    | ancestor_id | descendant_id | depth | Description |
+    |-------------|---------------|-------|-------------|
+    | 1 | 1 | 0 | Calculator → Calculator |
+    | 1 | 2 | 1 | Calculator → compute |
+    | 1 | 10 | 6 | Calculator → MethodInvocation: add(x, y) |
+    | 2 | 2 | 0 | compute → compute |
+    | 2 | 10 | 5 | compute → MethodInvocation: add(x, y) |
+    | ... | ... | ... | ... |
+
+
+    By precomputing these relationships, closure tables enable fast and simple lookups. Instead of writing complex recursive queries to fetch all nodes under a given method, you can directly query the closure table for the desired relationships.
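As a rough illustration of such lookups, the sketch below loads a handful of closure rows into an in-memory SQLite database and answers three common questions with plain `WHERE` filters. The table name `closure`, the helper functions, and the choice of SQLite are assumptions for the example; only the chain Calculator (1) → add (18) → Block (21) → ReturnStatement (22) → BinaryExpression (23) is loaded, with the parameter nodes left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE closure ("
    " ancestor_id INTEGER NOT NULL,"
    " descendant_id INTEGER NOT NULL,"
    " depth INTEGER NOT NULL,"
    " PRIMARY KEY (ancestor_id, descendant_id))"
)

# One root-to-leaf chain from the listing; every pair (i <= j) becomes a row,
# including the depth-0 self-relationships.
chain = [1, 18, 21, 22, 23]
rows = [
    (ancestor, descendant, j - i)
    for i, ancestor in enumerate(chain)
    for j, descendant in enumerate(chain)
    if j >= i
]
conn.executemany("INSERT INTO closure VALUES (?, ?, ?)", rows)


def descendants(node_id):
    """Everything below node_id, nearest first. No recursion needed."""
    cur = conn.execute(
        "SELECT descendant_id FROM closure"
        " WHERE ancestor_id = ? AND depth > 0 ORDER BY depth",
        (node_id,),
    )
    return [row[0] for row in cur.fetchall()]


def ancestors(node_id):
    """Everything above node_id, nearest first."""
    cur = conn.execute(
        "SELECT ancestor_id FROM closure"
        " WHERE descendant_id = ? AND depth > 0 ORDER BY depth",
        (node_id,),
    )
    return [row[0] for row in cur.fetchall()]


def children(node_id):
    """Direct children are simply the descendants at depth 1."""
    cur = conn.execute(
        "SELECT descendant_id FROM closure WHERE ancestor_id = ? AND depth = 1",
        (node_id,),
    )
    return [row[0] for row in cur.fetchall()]
```

With these rows, `descendants(18)` walks down from the `add` method and `ancestors(23)` walks up from the binary expression, each as a single indexed filter instead of a recursive traversal.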
+
+    ### Overview of the Parsing Logic (Pseudocode)
+
+    Here’s a high-level pseudocode overview of how you might build the closure table from the AST:
+
+    ```
+    function buildClosureTable(root):
+        initialize closureTable as empty list
+
+        // ancestors is ordered nearest-first: index 0 is the direct parent
+        function traverse(node, ancestors):
+            // Record the self-relationship
+            add (node.id, node.id, 0) to closureTable
+
+            // For each ancestor, record the relationship to the current node
+            for each ancestor in ancestors with index i:
+                add (ancestor.id, node.id, i + 1) to closureTable
+
+            // Recurse over children, prepending the current node to the ancestors list
+            for each child in node.children:
+                traverse(child, [node] + ancestors)
+
+        traverse(root, empty list)
+        return closureTable
+    ```
+
+    Explanation:
+
+    - **Self-Relationship**: Each node is stored as an ancestor of itself (depth 0).
+    - **Ancestor Relationships**: For every node, we record its relationship to each of its ancestors along with the depth. Because the ancestors list is ordered nearest-first (the direct parent sits at index 0), the depth is the ancestor's index plus one.
+    - **Recursion**: The function recursively processes each child node, passing along the ancestors list with the current node prepended.
+
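The pseudocode translates almost line for line into Python. The sketch below is one possible rendering, not Code-PathFinder's actual implementation; the `Node` class and the `build_closure_table` name are assumptions introduced for the example, which runs on the `add` method subtree (Nodes 18 to 23) from the listing.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    id: int
    label: str
    children: List["Node"] = field(default_factory=list)


def build_closure_table(root: Node) -> List[Tuple[int, int, int]]:
    """Return (ancestor_id, descendant_id, depth) rows for the tree under root."""
    closure_table: List[Tuple[int, int, int]] = []

    def traverse(node: Node, ancestors: List[Node]) -> None:
        # Self-relationship at depth 0.
        closure_table.append((node.id, node.id, 0))
        # ancestors is nearest-first, so index 0 is the direct parent (depth 1).
        for i, ancestor in enumerate(ancestors):
            closure_table.append((ancestor.id, node.id, i + 1))
        # Prepend the current node before descending into its children.
        for child in node.children:
            traverse(child, [node] + ancestors)

    traverse(root, [])
    return closure_table


add_method = Node(18, "MethodDeclaration: add", [
    Node(19, "Parameter: int a"),
    Node(20, "Parameter: int b"),
    Node(21, "Block", [
        Node(22, "ReturnStatement", [
            Node(23, "BinaryExpression: a + b"),
        ]),
    ]),
])
table = build_closure_table(add_method)
```

Each node contributes one row per ancestor plus its self row, so the table grows with the depth of the tree as well as its size. For very deep ASTs, an explicit stack would avoid hitting Python's recursion limit.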
+
+    Closure tables matter because they provide a simple yet powerful solution to the challenges of managing hierarchical data. By precomputing every ancestor–descendant relationship, closure tables eliminate the need for recursive queries and make it straightforward to navigate even the most intricate code structures. While other solutions like the adjacency list, nested sets, and materialized paths each have their uses, closure tables stand out as a robust, efficient alternative—perfect for any Code Pathfinder aiming to master the maze of source code hierarchies.
+
+    Happy coding—and may your path through the code always be clear! 🎉
+
+
+    ### Closing Note
+
+    Discover [Code-PathFinder](https://github.com/shivasurya/code-pathfinder), the open-source alternative to CodeQL—a powerful tool engineered to detect security vulnerabilities. Unlike grep-based scanners such as Semgrep or ast-grep, Code-PathFinder enables fine-tuning of queries to more effectively eliminate false positives, thanks to its advanced taint analysis and source-to-sink tracing capabilities. Give it a try, and if you encounter any bugs or have suggestions, please file an issue.
+      </div>
+  </Card>
+