
This document was generated from 'src/documentation/print-core-wiki.ts' on 2025-06-21, 15:53:46 UTC presenting an overview of flowR's core (v2.2.15, using R v4.5.0). Please do not edit this file/wiki page directly.

This wiki page provides an overview of the inner workings of flowR. It is mostly intended for developers who want to extend the capabilities of flowR and assumes knowledge of TypeScript and R. If you think parts of the wiki are missing, wrong, or outdated, please do not hesitate to open a new issue! In case you are new and want to develop for flowR, please check out the relevant Setup wiki page and the Contributing Guidelines.

Note

Essentially every step we explain here can be explored directly and interactively from flowR's REPL (see the Interface wiki page). We recommend using commands like :parse or :dataflow* to explore flowR's output on your own samples. As a quick start, you may use:

$ docker run -it --rm eagleoutice/flowr # or npm run flowr 
flowR repl using flowR v2.2.15, R v4.5.0 (r-shell engine)
R> :parse "x <- 1; print(x)"
Output
exprlist
├ expr
│ ├ expr
│ │ ╰ SYMBOL "x" (1:1)
│ ├ LEFT_ASSIGN "<-" (1:3─4)
│ ╰ expr
│   ╰ NUM_CONST "1" (1:6)
├ ; ";" (1:7)
╰ expr
  ├ expr
  │ ╰ SYMBOL_FUNCTION_CALL "print" (1:9─13)
  ├ ( "(" (1:14)
  ├ expr
  │ ╰ SYMBOL "x" (1:15)
  ╰ ) ")" (1:16)

Retrieves the AST from the RShell.

If you are brave (or desperate) enough, you can also use the --verbose option to have flowR dump information about its internals (please never use this for benchmarking). See the Getting flowR to Talk section below for more information.

Pipelines and their Execution

At the core of every analysis by flowR is the PipelineExecutor class which takes a sequence of analysis steps (in the form of a Pipeline) and executes it on a given input. In general, these pipeline steps are analysis agnostic and may use arbitrary input and ordering. However, two important and predefined pipelines, the DEFAULT_DATAFLOW_PIPELINE and the TREE_SITTER_DATAFLOW_PIPELINE adequately cover the most common analysis steps (differentiated only by the Engine used).

Tip

You can hover over most links within these wiki pages to get access to the tsdoc comment of the respective element. The links should direct you to the up-to-date implementation.

Using the tree-sitter engine you can request a dataflow analysis of a sample piece of R code like the following:

const executor = new PipelineExecutor(TREE_SITTER_DATAFLOW_PIPELINE, {
	parser:  new TreeSitterExecutor(),
	request: requestFromInput('x <- 1; y <- x; print(y);')
});
const result = await executor.allRemainingSteps();

This is, roughly, what the replGetDataflow function does for the :dataflow REPL command when using the tree-sitter engine. We create a new PipelineExecutor with the TREE_SITTER_DATAFLOW_PIPELINE and then use allRemainingSteps to execute all contained steps (in general, pipelines can be executed step by step, but this is usually not required if you just want the result). requestFromInput is merely a convenience function to create a request object from a code string.

In general, however, most flowR-internal functions tasked with generating dataflow prefer createDataflowPipeline, as this function automatically selects the correct pipeline based on the engine used.
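For illustration, a hedged usage sketch (the exact signature of createDataflowPipeline and the shape of its result are assumptions based on the PipelineExecutor example above, not a verbatim API reference):

// Sketch only: createDataflowPipeline is assumed to take the parser plus the usual
// request input and to hand back a ready-to-run executor for the matching pipeline.
const executor = createDataflowPipeline(new TreeSitterExecutor(), {
	request: requestFromInput('x <- 1; y <- x; print(y);')
});
// assuming the dataflow step's result is exposed under the step name 'dataflow'
const { dataflow } = await executor.allRemainingSteps();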

Understanding Pipeline Steps

Everything that conforms to the IPipelineStep interface can be used as a step in a pipeline, with the most important definition being the processor function, which refers to the actual work performed by the step. For example, the STATIC_DATAFLOW step ultimately relies on the produceDataFlowGraph function to create a dataflow graph using the normalized AST of the program.

Shape of a Pipeline Step

Using code, you can provide an arbitrary pipeline step to the executor, as long as it implements the IPipelineStep interface:

  • IPipelineStep
    Defines what is to be known of a single step in a pipeline. It wraps around a single processor function, providing additional information. Steps will be executed synchronously, in-sequence, based on their dependencies.

    Defined at ./src/core/steps/pipeline-step.ts#L70
    /**
     * Defines what is to be known of a single step in a pipeline.
     * It wraps around a single {@link IPipelineStep#processor|processor} function, providing additional information.
     * Steps will be executed synchronously, in-sequence, based on their {@link IPipelineStep#dependencies|dependencies}.
     */
    export interface IPipelineStep<
        Name extends PipelineStepName = PipelineStepName,
        // eslint-disable-next-line -- by default, we assume nothing about the function shape
        Fn extends StepProcessingFunction = (...args: any[]) => any,
    > extends MergeableRecord, IPipelineStepOrder<Name> {
        /** Human-readable name of this step */
        readonly humanReadableName: string
        /** Human-readable description of this step */
        readonly description:       string
        /** The main processor that essentially performs the logic of this step */
        readonly processor:         (...input: Parameters<Fn>) => ReturnType<Fn>
        /** How to visualize the results of the respective step to the user? */
        readonly printer: {
            [K in StepOutputFormat]?: IPipelineStepPrinter<Fn, K, never[]>
        } & {
            // we always want to have an internal printer
            [StepOutputFormat.Internal]: InternalStepPrinter<Fn>
        }
        /**
         * Input configuration required to perform the respective steps.
         * Required inputs of dependencies do not have to, but can be repeated.
         * <p>
         * Use the pattern `undefined as unknown as T` to indicate that the value is required but not provided.
         */
        readonly requiredInput: object
    }

Every step may specify required inputs, ways of visualizing the output, and its dependencies using the IPipelineStepOrder interface. As the types may seem somewhat confusing or over-complicated, we recommend looking at some existing steps, like the PARSE_WITH_R_SHELL_STEP or the STATIC_DATAFLOW step. The pipeline executor should do a good job of scheduling these steps (usually using a topological sort), and inferring the required inputs in the type system (have a look at the createPipeline function if you want to know more).
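For orientation, here is a rough, hypothetical sketch of what a custom step could look like (imports from flowR's sources are omitted, and the exact fields required by IPipelineStepOrder as well as the processor's argument shape are assumptions made for illustration; use the existing steps as the authoritative reference):

// Hypothetical step that counts the characters of the analyzed request.
// The step name 'example-count' and the processor signature are illustrative only.
const EXAMPLE_COUNT_STEP = {
	name:              'example-count',
	humanReadableName: 'count request characters',
	description:       'Counts the characters of the analyzed request',
	dependencies:      [],
	processor:         (_results: object, input: { request?: { content?: string } }) =>
		input.request?.content?.length ?? 0,
	printer: {
		// the internal printer is always required (see the interface above)
		[StepOutputFormat.Internal]: (count: number) => `${count} character(s)`
	},
	// the `undefined as unknown as T` pattern mentioned in the tsdoc above
	requiredInput: { request: undefined as unknown as object }
} as const;

Such a step could then be combined with the existing ones via createPipeline and handed to a PipelineExecutor.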

Note

Under the hood there is a step-subtype called a decoration. Such a step can be added to a pipeline to decorate the output of another one (e.g., making it more precise, re-adding debug info, ...). To mark a step as a decoration, you can use the decorates field in the IPipelineStepOrder interface. However, as such steps are currently not relevant for any of flowR's core analyses we will not go into detail here. It suffices to know how "real" steps work.

How flowR Produces Dataflow Graphs

This section focuses on the generation of a dataflow graph from a given R program, using the RShell Engine and hence the DEFAULT_DATAFLOW_PIPELINE. The tree-sitter engine uses the TREE_SITTER_DATAFLOW_PIPELINE, which replaces the parser with the integrated tree-sitter parser and hence uses a slightly adapted normalization step to produce a similar normalized AST. The dataflow graph should be the same for both engines (although tree-sitter is faster and may be able to parse more files).

Overview

Let's have a look at the definition of the pipeline:

  • DEFAULT_DATAFLOW_PIPELINE
    The default pipeline for working with flowR, including the dataflow step. See the DEFAULT_NORMALIZE_PIPELINE for the pipeline without the dataflow step and the DEFAULT_SLICE_AND_RECONSTRUCT_PIPELINE for the pipeline with slicing and reconstructing steps

    Defined at ./src/core/steps/pipeline/default-pipelines.ts#L30
    DEFAULT_DATAFLOW_PIPELINE = createPipeline(PARSE_WITH_R_SHELL_STEP, NORMALIZE, STATIC_DATAFLOW)

We can see that it relies on three steps:

  1. PARSE_WITH_R_SHELL_STEP (parsing): Uses the RShell to parse the input program.
    Its main function linked as the processor is the parseRequests function.
  2. NORMALIZE (normalization): Normalizes the AST produced by the parser (to create a normalized AST).
    Its main function linked as the processor is the normalize function.
  3. STATIC_DATAFLOW (dataflow): Produces the actual dataflow graph from the normalized AST.
    Its main function linked as the processor is the produceDataFlowGraph function.

To explore these steps, let's use the REPL with the (very simple and contrived) R code: x <- 1; print(x).

$ docker run -it --rm eagleoutice/flowr # or npm run flowr 
flowR repl using flowR v2.2.15, R v4.5.0 (r-shell engine)
R> :parse "x <- 1; print(x)"
Output
exprlist
├ expr
│ ├ expr
│ │ ╰ SYMBOL "x" (1:1)
│ ├ LEFT_ASSIGN "<-" (1:3─4)
│ ╰ expr
│   ╰ NUM_CONST "1" (1:6)
├ ; ";" (1:7)
╰ expr
  ├ expr
  │ ╰ SYMBOL_FUNCTION_CALL "print" (1:9─13)
  ├ ( "(" (1:14)
  ├ expr
  │ ╰ SYMBOL "x" (1:15)
  ╰ ) ")" (1:16)

This shows the ASCII-Art representation of the parse-tree of the R code x <- 1; print(x), as it is provided by the RShell. See the initCommand function for more information on how we request a parse.

R> :normalize* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IFREXG4gICAgbjcoW1wiUkV4cHJlc3Npb25MaXN0ICg3KVxuIFwiXSlcbiAgICBuMihbXCJSQmluYXJ5T3AgKDIpXG4jNjA7IzQ1O1wiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMFwifCBuMlxuICAgIG4wKFtcIlJTeW1ib2wgKDApXG54XCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLWxoc1wifCBuMFxuICAgIG4xKFtcIlJOdW1iZXIgKDEpXG4xXCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLXJoc1wifCBuMVxuICAgIG42KFtcIlJGdW5jdGlvbkNhbGwgKDYpXG5wcmludFwiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMVwifCBuNlxuICAgIG4zKFtcIlJTeW1ib2wgKDMpXG5wcmludFwiXSlcbiAgICBuNiAtLT58XCJjYWxsLW5hbWVcInwgbjNcbiAgICBuNShbXCJSQXJndW1lbnQgKDUpXG54XCJdKVxuICAgIG42IC0tPnxcImNhbGwtYXJndW1lbnQtMVwifCBuNVxuICAgIG40KFtcIlJTeW1ib2wgKDQpXG54XCJdKVxuICAgIG41IC0tPnxcImFyZy12YWx1ZVwifCBuNFxuIiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19

Following the link should show the following output:

flowchart TD
    n7(["RExpressionList (7)
 "])
    n2(["RBinaryOp (2)
#60;#45;"])
    n7 -->|"expr-list-child-0"| n2
    n0(["RSymbol (0)
x"])
    n2 -->|"binop-lhs"| n0
    n1(["RNumber (1)
1"])
    n2 -->|"binop-rhs"| n1
    n6(["RFunctionCall (6)
print"])
    n7 -->|"expr-list-child-1"| n6
    n3(["RSymbol (3)
print"])
    n6 -->|"call-name"| n3
    n5(["RArgument (5)
x"])
    n6 -->|"call-argument-1"| n5
    n4(["RSymbol (4)
x"])
    n5 -->|"arg-value"| n4


(The analysis required 4.8 ms (including parsing with the r-shell engine) within the generation environment.)

R> :dataflow* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IEJUXG4gICAgMXt7XCJgIzkxO1JOdW1iZXIjOTM7IDFcbiAgICAgICgxKVxuICAgICAgKjEuNipgXCJ9fVxuICAgIDBbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICgwKVxuICAgICAgKjEuMSpgXCJdXG4gICAgMltbXCJgIzkxO1JCaW5hcnlPcCM5MzsgIzYwOyM0NTtcbiAgICAgICgyKVxuICAgICAgKjEuMS02KlxuICAgICgwLCAxKWBcIl1dXG4gICAgYnVpbHQtaW46Xy1bXCJgQnVpbHQtSW46XG4jNjA7IzQ1O2BcIl1cbiAgICBzdHlsZSBidWlsdC1pbjpfLSBzdHJva2U6Z3JheSxmaWxsOmxpZ2h0Z3JheSxzdHJva2Utd2lkdGg6MnB4LG9wYWNpdHk6Ljg7XG4gICAgNChbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICg0KVxuICAgICAgKjEuMTUqYFwiXSlcbiAgICA2W1tcImAjOTE7UkZ1bmN0aW9uQ2FsbCM5MzsgcHJpbnRcbiAgICAgICg2KVxuICAgICAgKjEuOS0xNipcbiAgICAoNClgXCJdXVxuICAgIGJ1aWx0LWluOnByaW50W1wiYEJ1aWx0LUluOlxucHJpbnRgXCJdXG4gICAgc3R5bGUgYnVpbHQtaW46cHJpbnQgc3Ryb2tlOmdyYXksZmlsbDpsaWdodGdyYXksc3Ryb2tlLXdpZHRoOjJweCxvcGFjaXR5Oi44O1xuICAgIDAgLS0+fFwiZGVmaW5lZC1ieVwifCAxXG4gICAgMCAtLT58XCJkZWZpbmVkLWJ5XCJ8IDJcbiAgICAyIC0tPnxcImFyZ3VtZW50XCJ8IDFcbiAgICAyIC0tPnxcInJldHVybnMsIGFyZ3VtZW50XCJ8IDBcbiAgICAyIC0uLT58XCJyZWFkcywgY2FsbHNcInwgYnVpbHQtaW46Xy1cbiAgICBsaW5rU3R5bGUgNCBzdHJva2U6Z3JheTtcbiAgICA0IC0tPnxcInJlYWRzXCJ8IDBcbiAgICA2IC0tPnxcInJlYWRzLCByZXR1cm5zLCBhcmd1bWVudFwifCA0XG4gICAgNiAtLi0+fFwicmVhZHMsIGNhbGxzXCJ8IGJ1aWx0LWluOnByaW50XG4gICAgbGlua1N0eWxlIDcgc3Ryb2tlOmdyYXk7IiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19

Following the link should show the following output:

flowchart LR
    1{{"`#91;RNumber#93; 1
      (1)
      *1.6*`"}}
    0["`#91;RSymbol#93; x
      (0)
      *1.1*`"]
    2[["`#91;RBinaryOp#93; #60;#45;
      (2)
      *1.1-6*
    (0, 1)`"]]
    built-in:_-["`Built-In:
#60;#45;`"]
    style built-in:_- stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
    4(["`#91;RSymbol#93; x
      (4)
      *1.15*`"])
    6[["`#91;RFunctionCall#93; print
      (6)
      *1.9-16*
    (4)`"]]
    built-in:print["`Built-In:
print`"]
    style built-in:print stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
    0 -->|"defined-by"| 1
    0 -->|"defined-by"| 2
    2 -->|"argument"| 1
    2 -->|"returns, argument"| 0
    2 -.->|"reads, calls"| built-in:_-
    linkStyle 4 stroke:gray;
    4 -->|"reads"| 0
    6 -->|"reads, returns, argument"| 4
    6 -.->|"reads, calls"| built-in:print
    linkStyle 7 stroke:gray;

(The analysis required 7.6 ms (including parse and normalize, using the r-shell engine) within the generation environment.)

Tip

All of these commands accept file paths as well, so you can write longer R code within a file, and then pass the file path prefixed with file:// (e.g., file://test/testfiles/example.R) to the commands.

Especially when you are just starting with flowR, we recommend using the REPL to explore the output of the different steps.

Note

Maybe you are left with the question: What is tree-sitter doing differently? Expand the following to get more information!

And what changes with tree-sitter?

From a user perspective, essentially nothing changes (under the hood, however, almost everything is done differently)! Have a look at the Engines wiki page for more information on the differences between the engines. Below you can see the REPL commands for the tree-sitter engine (using --default-engine to set the engine to tree-sitter):

$ docker run -it --rm eagleoutice/flowr --default-engine tree-sitter # or npm run flowr -- --default-engine tree-sitter
flowR repl using flowR v2.2.15, R grammar v14 (tree-sitter engine)
R> :parse "x <- 1; print(x)"
Output
program
├ binary_operator
│ ├ identifier "x" (1:1─2)
│ ├ <- "<-" (1:3─5)
│ ╰ float "1" (1:6─7)
╰ call
  ├ identifier "print" (1:9─14)
  ╰ arguments
    ├ ( "(" (1:14─15)
    ├ argument
    │ ╰ identifier "x" (1:15─16)
    ╰ ) ")" (1:16─17)

This shows the ASCII-Art representation of the parse-tree of the R code x <- 1; print(x), as it is provided by the TreeSitterExecutor. See the Engines wiki page for more information on the differences between the engines.

R> :normalize* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IFREXG4gICAgbjcoW1wiUkV4cHJlc3Npb25MaXN0ICg3KVxuIFwiXSlcbiAgICBuMihbXCJSQmluYXJ5T3AgKDIpXG4jNjA7IzQ1O1wiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMFwifCBuMlxuICAgIG4wKFtcIlJTeW1ib2wgKDApXG54XCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLWxoc1wifCBuMFxuICAgIG4xKFtcIlJOdW1iZXIgKDEpXG4xXCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLXJoc1wifCBuMVxuICAgIG42KFtcIlJGdW5jdGlvbkNhbGwgKDYpXG5wcmludFwiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMVwifCBuNlxuICAgIG4zKFtcIlJTeW1ib2wgKDMpXG5wcmludFwiXSlcbiAgICBuNiAtLT58XCJjYWxsLW5hbWVcInwgbjNcbiAgICBuNShbXCJSQXJndW1lbnQgKDUpXG54XCJdKVxuICAgIG42IC0tPnxcImNhbGwtYXJndW1lbnQtMVwifCBuNVxuICAgIG40KFtcIlJTeW1ib2wgKDQpXG54XCJdKVxuICAgIG41IC0tPnxcImFyZy12YWx1ZVwifCBuNFxuIiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19

Following the link should show the following output:

flowchart TD
    n7(["RExpressionList (7)
 "])
    n2(["RBinaryOp (2)
#60;#45;"])
    n7 -->|"expr-list-child-0"| n2
    n0(["RSymbol (0)
x"])
    n2 -->|"binop-lhs"| n0
    n1(["RNumber (1)
1"])
    n2 -->|"binop-rhs"| n1
    n6(["RFunctionCall (6)
print"])
    n7 -->|"expr-list-child-1"| n6
    n3(["RSymbol (3)
print"])
    n6 -->|"call-name"| n3
    n5(["RArgument (5)
x"])
    n6 -->|"call-argument-1"| n5
    n4(["RSymbol (4)
x"])
    n5 -->|"arg-value"| n4


(The analysis required 5.6 ms (including parsing with the tree-sitter engine) within the generation environment.)

R> :dataflow* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IEJUXG4gICAgMXt7XCJgIzkxO1JOdW1iZXIjOTM7IDFcbiAgICAgICgxKVxuICAgICAgKjEuNipgXCJ9fVxuICAgIDBbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICgwKVxuICAgICAgKjEuMSpgXCJdXG4gICAgMltbXCJgIzkxO1JCaW5hcnlPcCM5MzsgIzYwOyM0NTtcbiAgICAgICgyKVxuICAgICAgKjEuMS02KlxuICAgICgwLCAxKWBcIl1dXG4gICAgYnVpbHQtaW46Xy1bXCJgQnVpbHQtSW46XG4jNjA7IzQ1O2BcIl1cbiAgICBzdHlsZSBidWlsdC1pbjpfLSBzdHJva2U6Z3JheSxmaWxsOmxpZ2h0Z3JheSxzdHJva2Utd2lkdGg6MnB4LG9wYWNpdHk6Ljg7XG4gICAgNChbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICg0KVxuICAgICAgKjEuMTUqYFwiXSlcbiAgICA2W1tcImAjOTE7UkZ1bmN0aW9uQ2FsbCM5MzsgcHJpbnRcbiAgICAgICg2KVxuICAgICAgKjEuOS0xNipcbiAgICAoNClgXCJdXVxuICAgIGJ1aWx0LWluOnByaW50W1wiYEJ1aWx0LUluOlxucHJpbnRgXCJdXG4gICAgc3R5bGUgYnVpbHQtaW46cHJpbnQgc3Ryb2tlOmdyYXksZmlsbDpsaWdodGdyYXksc3Ryb2tlLXdpZHRoOjJweCxvcGFjaXR5Oi44O1xuICAgIDAgLS0+fFwiZGVmaW5lZC1ieVwifCAxXG4gICAgMCAtLT58XCJkZWZpbmVkLWJ5XCJ8IDJcbiAgICAyIC0tPnxcImFyZ3VtZW50XCJ8IDFcbiAgICAyIC0tPnxcInJldHVybnMsIGFyZ3VtZW50XCJ8IDBcbiAgICAyIC0uLT58XCJyZWFkcywgY2FsbHNcInwgYnVpbHQtaW46Xy1cbiAgICBsaW5rU3R5bGUgNCBzdHJva2U6Z3JheTtcbiAgICA0IC0tPnxcInJlYWRzXCJ8IDBcbiAgICA2IC0tPnxcInJlYWRzLCByZXR1cm5zLCBhcmd1bWVudFwifCA0XG4gICAgNiAtLi0+fFwicmVhZHMsIGNhbGxzXCJ8IGJ1aWx0LWluOnByaW50XG4gICAgbGlua1N0eWxlIDcgc3Ryb2tlOmdyYXk7IiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19

Following the link should show the following output:

flowchart LR
    1{{"`#91;RNumber#93; 1
      (1)
      *1.6*`"}}
    0["`#91;RSymbol#93; x
      (0)
      *1.1*`"]
    2[["`#91;RBinaryOp#93; #60;#45;
      (2)
      *1.1-6*
    (0, 1)`"]]
    built-in:_-["`Built-In:
#60;#45;`"]
    style built-in:_- stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
    4(["`#91;RSymbol#93; x
      (4)
      *1.15*`"])
    6[["`#91;RFunctionCall#93; print
      (6)
      *1.9-16*
    (4)`"]]
    built-in:print["`Built-In:
print`"]
    style built-in:print stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
    0 -->|"defined-by"| 1
    0 -->|"defined-by"| 2
    2 -->|"argument"| 1
    2 -->|"returns, argument"| 0
    2 -.->|"reads, calls"| built-in:_-
    linkStyle 4 stroke:gray;
    4 -->|"reads"| 0
    6 -->|"reads, returns, argument"| 4
    6 -.->|"reads, calls"| built-in:print
    linkStyle 7 stroke:gray;

(The analysis required 1.4 ms (including parse and normalize, using the tree-sitter engine) within the generation environment.)

Parsing

The parsing step uses the RShell to parse the input program (or, of course, the TreeSitterExecutor when using the tree-sitter engine). To speed up the process, we use the initCommand function to compile the parsing function and rely on a custom serialization, which outputs the information in a CSV-like format. This means that the :parse command actually kind of lies to you: it pretty-prints the serialized version, which really looks more like the following (this uses the retrieveParseDataFromRCode function with the sample code x <- 1; print(x)):

Raw parse output for x <- 1; print(x)

For the code x <- 1; print(x):

[1,1,1,6,7,0,"expr",false,"x <- 1"],[1,1,1,1,1,3,"SYMBOL",true,"x"],[1,1,1,1,3,7,"expr",false,"x"],[1,3,1,4,2,7,"LEFT_ASSIGN",true,"<-"],[1,6,1,6,4,5,"NUM_CONST",true,"1"],[1,6,1,6,5,7,"expr",false,"1"],[1,7,1,7,6,0,"';'",true,";"],[1,9,1,16,19,0,"expr",false,"print(x)"],[1,9,1,13,10,12,"SYMBOL_FUNCTION_CALL",true,"print"],[1,9,1,13,12,19,"expr",false,"print"],[1,14,1,14,11,19,"'('",true,"("],[1,15,1,15,13,15,"SYMBOL",true,"x"],[1,15,1,15,15,19,"expr",false,"x"],[1,16,1,16,14,19,"')'",true,")"]

Beautiful, right? I thought so too! In fact, the output is a little nicer when we put it into a table format and add the appropriate headers:

Parse output in table format

For the code x <- 1; print(x):

line-start | col-start | line-end | col-end | id | parent | token type | terminal | text
1 | 1 | 1 | 6 | 7 | 0 | expr | false | x <- 1
1 | 1 | 1 | 1 | 1 | 3 | SYMBOL | true | x
1 | 1 | 1 | 1 | 3 | 7 | expr | false | x
1 | 3 | 1 | 4 | 2 | 7 | LEFT_ASSIGN | true | <-
1 | 6 | 1 | 6 | 4 | 5 | NUM_CONST | true | 1
1 | 6 | 1 | 6 | 5 | 7 | expr | false | 1
1 | 7 | 1 | 7 | 6 | 0 | ';' | true | ;
1 | 9 | 1 | 16 | 19 | 0 | expr | false | print(x)
1 | 9 | 1 | 13 | 10 | 12 | SYMBOL_FUNCTION_CALL | true | print
1 | 9 | 1 | 13 | 12 | 19 | expr | false | print
1 | 14 | 1 | 14 | 11 | 19 | '(' | true | (
1 | 15 | 1 | 15 | 13 | 15 | SYMBOL | true | x
1 | 15 | 1 | 15 | 15 | 19 | expr | false | x
1 | 16 | 1 | 16 | 14 | 19 | ')' | true | )

In fact, this data is merely what R's base::parse and utils::getParseData functions provide. We then use this data in the normalization step to create a normalized AST.

If you are interested in the raw token types that we may encounter, have a look at the RawRType enum.

Normalization

The normalization function normalize takes the output from the previous steps and uses the prepareParsedData and convertPreparedParsedData functions to first transform the serialized parsing output to an object. Next, normalizeRootObjToAst transforms this object to a normalized AST and decorateAst adds additional information to the AST (like roles, ids, depth, etc.). While looking at the mermaid visualization of such an AST is nice and usually sufficient, looking at the objects themselves shows you the full range of information the AST provides (all encompassed within the RNode type).

Let's have a look at the normalized AST for the sample code x <- 1; print(x) (please refer to the normalized AST wiki page for more information):

Normalized AST for x <- 1; print(x)
{
    "type": "RExpressionList",
    "children": [
        {
            "type": "RBinaryOp",
            "location": [
                1,
                3,
                1,
                4
            ],
            "lhs": {
                "type": "RSymbol",
                "location": [
                    1,
                    1,
                    1,
                    1
                ],
                "content": "x",
                "lexeme": "x",
                "info": {
                    "fullRange": [
                        1,
                        1,
                        1,
                        1
                    ],
                    "additionalTokens": [],
                    "id": 0,
                    "parent": 2,
                    "role": "binop-lhs",
                    "index": 0,
                    "nesting": 0
                }
            },
            "rhs": {
                "location": [
                    1,
                    6,
                    1,
                    6
                ],
                "lexeme": "1",
                "info": {
                    "fullRange": [
                        1,
                        6,
                        1,
                        6
                    ],
                    "additionalTokens": [],
                    "id": 1,
                    "parent": 2,
                    "role": "binop-rhs",
                    "index": 1,
                    "nesting": 0
                },
                "type": "RNumber",
                "content": {
                    "num": 1,
                    "complexNumber": false,
                    "markedAsInt": false
                }
            },
            "operator": "<-",
            "lexeme": "<-",
            "info": {
                "fullRange": [
                    1,
                    1,
                    1,
                    6
                ],
                "additionalTokens": [],
                "id": 2,
                "parent": 7,
                "nesting": 0,
                "index": 0,
                "role": "expr-list-child"
            }
        },
        {
            "type": "RFunctionCall",
            "named": true,
            "location": [
                1,
                9,
                1,
                13
            ],
            "lexeme": "print",
            "functionName": {
                "type": "RSymbol",
                "location": [
                    1,
                    9,
                    1,
                    13
                ],
                "content": "print",
                "lexeme": "print",
                "info": {
                    "fullRange": [
                        1,
                        9,
                        1,
                        16
                    ],
                    "additionalTokens": [],
                    "id": 3,
                    "parent": 6,
                    "role": "call-name",
                    "index": 0,
                    "nesting": 0
                }
            },
            "arguments": [
                {
                    "type": "RArgument",
                    "location": [
                        1,
                        15,
                        1,
                        15
                    ],
                    "lexeme": "x",
                    "value": {
                        "type": "RSymbol",
                        "location": [
                            1,
                            15,
                            1,
                            15
                        ],
                        "content": "x",
                        "lexeme": "x",
                        "info": {
                            "fullRange": [
                                1,
                                15,
                                1,
                                15
                            ],
                            "additionalTokens": [],
                            "id": 4,
                            "parent": 5,
                            "role": "arg-value",
                            "index": 0,
                            "nesting": 0
                        }
                    },
                    "info": {
                        "fullRange": [
                            1,
                            15,
                            1,
                            15
                        ],
                        "additionalTokens": [],
                        "id": 5,
                        "parent": 6,
                        "nesting": 0,
                        "index": 1,
                        "role": "call-argument"
                    }
                }
            ],
            "info": {
                "fullRange": [
                    1,
                    9,
                    1,
                    16
                ],
                "additionalTokens": [],
                "id": 6,
                "parent": 7,
                "nesting": 0,
                "index": 1,
                "role": "expr-list-child"
            }
        }
    ],
    "info": {
        "additionalTokens": [],
        "id": 7,
        "nesting": 0,
        "role": "root",
        "index": 0
    }
}

This is… a lot! We get the type from the RType enum, the lexeme, location information, an id, the children of the node, and their parents. While the normalized AST wiki page provides you with information on how to interpret this data, we will focus on how we get it from the table provided by the parsing step.

There are two important functions: normalizeRootObjToAst, which operates on the parse output already transformed into a tree-like structure, and decorateAst, which adds additional information to the AST. Both follow a fold pattern. The fold is explicit for decorateAst, which directly relies on the foldAstStateful function, while normalizeRootObjToAst uses the fold idiom but deviates in cases in which (for example) we require more information on other nodes to know what a node should be normalized to.
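If the fold idiom is new to you, the following self-contained sketch (generic TypeScript, not flowR's actual foldAstStateful signature) shows the basic idea: a "backpack" state is refined on the way down, and child results are combined on the way up:

interface TreeNode<T> { value: T; children: TreeNode<T>[] }

// Generic stateful fold over a tree: `down` refines the backpack per node,
// `combine` merges the child results with the node itself.
function foldStateful<T, State, Result>(
	node: TreeNode<T>,
	state: State,
	down: (node: TreeNode<T>, state: State) => State,
	combine: (node: TreeNode<T>, childResults: Result[], state: State) => Result
): Result {
	const next = down(node, state);
	const childResults = node.children.map(c => foldStateful(c, next, down, combine));
	return combine(node, childResults, next);
}

// Example: count all nodes while tracking the nesting depth in the backpack.
const tree: TreeNode<string> = { value: 'root', children: [{ value: 'leaf', children: [] }] };
const nodeCount = foldStateful(tree, 0,
	(_node, depth) => depth + 1,
	(_node, children, _depth) => 1 + children.reduce((a, b) => a + b, 0));
console.log(nodeCount); // 2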

Normalizing the Object

We have a handler for everything: for example, tryNormalizeIfThen or tryNormalizeFor handle if(x) y or for(i in 1:10) x constructs. All of these handlers contain many sanity checks to ensure that we talk to an RShell whose output we can handle (as assumptions may break with newer versions). These functions carry the keyword try as they may fail, for example, whenever they notice late into normalization that the construct should actually be something different (R is great). For single nodes, we use normalizeSingleNode, which contains a catch-all for some edge cases in the R grammar.

The output of just this pass is listed below (using the normalizeButNotDecorated function):

Ast for x <- 1; print(x) after the first normalization
{
    "type": "RExpressionList",
    "children": [
        {
            "type": "RBinaryOp",
            "location": [
                1,
                3,
                1,
                4
            ],
            "lhs": {
                "type": "RSymbol",
                "location": [
                    1,
                    1,
                    1,
                    1
                ],
                "content": "x",
                "lexeme": "x",
                "info": {
                    "fullRange": [
                        1,
                        1,
                        1,
                        1
                    ],
                    "additionalTokens": []
                }
            },
            "rhs": {
                "location": [
                    1,
                    6,
                    1,
                    6
                ],
                "lexeme": "1",
                "info": {
                    "fullRange": [
                        1,
                        6,
                        1,
                        6
                    ],
                    "additionalTokens": []
                },
                "type": "RNumber",
                "content": {
                    "num": 1,
                    "complexNumber": false,
                    "markedAsInt": false
                }
            },
            "operator": "<-",
            "lexeme": "<-",
            "info": {
                "fullRange": [
                    1,
                    1,
                    1,
                    6
                ],
                "additionalTokens": []
            }
        },
        {
            "type": "RFunctionCall",
            "named": true,
            "location": [
                1,
                9,
                1,
                13
            ],
            "lexeme": "print",
            "functionName": {
                "type": "RSymbol",
                "location": [
                    1,
                    9,
                    1,
                    13
                ],
                "content": "print",
                "lexeme": "print",
                "info": {
                    "fullRange": [
                        1,
                        9,
                        1,
                        16
                    ],
                    "additionalTokens": []
                }
            },
            "arguments": [
                {
                    "type": "RArgument",
                    "location": [
                        1,
                        15,
                        1,
                        15
                    ],
                    "lexeme": "x",
                    "value": {
                        "type": "RSymbol",
                        "location": [
                            1,
                            15,
                            1,
                            15
                        ],
                        "content": "x",
                        "lexeme": "x",
                        "info": {
                            "fullRange": [
                                1,
                                15,
                                1,
                                15
                            ],
                            "additionalTokens": []
                        }
                    },
                    "info": {
                        "fullRange": [
                            1,
                            15,
                            1,
                            15
                        ],
                        "additionalTokens": []
                    }
                }
            ],
            "info": {
                "fullRange": [
                    1,
                    9,
                    1,
                    16
                ],
                "additionalTokens": []
            }
        }
    ],
    "info": {
        "additionalTokens": []
    }
}

Decorating the AST

The decoration is comparatively trivial. We take the AST, throw it into the decorateAst function (which, again, handles each normalized node type), and get:

  1. The AST with ids, roles, and depth information (see the normalized AST wiki page for more information).
  2. A mapping of ids to nodes in the form of an AstIdMap object. This allows us to quickly access nodes by their id.

The ids used for the AST generation are arbitrary (usually created by the deterministicCountingIdGenerator function) but unique and intentionally separated from the ids used by the R parser. For one, this detaches us from the Engine used, and for another, it allows for much easier extension of the AST (e.g., when R files use base::source to include other R files). All ids conform to the NodeId type.
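A minimal sketch of the idea behind such an id generator (illustration only; the real deterministicCountingIdGenerator in flowR's sources may additionally support a start offset or produce string ids):

// Hand out fresh, deterministic ids by simply counting upwards.
function countingIdGenerator(start = 0): () => number {
	let next = start;
	return () => next++;
}

const nextId = countingIdGenerator();
console.log(nextId(), nextId(), nextId()); // 0 1 2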

Dataflow Graph Generation

The core of the dataflow graph generation works as a "stateful fold", which uses the tree-like structure of the AST to combine the dataflow information of the children, while tracking the currently active variables and control flow information as a “backpack” (state). We use the produceDataFlowGraph function as an entry point to the dataflow generation (the actual fold entry is in processDataflowFor). The function is mainly backed by its processors object which maps each type in the normalized AST to an appropriate handler ("fold-function").

To understand these handlers, let's start with the simplest one: processUninterestingLeaf signals that we do not care about this node and just produces empty dataflow information (using initializeCleanDataflowInformation). Looking at the function showcases the general structure of a processor:
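Paraphrased (this is a sketch of what the processor boils down to, not the verbatim source; in particular, the exact arguments of initializeCleanDataflowInformation are an assumption):

export function processUninterestingLeaf<OtherInfo>(leaf: RNodeWithParent, data: DataflowProcessorInformation<OtherInfo>): DataflowInformation {
	// ignore the node itself and hand back clean/empty dataflow information,
	// presumably seeded with the node's id as entry point and the backpack `data`
	return initializeCleanDataflowInformation(leaf.info.id, data);
}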

Every processor has the same shape. It takes the normalized node (see the normalized AST wiki page for more information) and a DataflowProcessorInformation object which, as some kind of "backpack", carries global information to every handler. This information is used to create a DataflowInformation:

  • DataflowInformation
    The dataflow information is one of the fundamental structures we have in the dataflow analysis. It is continuously updated during the dataflow analysis and holds its current state for the respective subtree processed. Each processor during the dataflow analysis may use the information from its children to produce a new state of the dataflow information. You may initialize a new dataflow information with initializeCleanDataflowInformation .

    Defined at ./src/dataflow/info.ts#L89
    /**
     * The dataflow information is one of the fundamental structures we have in the dataflow analysis.
     * It is continuously updated during the dataflow analysis
     * and holds its current state for the respective subtree processed.
     * Each processor during the dataflow analysis may use the information from its children
     * to produce a new state of the dataflow information.
     *
     * You may initialize a new dataflow information with {@link initializeCleanDataflowInformation}.
     *
     * @see {@link DataflowCfgInformation} - the control flow aspects
     */
    export interface DataflowInformation extends DataflowCfgInformation {
        /**
         * References that have not been identified as read or write and will be so on higher processors.
         *
         * For example, when we analyze the `x` vertex in `x <- 3`, we will first create an unknown reference for `x`
         * as we have not yet seen the assignment!
         *
         * @see {@link IdentifierReference} - a reference on a variable, parameter, function call, ...
         */
        unknownReferences: readonly IdentifierReference[]
        /**
         * References which are read within the current subtree.
         *
         * @see {@link IdentifierReference} - a reference on a variable, parameter, function call, ...
         * */
        in:                readonly IdentifierReference[]
        /**
         * References which are written to within the current subtree
         *
         * @see {@link IdentifierReference} - a reference on a variable, parameter, function call, ...
         */
        out:               readonly IdentifierReference[]
        /** Current environments used for name resolution, probably updated on the next expression-list processing */
        environment:       REnvironmentInformation
        /** The current constructed dataflow graph */
        graph:             DataflowGraph
    }
    View more (DataflowCfgInformation)
    • DataflowCfgInformation
      The control flow information for the current DataflowInformation.

      Defined at ./src/dataflow/info.ts#L71
      /** The control flow information for the current DataflowInformation. */
      export interface DataflowCfgInformation {
          /** The entry node into the subgraph */
          entryPoint: NodeId,
          /** All already identified exit points (active 'return'/'break'/'next'-likes) of the respective structure. */
          exitPoints: readonly ExitPoint[]
      }

Essentially, these processors should use the dataflow information of their children, combined with their own semantics, to produce new dataflow information to pass upwards in the fold. As the interface above shows, the DataflowInformation contains the unknownReferences, in, and out reference sets, the environment used for name resolution, the current dataflow graph, and the control flow information (entryPoint and exitPoints).

While all of them are essentially empty when processing an “uninteresting leaf”, handling a constant is slightly more interesting with processValue:

  • processValue

    Defined at ./src/dataflow/internal/process/process-value.ts#L8
    export function processValue<OtherInfo>({ info: { id } }: RNodeWithParent, data: DataflowProcessorInformation<OtherInfo>): DataflowInformation {
        return {
            unknownReferences: [],
            in:                [{ nodeId: id, name: undefined, controlDependencies: data.controlDependencies, type: ReferenceType.Constant }],
            out:               [],
            environment:       data.environment,
            graph:             new DataflowGraph(data.completeAst.idMap).addVertex({
                tag: VertexType.Value,
                id:  id,
                cds: data.controlDependencies
            }),
            exitPoints: [{ nodeId: id, type: ExitPointType.Default, controlDependencies: data.controlDependencies }],
            entryPoint: id
        };
    }

Please note that we add the value vertex to the newly created dataflow graph, which holds a reference to the constant. If you are confused by the use of the ParentInformation type, it stems from the AST decoration and signals that we have a decorated RNode (which may carry additional information in OtherInfo).

Yet again, this is not very interesting. When looking at the processors object, you may be confused by the many lines that just map a node type to the processAsNamedCall function. This is because, during the dataflow analysis, we actually "desugar" the AST and treat syntax constructs like binary operators (e.g., x + y) as function calls (e.g., `+`(x, y)). We do this because R does it the same way and even allows these operators (including if, <-, etc.) to be overwritten by their name. By treating them as function calls, just like R does, we get support for these overwrites for free, courtesy of flowR's call resolution.
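The effect of this desugaring can be illustrated with a small, self-contained sketch (this is not flowR's processors object, just the idea behind routing operators through the same handler as ordinary calls):

type SketchNode =
	| { type: 'value'; value: string }
	| { type: 'binop'; operator: string; lhs: SketchNode; rhs: SketchNode }
	| { type: 'call'; name: string; args: SketchNode[] };

// Both calls and operators end up in the same "named call" handler.
function processNamedCall(name: string, args: SketchNode[]): string {
	return `${name}(${args.map(process).join(', ')})`;
}

function process(node: SketchNode): string {
	switch(node.type) {
		case 'value': return node.value;
		case 'call':  return processNamedCall(node.name, node.args);
		// desugar: the operator becomes the call name, lhs/rhs become the arguments
		case 'binop': return processNamedCall(node.operator, [node.lhs, node.rhs]);
	}
}

console.log(process({ type: 'binop', operator: '+', lhs: { type: 'value', value: 'x' }, rhs: { type: 'value', value: 'y' } })); // +(x, y)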

But where are all the interesting things handled then? For that, we want to have a look at the built-in environment, which can be freely configured using flowR's configuration system. FlowR's heart and soul resides in the DefaultBuiltinConfig object, which is used to configure the built-in environment by mapping function names to BuiltInProcessorMapper functions. There you can find functions like processAccess which handles the (subset) access to a variable, or processForLoop which handles the primitive for loop construct (whenever it is not overwritten).

Just as an example, we want to have a look at the processRepeatLoop function, as it is one of the simplest built-in processors we have:

  • processRepeatLoop

    Defined at ./src/dataflow/internal/process/functions/call/built-in/built-in-repeat-loop.ts#L19
    export function processRepeatLoop<OtherInfo>(
        name: RSymbol<OtherInfo & ParentInformation>,
        args: readonly RFunctionArgument<OtherInfo & ParentInformation>[],
        rootId: NodeId,
        data: DataflowProcessorInformation<OtherInfo & ParentInformation>
    ): DataflowInformation {
        if(args.length !== 1 || args[0] === EmptyArgument) {
            dataflowLogger.warn(`Repeat-Loop ${name.content} does not have 1 argument, skipping`);
            return processKnownFunctionCall({ name, args, rootId, data, origin: 'default' }).information;
        }
    
        const unpacked = unpackArgument(args[0]);
        const { information, processedArguments } = processKnownFunctionCall({
            name,
            args:      unpacked ? [unpacked] : args,
            rootId,
            data,
            patchData: (d, i) => {
                if(i === 0) {
                    return { ...d, controlDependencies: [...d.controlDependencies ?? [], { id: name.info.id }] };
                }
                return d;
            },
            markAsNSE: [0],
            origin:    'builtin:repeat-loop'
        });
    
        const body = processedArguments[0];
        guard(body !== undefined, () => `Repeat-Loop ${name.content} has no body, impossible!`);
    
        linkCircularRedefinitionsWithinALoop(information.graph, produceNameSharedIdMap(findNonLocalReads(information.graph, [])), body.out);
    
        information.exitPoints = filterOutLoopExitPoints(information.exitPoints);
    
        return information;
    }

Similar to any other built-in processor, we get the name of the function call which caused us to land here, as well as the passed arguments. The rootId refers to what caused the call to happen (and is usually just the function call), while data is our good old backpack, carrying all the information we need to produce a dataflow graph.

After a couple of common sanity checks at the beginning, which ensure the repeat loop is used in the way we expect, we continue the fold by processing its arguments. Given the construct repeat <body>, we expect only a single argument. During that processing we make sure to stitch in the correct control dependencies, adding the repeat loop to the mix. For just the repeat loop this stitching is actually not necessary, but this way the handling is consistent for all looping constructs.

Afterward, we take the processedArguments, perform another round of sanity checks and then use two special functions to apply the semantic effects of the repeat loop. We first use one of flowR's linkers to linkCircularRedefinitionsWithinALoop and then retrieve the active exit points with filterOutLoopExitPoints.

Feel free to have a look around and explore the other handlers for now. Each of them uses the results of its children alongside the active backpack to produce a new dataflow information.

Beyond the Dataflow Graph

Given the dataflow graph, you can do a lot more! You can issue queries to explore the graph, search for specific elements, or, for example, request a static backward slice. Of course, all of these endeavors work not just with the RShell but also with the tree-sitter engine.

Static Backward Slicing

The slicing is available as an extra step, as you can see by inspecting the DEFAULT_SLICING_PIPELINE. Besides STATIC_SLICE, it contains a NAIVE_RECONSTRUCT step to print the slice as (executable) R code.

Your main point of interest here is the staticSlicing function, which relies on a modified breadth-first search to collect all nodes which are part of the slice. For more information on how the slicing works, please refer to the tool demonstration (Section 3.2) or the original master's thesis (Chapter 4).
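The traversal idea can be sketched in a few lines (a drastic simplification of staticSlicing, which additionally handles environments, function calls, and thresholds):

// Backward slicing as a worklist/BFS: start at the slicing criteria and follow
// dataflow dependencies backwards, collecting every reached node into the slice.
type Id = string;

function backwardSlice(dependsOn: ReadonlyMap<Id, readonly Id[]>, criteria: readonly Id[]): Set<Id> {
	const inSlice = new Set<Id>(criteria);
	const queue: Id[] = [...criteria];
	while(queue.length > 0) {
		const current = queue.shift() as Id;
		for(const dependency of dependsOn.get(current) ?? []) {
			if(!inSlice.has(dependency)) {
				inSlice.add(dependency);
				queue.push(dependency);
			}
		}
	}
	return inSlice;
}

// 'product' depends on its initial definition and the loop, which depends on 'N'.
const deps = new Map<Id, Id[]>([['product', ['product-init', 'loop']], ['loop', ['N']]]);
console.log(backwardSlice(deps, ['product'])); // Set { 'product', 'product-init', 'loop', 'N' }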

You can explore the slicing using the REPL with the :slicer command:

$ docker run -it --rm eagleoutice/flowr # or npm run flowr 
flowR repl using flowR v2.2.15, R v4.5.0 (r-shell engine)
R> :slicer test/testfiles/example.R --criterion "12@product"
Output
product <- 1
N <- 10
for(i in 1:(N-1)) product <- product * i
product

Slice of the example file for the variable "product" in line 12.

Helpful Things

Getting flowR to Talk

When using flowR from the CLI, you can use the --verbose option to get more information about what flowR is doing. While coding, however, you can set the minimum level of logs to be displayed (this works with the FlowrLogger abstraction). In general, you can configure the levels of individual logs, such as the general log (obtained with getActiveLog) or the parseLog. Please note that flowR makes no guarantees that log outputs are stable across versions, and it is up to the implementors to provide sensible logging. If you are an implementor and want to add logging, please make sure there are no larger runtime implications when logging is disabled. Have a look at the expensiveTrace function, for example, which uses a function to generate the log message only when the log level is reached.
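The idea behind expensiveTrace can be sketched as follows (a generic illustration, not flowR's FlowrLogger API):

// Lazy log-message construction: the (potentially expensive) message is only
// computed when the requested log level is actually enabled.
enum LogLevel { Silent = 0, Error = 1, Info = 2, Trace = 3 }

let minLevel = LogLevel.Error;

function expensiveTraceSketch(level: LogLevel, message: () => string): void {
	if(level <= minLevel) {  // when tracing is disabled, `message` is never evaluated
		console.log(message());
	}
}

expensiveTraceSketch(LogLevel.Trace, () => `dataflow graph: ${JSON.stringify({ some: 'huge object' })}`); // skipped
minLevel = LogLevel.Trace;
expensiveTraceSketch(LogLevel.Trace, () => 'now the message is built and printed');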
