Core
This document was generated from 'src/documentation/print-core-wiki.ts' on 2025-06-21, 15:53:46 UTC, presenting an overview of flowR's core (v2.2.15, using R v4.5.0). Please do not edit this file/wiki page directly.
This wiki page provides an overview of the inner workings of flowR. It is mostly intended for developers who want to extend the capabilities of flowR and assumes knowledge of TypeScript and R. If you think parts of the wiki are missing, wrong, or outdated, please do not hesitate to open a new issue! In case you are new and want to develop for flowR, please check out the relevant Setup wiki page and the Contributing Guidelines.
Note
Essentially every step we explain here can be explored directly from flowR's REPL in an interactive fashion (see the Interface wiki page).
We recommend using commands like :parse or :dataflow* to explore flowR's output on your own samples.
As a quickstart you may use:
$ docker run -it --rm eagleoutice/flowr # or npm run flowr
flowR repl using flowR v2.2.15, R v4.5.0 (r-shell engine)
R> :parse "x <- 1; print(x)"
Output
exprlist
├ expr
│ ├ expr
│ │ ╰ SYMBOL "x" (1:1)
│ ├ LEFT_ASSIGN "<-" (1:3─4)
│ ╰ expr
│ ╰ NUM_CONST "1" (1:6)
├ ; ";" (1:7)
╰ expr
├ expr
│ ╰ SYMBOL_FUNCTION_CALL "print" (1:9─13)
├ ( "(" (1:14)
├ expr
│ ╰ SYMBOL "x" (1:15)
╰ ) ")" (1:16)
Retrieves the AST from the RShell.
If you are brave (or desperate) enough, you can also try the --verbose option to be flooded with information about flowR's internals (please, never use this for benchmarking).
See the Getting flowR to Talk section below for more information.
- Pipelines and their Execution
- How flowR Produces Dataflow Graphs
- Beyond the Dataflow Graph
- Getting flowR to Talk
At the core of every analysis by flowR is the PipelineExecutor class, which takes a sequence of analysis steps (in the form of a Pipeline) and executes it on a given input. In general, these pipeline steps are analysis-agnostic and may use arbitrary input and ordering. However, two important and predefined pipelines, the DEFAULT_DATAFLOW_PIPELINE and the TREE_SITTER_DATAFLOW_PIPELINE, adequately cover the most common analysis steps (differentiated only by the Engine used).
Tip
You can hover over most links within these wiki pages to get access to the tsdoc comment of the respective element. The links should direct you to the up-to-date implementation.
Using the tree-sitter
engine you can request a dataflow analysis of a sample piece of R code like the following:
const executor = new PipelineExecutor(TREE_SITTER_DATAFLOW_PIPELINE, {
parser: new TreeSitterExecutor(),
request: requestFromInput('x <- 1; y <- x; print(y);')
});
const result = await executor.allRemainingSteps();
This is, roughly, what the replGetDataflow function does for the :dataflow REPL command when using the tree-sitter engine.
We create a new PipelineExecutor with the TREE_SITTER_DATAFLOW_PIPELINE and then use allRemainingSteps to execute all contained steps (in general, pipelines can be executed step-by-step, but this is usually not required if you just want the result).
requestFromInput is merely a convenience function to create a request object from a code string.
Most flowR-internal functions tasked with generating dataflow, however, prefer createDataflowPipeline, as this function automatically selects the correct pipeline based on the engine used.
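A hedged sketch of that route (the exact signature is best taken from the function's tsdoc):

// createDataflowPipeline picks DEFAULT_DATAFLOW_PIPELINE or TREE_SITTER_DATAFLOW_PIPELINE
// based on the parser it is handed
const executor = createDataflowPipeline(new TreeSitterExecutor(), {
    request: requestFromInput('x <- 1; print(x)')
});
const result = await executor.allRemainingSteps();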
Everything that complies with the IPipelineStep interface can be used as a step in a pipeline, with the most important definition being the processor function, which refers to the actual work performed by the step.
For example, the STATIC_DATAFLOW
step ultimately relies on the produceDataFlowGraph
function to create a dataflow graph
using the normalized AST of the program.
Using code, you can provide an arbitrary pipeline step to the executor, as long as it implements the IPipelineStep
interface:
-
IPipelineStep
Defines what is to be known of a single step in a pipeline. It wraps around a single processor function, providing additional information. Steps will be executed synchronously, in-sequence, based on their dependencies. Defined at ./src/core/steps/pipeline-step.ts#L70
/**
 * Defines what is to be known of a single step in a pipeline.
 * It wraps around a single {@link IPipelineStep#processor|processor} function, providing additional information.
 * Steps will be executed synchronously, in-sequence, based on their {@link IPipelineStep#dependencies|dependencies}.
 */
export interface IPipelineStep<
    Name extends PipelineStepName = PipelineStepName,
    // eslint-disable-next-line -- by default, we assume nothing about the function shape
    Fn extends StepProcessingFunction = (...args: any[]) => any,
> extends MergeableRecord, IPipelineStepOrder<Name> {
    /** Human-readable name of this step */
    readonly humanReadableName: string
    /** Human-readable description of this step */
    readonly description: string
    /** The main processor that essentially performs the logic of this step */
    readonly processor: (...input: Parameters<Fn>) => ReturnType<Fn>
    /** How to visualize the results of the respective step to the user? */
    readonly printer: {
        [K in StepOutputFormat]?: IPipelineStepPrinter<Fn, K, never[]>
    } & {
        // we always want to have an internal printer
        [StepOutputFormat.Internal]: InternalStepPrinter<Fn>
    }
    /**
     * Input configuration required to perform the respective steps.
     * Required inputs of dependencies do not have to, but can be repeated.
     * <p>
     * Use the pattern `undefined as unknown as T` to indicate that the value is required but not provided.
     */
    readonly requiredInput: object
}
Every step may specify required inputs, ways of visualizing the output, and its dependencies using the IPipelineStepOrder interface.
As the types may seem somewhat confusing or over-complicated, we recommend looking at some existing steps, like
the PARSE_WITH_R_SHELL_STEP
or the STATIC_DATAFLOW
step.
The pipeline executor should do a good job of scheduling these steps (usually using a topological sort) and inferring the required inputs in the type system (have a look at the createPipeline function if you want to know more).
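As a hedged sketch of such a custom step (the step name, the dependency identifier, and the processor's exact signature are illustrative; compare with the real step definitions linked above before relying on them):

// Illustrative custom step satisfying the IPipelineStep shape shown above:
// it runs after the dataflow step and simply passes its result through, logging on the way
const LOG_DATAFLOW_STEP = {
    name:              'log-dataflow',
    humanReadableName: 'log dataflow result',
    description:       'Logs the result of the dataflow step and passes it through',
    dependencies:      ['dataflow'],   // schedule after the STATIC_DATAFLOW step ('dataflow')
    processor:         (results: { dataflow: unknown }) => {
        console.log('dataflow step finished');
        return results.dataflow;
    },
    printer: {
        // the internal printer is always required (cf. StepOutputFormat.Internal above)
        [StepOutputFormat.Internal]: (result: unknown) => JSON.stringify(result)
    },
    requiredInput: {}
} as const;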
Note
Under the hood there is a step-subtype called a decoration. Such a step can be added to a pipeline to decorate the output of another one (e.g., making it more precise, re-adding debug info, ...).
To mark a step as a decoration, you can use the decorates
field in the IPipelineStepOrder
interface.
However, as such steps are currently not relevant for any of flowR's core analyses, we will not go into detail here. It suffices to know how "real" steps work.
This section focuses on the generation of a dataflow graph from a given R program, using the RShell engine and hence the DEFAULT_DATAFLOW_PIPELINE. The tree-sitter engine uses the TREE_SITTER_DATAFLOW_PIPELINE, which replaces the parser with the integrated tree-sitter parser and hence uses a slightly adapted normalization step to produce a similar normalized AST.
The dataflow graph should be the same for both engines (although tree-sitter is faster and may be able to parse more files).
Let's have a look at the definition of the pipeline:
-
DEFAULT_DATAFLOW_PIPELINE
The default pipeline for working with flowR, including the dataflow step. See the DEFAULT_NORMALIZE_PIPELINE for the pipeline without the dataflow step and the DEFAULT_SLICE_AND_RECONSTRUCT_PIPELINE for the pipeline with slicing and reconstructing steps. Defined at ./src/core/steps/pipeline/default-pipelines.ts#L30
DEFAULT_DATAFLOW_PIPELINE = createPipeline(PARSE_WITH_R_SHELL_STEP, NORMALIZE, STATIC_DATAFLOW)
We can see that it relies on three steps:
- PARSE_WITH_R_SHELL_STEP (parsing): Uses the RShell to parse the input program. Its main function, linked as the processor, is the parseRequests function.
- NORMALIZE (normalization): Normalizes the AST produced by the parser (to create a normalized AST). Its main function, linked as the processor, is the normalize function.
- STATIC_DATAFLOW (dataflow): Produces the actual dataflow graph from the normalized AST. Its main function, linked as the processor, is the produceDataFlowGraph function.
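For comparison with the tree-sitter example above, here is a hedged sketch of running this pipeline programmatically with the RShell engine (assuming an RShell instance can be constructed without arguments):

const executor = new PipelineExecutor(DEFAULT_DATAFLOW_PIPELINE, {
    parser:  new RShell(),
    request: requestFromInput('x <- 1; print(x)')
});
const result = await executor.allRemainingSteps();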
To explore these steps, let's use the REPL with the (very simple and contrived) R code x <- 1; print(x).
$ docker run -it --rm eagleoutice/flowr # or npm run flowr
flowR repl using flowR v2.2.15, R v4.5.0 (r-shell engine)
R> :parse "x <- 1; print(x)"
Output
exprlist
├ expr
│ ├ expr
│ │ ╰ SYMBOL "x" (1:1)
│ ├ LEFT_ASSIGN "<-" (1:3─4)
│ ╰ expr
│ ╰ NUM_CONST "1" (1:6)
├ ; ";" (1:7)
╰ expr
├ expr
│ ╰ SYMBOL_FUNCTION_CALL "print" (1:9─13)
├ ( "(" (1:14)
├ expr
│ ╰ SYMBOL "x" (1:15)
╰ ) ")" (1:16)
This shows the ASCII-Art representation of the parse-tree of the R code x <- 1; print(x), as it is provided by the RShell. See the initCommand function for more information on how we request a parse.
R> :normalize* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IFREXG4gICAgbjcoW1wiUkV4cHJlc3Npb25MaXN0ICg3KVxuIFwiXSlcbiAgICBuMihbXCJSQmluYXJ5T3AgKDIpXG4jNjA7IzQ1O1wiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMFwifCBuMlxuICAgIG4wKFtcIlJTeW1ib2wgKDApXG54XCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLWxoc1wifCBuMFxuICAgIG4xKFtcIlJOdW1iZXIgKDEpXG4xXCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLXJoc1wifCBuMVxuICAgIG42KFtcIlJGdW5jdGlvbkNhbGwgKDYpXG5wcmludFwiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMVwifCBuNlxuICAgIG4zKFtcIlJTeW1ib2wgKDMpXG5wcmludFwiXSlcbiAgICBuNiAtLT58XCJjYWxsLW5hbWVcInwgbjNcbiAgICBuNShbXCJSQXJndW1lbnQgKDUpXG54XCJdKVxuICAgIG42IC0tPnxcImNhbGwtYXJndW1lbnQtMVwifCBuNVxuICAgIG40KFtcIlJTeW1ib2wgKDQpXG54XCJdKVxuICAgIG41IC0tPnxcImFyZy12YWx1ZVwifCBuNFxuIiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19
Following the link should show the following:
flowchart TD
n7(["RExpressionList (7)
"])
n2(["RBinaryOp (2)
#60;#45;"])
n7 -->|"expr-list-child-0"| n2
n0(["RSymbol (0)
x"])
n2 -->|"binop-lhs"| n0
n1(["RNumber (1)
1"])
n2 -->|"binop-rhs"| n1
n6(["RFunctionCall (6)
print"])
n7 -->|"expr-list-child-1"| n6
n3(["RSymbol (3)
print"])
n6 -->|"call-name"| n3
n5(["RArgument (5)
x"])
n6 -->|"call-argument-1"| n5
n4(["RSymbol (4)
x"])
n5 -->|"arg-value"| n4
(The analysis required 4.8 ms (including parsing with the r-shell engine) within the generation environment.)
R> :dataflow* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IEJUXG4gICAgMXt7XCJgIzkxO1JOdW1iZXIjOTM7IDFcbiAgICAgICgxKVxuICAgICAgKjEuNipgXCJ9fVxuICAgIDBbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICgwKVxuICAgICAgKjEuMSpgXCJdXG4gICAgMltbXCJgIzkxO1JCaW5hcnlPcCM5MzsgIzYwOyM0NTtcbiAgICAgICgyKVxuICAgICAgKjEuMS02KlxuICAgICgwLCAxKWBcIl1dXG4gICAgYnVpbHQtaW46Xy1bXCJgQnVpbHQtSW46XG4jNjA7IzQ1O2BcIl1cbiAgICBzdHlsZSBidWlsdC1pbjpfLSBzdHJva2U6Z3JheSxmaWxsOmxpZ2h0Z3JheSxzdHJva2Utd2lkdGg6MnB4LG9wYWNpdHk6Ljg7XG4gICAgNChbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICg0KVxuICAgICAgKjEuMTUqYFwiXSlcbiAgICA2W1tcImAjOTE7UkZ1bmN0aW9uQ2FsbCM5MzsgcHJpbnRcbiAgICAgICg2KVxuICAgICAgKjEuOS0xNipcbiAgICAoNClgXCJdXVxuICAgIGJ1aWx0LWluOnByaW50W1wiYEJ1aWx0LUluOlxucHJpbnRgXCJdXG4gICAgc3R5bGUgYnVpbHQtaW46cHJpbnQgc3Ryb2tlOmdyYXksZmlsbDpsaWdodGdyYXksc3Ryb2tlLXdpZHRoOjJweCxvcGFjaXR5Oi44O1xuICAgIDAgLS0+fFwiZGVmaW5lZC1ieVwifCAxXG4gICAgMCAtLT58XCJkZWZpbmVkLWJ5XCJ8IDJcbiAgICAyIC0tPnxcImFyZ3VtZW50XCJ8IDFcbiAgICAyIC0tPnxcInJldHVybnMsIGFyZ3VtZW50XCJ8IDBcbiAgICAyIC0uLT58XCJyZWFkcywgY2FsbHNcInwgYnVpbHQtaW46Xy1cbiAgICBsaW5rU3R5bGUgNCBzdHJva2U6Z3JheTtcbiAgICA0IC0tPnxcInJlYWRzXCJ8IDBcbiAgICA2IC0tPnxcInJlYWRzLCByZXR1cm5zLCBhcmd1bWVudFwifCA0XG4gICAgNiAtLi0+fFwicmVhZHMsIGNhbGxzXCJ8IGJ1aWx0LWluOnByaW50XG4gICAgbGlua1N0eWxlIDcgc3Ryb2tlOmdyYXk7IiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19
Following the link should show the following:
flowchart LR
1{{"`#91;RNumber#93; 1
(1)
*1.6*`"}}
0["`#91;RSymbol#93; x
(0)
*1.1*`"]
2[["`#91;RBinaryOp#93; #60;#45;
(2)
*1.1-6*
(0, 1)`"]]
built-in:_-["`Built-In:
#60;#45;`"]
style built-in:_- stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
4(["`#91;RSymbol#93; x
(4)
*1.15*`"])
6[["`#91;RFunctionCall#93; print
(6)
*1.9-16*
(4)`"]]
built-in:print["`Built-In:
print`"]
style built-in:print stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
0 -->|"defined-by"| 1
0 -->|"defined-by"| 2
2 -->|"argument"| 1
2 -->|"returns, argument"| 0
2 -.->|"reads, calls"| built-in:_-
linkStyle 4 stroke:gray;
4 -->|"reads"| 0
6 -->|"reads, returns, argument"| 4
6 -.->|"reads, calls"| built-in:print
linkStyle 7 stroke:gray;
(The analysis required 7.6 ms (including parse and normalize, using the r-shell engine) within the generation environment.)
Tip
All of these commands accept file paths as well, so you can write longer R code within a file and then pass the file path prefixed with file:// (e.g., file://test/testfiles/example.R) to the commands.
Especially when you are just starting with flowR, we recommend using the REPL to explore the output of the different steps.
Note
Maybe you are left with the question: What is tree-sitter doing differently? Expand the following to get more information!
And what changes with tree-sitter?
Essentially not much (from a user perspective; under the hood, however, it does almost everything differently)! Have a look at the Engines wiki page for more information on the differences between the engines.
Below you can see the REPL commands for the tree-sitter engine (using --default-engine to set the engine to tree-sitter):
$ docker run -it --rm eagleoutice/flowr --default-engine tree-sitter # or npm run flowr -- --default-engine tree-sitter
flowR repl using flowR v2.2.15, R grammar v14 (tree-sitter engine)
R> :parse "x <- 1; print(x)"
Output
program
├ binary_operator
│ ├ identifier "x" (1:1─2)
│ ├ <- "<-" (1:3─5)
│ ╰ float "1" (1:6─7)
╰ call
├ identifier "print" (1:9─14)
╰ arguments
├ ( "(" (1:14─15)
├ argument
│ ╰ identifier "x" (1:15─16)
╰ ) ")" (1:16─17)
This shows the ASCII-Art representation of the parse-tree of the R code x <- 1; print(x), as it is provided by the TreeSitterExecutor. See the Engines wiki page for more information on the differences between the engines.
R> :normalize* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IFREXG4gICAgbjcoW1wiUkV4cHJlc3Npb25MaXN0ICg3KVxuIFwiXSlcbiAgICBuMihbXCJSQmluYXJ5T3AgKDIpXG4jNjA7IzQ1O1wiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMFwifCBuMlxuICAgIG4wKFtcIlJTeW1ib2wgKDApXG54XCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLWxoc1wifCBuMFxuICAgIG4xKFtcIlJOdW1iZXIgKDEpXG4xXCJdKVxuICAgIG4yIC0tPnxcImJpbm9wLXJoc1wifCBuMVxuICAgIG42KFtcIlJGdW5jdGlvbkNhbGwgKDYpXG5wcmludFwiXSlcbiAgICBuNyAtLT58XCJleHByLWxpc3QtY2hpbGQtMVwifCBuNlxuICAgIG4zKFtcIlJTeW1ib2wgKDMpXG5wcmludFwiXSlcbiAgICBuNiAtLT58XCJjYWxsLW5hbWVcInwgbjNcbiAgICBuNShbXCJSQXJndW1lbnQgKDUpXG54XCJdKVxuICAgIG42IC0tPnxcImNhbGwtYXJndW1lbnQtMVwifCBuNVxuICAgIG40KFtcIlJTeW1ib2wgKDQpXG54XCJdKVxuICAgIG41IC0tPnxcImFyZy12YWx1ZVwifCBuNFxuIiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19
Following the link should show the following:
flowchart TD
n7(["RExpressionList (7)
"])
n2(["RBinaryOp (2)
#60;#45;"])
n7 -->|"expr-list-child-0"| n2
n0(["RSymbol (0)
x"])
n2 -->|"binop-lhs"| n0
n1(["RNumber (1)
1"])
n2 -->|"binop-rhs"| n1
n6(["RFunctionCall (6)
print"])
n7 -->|"expr-list-child-1"| n6
n3(["RSymbol (3)
print"])
n6 -->|"call-name"| n3
n5(["RArgument (5)
x"])
n6 -->|"call-argument-1"| n5
n4(["RSymbol (4)
x"])
n5 -->|"arg-value"| n4
(The analysis required 5.6 ms (including parsing with the tree-sitter engine) within the generation environment.)
R> :dataflow* "x <- 1; print(x)"
Output
https://mermaid.live/view#base64:eyJjb2RlIjoiZmxvd2NoYXJ0IEJUXG4gICAgMXt7XCJgIzkxO1JOdW1iZXIjOTM7IDFcbiAgICAgICgxKVxuICAgICAgKjEuNipgXCJ9fVxuICAgIDBbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICgwKVxuICAgICAgKjEuMSpgXCJdXG4gICAgMltbXCJgIzkxO1JCaW5hcnlPcCM5MzsgIzYwOyM0NTtcbiAgICAgICgyKVxuICAgICAgKjEuMS02KlxuICAgICgwLCAxKWBcIl1dXG4gICAgYnVpbHQtaW46Xy1bXCJgQnVpbHQtSW46XG4jNjA7IzQ1O2BcIl1cbiAgICBzdHlsZSBidWlsdC1pbjpfLSBzdHJva2U6Z3JheSxmaWxsOmxpZ2h0Z3JheSxzdHJva2Utd2lkdGg6MnB4LG9wYWNpdHk6Ljg7XG4gICAgNChbXCJgIzkxO1JTeW1ib2wjOTM7IHhcbiAgICAgICg0KVxuICAgICAgKjEuMTUqYFwiXSlcbiAgICA2W1tcImAjOTE7UkZ1bmN0aW9uQ2FsbCM5MzsgcHJpbnRcbiAgICAgICg2KVxuICAgICAgKjEuOS0xNipcbiAgICAoNClgXCJdXVxuICAgIGJ1aWx0LWluOnByaW50W1wiYEJ1aWx0LUluOlxucHJpbnRgXCJdXG4gICAgc3R5bGUgYnVpbHQtaW46cHJpbnQgc3Ryb2tlOmdyYXksZmlsbDpsaWdodGdyYXksc3Ryb2tlLXdpZHRoOjJweCxvcGFjaXR5Oi44O1xuICAgIDAgLS0+fFwiZGVmaW5lZC1ieVwifCAxXG4gICAgMCAtLT58XCJkZWZpbmVkLWJ5XCJ8IDJcbiAgICAyIC0tPnxcImFyZ3VtZW50XCJ8IDFcbiAgICAyIC0tPnxcInJldHVybnMsIGFyZ3VtZW50XCJ8IDBcbiAgICAyIC0uLT58XCJyZWFkcywgY2FsbHNcInwgYnVpbHQtaW46Xy1cbiAgICBsaW5rU3R5bGUgNCBzdHJva2U6Z3JheTtcbiAgICA0IC0tPnxcInJlYWRzXCJ8IDBcbiAgICA2IC0tPnxcInJlYWRzLCByZXR1cm5zLCBhcmd1bWVudFwifCA0XG4gICAgNiAtLi0+fFwicmVhZHMsIGNhbGxzXCJ8IGJ1aWx0LWluOnByaW50XG4gICAgbGlua1N0eWxlIDcgc3Ryb2tlOmdyYXk7IiwibWVybWFpZCI6eyJhdXRvU3luYyI6dHJ1ZX19
Following the link should show the following:
flowchart LR
1{{"`#91;RNumber#93; 1
(1)
*1.6*`"}}
0["`#91;RSymbol#93; x
(0)
*1.1*`"]
2[["`#91;RBinaryOp#93; #60;#45;
(2)
*1.1-6*
(0, 1)`"]]
built-in:_-["`Built-In:
#60;#45;`"]
style built-in:_- stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
4(["`#91;RSymbol#93; x
(4)
*1.15*`"])
6[["`#91;RFunctionCall#93; print
(6)
*1.9-16*
(4)`"]]
built-in:print["`Built-In:
print`"]
style built-in:print stroke:gray,fill:lightgray,stroke-width:2px,opacity:.8;
0 -->|"defined-by"| 1
0 -->|"defined-by"| 2
2 -->|"argument"| 1
2 -->|"returns, argument"| 0
2 -.->|"reads, calls"| built-in:_-
linkStyle 4 stroke:gray;
4 -->|"reads"| 0
6 -->|"reads, returns, argument"| 4
6 -.->|"reads, calls"| built-in:print
linkStyle 7 stroke:gray;
(The analysis required 1.4 ms (including parse and normalize, using the tree-sitter engine) within the generation environment.)
The parsing step uses the RShell to parse the input program (or, of course, the TreeSitterExecutor when using the tree-sitter engine).
To speed up the process, we use the initCommand function to compile the parsing function and rely on a custom serialization which outputs the information in a CSV-like format.
This means that the :parse command actually kind-of lies to you: it pretty-prints the serialized version, which looks more like the following (this uses the retrieveParseDataFromRCode function with the sample code x <- 1; print(x)):
Raw parse output for x <- 1; print(x)
For the code x <- 1; print(x):
[1,1,1,6,7,0,"expr",false,"x <- 1"],[1,1,1,1,1,3,"SYMBOL",true,"x"],[1,1,1,1,3,7,"expr",false,"x"],[1,3,1,4,2,7,"LEFT_ASSIGN",true,"<-"],[1,6,1,6,4,5,"NUM_CONST",true,"1"],[1,6,1,6,5,7,"expr",false,"1"],[1,7,1,7,6,0,"';'",true,";"],[1,9,1,16,19,0,"expr",false,"print(x)"],[1,9,1,13,10,12,"SYMBOL_FUNCTION_CALL",true,"print"],[1,9,1,13,12,19,"expr",false,"print"],[1,14,1,14,11,19,"'('",true,"("],[1,15,1,15,13,15,"SYMBOL",true,"x"],[1,15,1,15,15,19,"expr",false,"x"],[1,16,1,16,14,19,"')'",true,")"]
Beautiful, right? I thought so too! In fact, the output is a little nicer when we put it into a table format and add the appropriate headers:
Parse output in table format
For the code x <- 1; print(x):
| line-start | col-start | line-end | col-end | id | parent | token type | terminal | text |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 6 | 7 | 0 | expr | false | x <- 1 |
| 1 | 1 | 1 | 1 | 1 | 3 | SYMBOL | true | x |
| 1 | 1 | 1 | 1 | 3 | 7 | expr | false | x |
| 1 | 3 | 1 | 4 | 2 | 7 | LEFT_ASSIGN | true | <- |
| 1 | 6 | 1 | 6 | 4 | 5 | NUM_CONST | true | 1 |
| 1 | 6 | 1 | 6 | 5 | 7 | expr | false | 1 |
| 1 | 7 | 1 | 7 | 6 | 0 | ';' | true | ; |
| 1 | 9 | 1 | 16 | 19 | 0 | expr | false | print(x) |
| 1 | 9 | 1 | 13 | 10 | 12 | SYMBOL_FUNCTION_CALL | true | print |
| 1 | 9 | 1 | 13 | 12 | 19 | expr | false | print |
| 1 | 14 | 1 | 14 | 11 | 19 | '(' | true | ( |
| 1 | 15 | 1 | 15 | 13 | 15 | SYMBOL | true | x |
| 1 | 15 | 1 | 15 | 15 | 19 | expr | false | x |
| 1 | 16 | 1 | 16 | 14 | 19 | ')' | true | ) |
In fact, this data is merely what R's base::parse
and utils::getParseData
functions provide.
We then use this data in the normalization step to create a normalized AST.
If you are interested in the raw token types that we may encounter, have a look at the RawRType
enum.
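To make the column semantics explicit, here is a hypothetical TypeScript shape for one row of this parse data (field names are illustrative; flowR's own types live in the normalization sources):

// One row of the CSV-like parse data, mirroring the table columns above
interface ParseDataRow {
    lineStart: number
    colStart:  number
    lineEnd:   number
    colEnd:    number
    id:        number  // id assigned by the R parser
    parent:    number  // id of the parent entry (0 marks a top-level entry)
    token:     string  // raw token type, cf. the RawRType enum
    terminal:  boolean // true for leaf tokens, false for 'expr' nodes
    text:      string  // the lexeme
}

// e.g., the first row of the table above:
const firstRow: ParseDataRow = { lineStart: 1, colStart: 1, lineEnd: 1, colEnd: 6, id: 7, parent: 0, token: 'expr', terminal: false, text: 'x <- 1' };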
The normalization function normalize takes the output from the previous step and uses the prepareParsedData and convertPreparedParsedData functions to first transform the serialized parsing output into an object.
Next, normalizeRootObjToAst transforms this object into a normalized AST, and decorateAst adds additional information to the AST (like roles, ids, depth, etc.).
While looking at the mermaid visualization of such an AST is nice and usually sufficient, looking at the objects themselves shows you the full range of information the AST provides (all encompassed within the RNode
type).
Let's have a look at the normalized AST for the sample code x <- 1; print(x)
(please refer to the normalized AST wiki page for more information):
Normalized AST for x <- 1; print(x)
{
"type": "RExpressionList",
"children": [
{
"type": "RBinaryOp",
"location": [
1,
3,
1,
4
],
"lhs": {
"type": "RSymbol",
"location": [
1,
1,
1,
1
],
"content": "x",
"lexeme": "x",
"info": {
"fullRange": [
1,
1,
1,
1
],
"additionalTokens": [],
"id": 0,
"parent": 2,
"role": "binop-lhs",
"index": 0,
"nesting": 0
}
},
"rhs": {
"location": [
1,
6,
1,
6
],
"lexeme": "1",
"info": {
"fullRange": [
1,
6,
1,
6
],
"additionalTokens": [],
"id": 1,
"parent": 2,
"role": "binop-rhs",
"index": 1,
"nesting": 0
},
"type": "RNumber",
"content": {
"num": 1,
"complexNumber": false,
"markedAsInt": false
}
},
"operator": "<-",
"lexeme": "<-",
"info": {
"fullRange": [
1,
1,
1,
6
],
"additionalTokens": [],
"id": 2,
"parent": 7,
"nesting": 0,
"index": 0,
"role": "expr-list-child"
}
},
{
"type": "RFunctionCall",
"named": true,
"location": [
1,
9,
1,
13
],
"lexeme": "print",
"functionName": {
"type": "RSymbol",
"location": [
1,
9,
1,
13
],
"content": "print",
"lexeme": "print",
"info": {
"fullRange": [
1,
9,
1,
16
],
"additionalTokens": [],
"id": 3,
"parent": 6,
"role": "call-name",
"index": 0,
"nesting": 0
}
},
"arguments": [
{
"type": "RArgument",
"location": [
1,
15,
1,
15
],
"lexeme": "x",
"value": {
"type": "RSymbol",
"location": [
1,
15,
1,
15
],
"content": "x",
"lexeme": "x",
"info": {
"fullRange": [
1,
15,
1,
15
],
"additionalTokens": [],
"id": 4,
"parent": 5,
"role": "arg-value",
"index": 0,
"nesting": 0
}
},
"info": {
"fullRange": [
1,
15,
1,
15
],
"additionalTokens": [],
"id": 5,
"parent": 6,
"nesting": 0,
"index": 1,
"role": "call-argument"
}
}
],
"info": {
"fullRange": [
1,
9,
1,
16
],
"additionalTokens": [],
"id": 6,
"parent": 7,
"nesting": 0,
"index": 1,
"role": "expr-list-child"
}
}
],
"info": {
"additionalTokens": [],
"id": 7,
"nesting": 0,
"role": "root",
"index": 0
}
}
This is… a lot! We get the type from the RType
enum, the lexeme, location information, an id, the children of the node, and their parents.
While the normalized AST wiki page provides you with information on how to interpret this data, we will focus on how we get it from the
table provided by the parsing step.
There are two important functions: normalizeRootObjToAst, which operates on the parse output already transformed into a tree-like structure, and decorateAst, which adds additional information to the AST.
Both follow a fold pattern.
The fold is explicit for decorateAst, which directly relies on the foldAstStateful function, while normalizeRootObjToAst uses the fold idiom but deviates in cases in which (for example) we require more information on other nodes to know what a node should be normalized to.
We have a handler for everything: for example, tryNormalizeIfThen and tryNormalizeFor handle if(x) y and for(i in 1:10) x constructs.
All of these handlers contain many sanity checks to be sure that we talk to an RShell which we can handle (as assumptions may break with newer versions).
These functions carry the keyword try in their names as they may fail, for example, whenever they notice late into normalization that they should actually be a different construct (R is great).
For single nodes, we use normalizeSingleNode, which contains a catch-all for some edge-cases in the R grammar.
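The "try" idiom can be pictured like this (a hypothetical sketch with made-up types, not flowR's actual handler signature):

// A handler returns undefined to signal that the node turned out to be
// a different construct and another handler must take over
interface ParsedObj { token: string; children: ParsedObj[] }
interface RIfThenElseSketch { type: 'RIfThenElse' }

function tryNormalizeIfThenSketch(obj: ParsedObj): RIfThenElseSketch | undefined {
    if (obj.children[0]?.token !== 'IF') {
        return undefined; // bail out: not an if-construct after all
    }
    // ... many more sanity checks on the remaining children would follow here ...
    return { type: 'RIfThenElse' };
}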
The output of just this pass is listed below (using the normalizeButNotDecorated
function):
AST for x <- 1; print(x) after the first normalization
{
"type": "RExpressionList",
"children": [
{
"type": "RBinaryOp",
"location": [
1,
3,
1,
4
],
"lhs": {
"type": "RSymbol",
"location": [
1,
1,
1,
1
],
"content": "x",
"lexeme": "x",
"info": {
"fullRange": [
1,
1,
1,
1
],
"additionalTokens": []
}
},
"rhs": {
"location": [
1,
6,
1,
6
],
"lexeme": "1",
"info": {
"fullRange": [
1,
6,
1,
6
],
"additionalTokens": []
},
"type": "RNumber",
"content": {
"num": 1,
"complexNumber": false,
"markedAsInt": false
}
},
"operator": "<-",
"lexeme": "<-",
"info": {
"fullRange": [
1,
1,
1,
6
],
"additionalTokens": []
}
},
{
"type": "RFunctionCall",
"named": true,
"location": [
1,
9,
1,
13
],
"lexeme": "print",
"functionName": {
"type": "RSymbol",
"location": [
1,
9,
1,
13
],
"content": "print",
"lexeme": "print",
"info": {
"fullRange": [
1,
9,
1,
16
],
"additionalTokens": []
}
},
"arguments": [
{
"type": "RArgument",
"location": [
1,
15,
1,
15
],
"lexeme": "x",
"value": {
"type": "RSymbol",
"location": [
1,
15,
1,
15
],
"content": "x",
"lexeme": "x",
"info": {
"fullRange": [
1,
15,
1,
15
],
"additionalTokens": []
}
},
"info": {
"fullRange": [
1,
15,
1,
15
],
"additionalTokens": []
}
}
],
"info": {
"fullRange": [
1,
9,
1,
16
],
"additionalTokens": []
}
}
],
"info": {
"additionalTokens": []
}
}
The decoration is comparatively trivial: we take the AST, throw it into the decorateAst function (which, again, handles each normalized node type), and get:
- The AST with ids, roles, and depth information (see the normalized AST wiki page for more information).
- A mapping of ids to nodes in the form of an AstIdMap object, which allows us to quickly access nodes by their id.
The ids used for the AST generation are arbitrary (usually created by the deterministicCountingIdGenerator function), but unique and intentionally separated from the ids used by the R parser. For one, this detaches us from the Engine used; secondly, it allows for much easier extension of the AST (e.g., when R files use base::source to include other R files).
All ids conform to the NodeId type.
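As a hedged sketch, reusing the executor result from the PipelineExecutor example earlier on this page (assuming the normalize step's result exposes the decorated AST and its AstIdMap as ast and idMap; check the step's tsdoc to be sure):

// after: const result = await executor.allRemainingSteps();
const normalized = result.normalize;     // result of the NORMALIZE step
const root = normalized.ast;             // decorated root node (role 'root', id 7 above)
const symbolX = normalized.idMap.get(0); // node with NodeId 0: the RSymbol 'x'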
The core of the dataflow graph generation works as a "stateful fold",
which uses the tree-like structure of the AST to combine the dataflow information of the children, while tracking the currently active variables and control flow
information as a “backpack” (state).
We use the produceDataFlowGraph function as the entry point to the dataflow generation (the actual fold entry is in processDataflowFor).
The function is mainly backed by its processors object, which maps each type in the normalized AST to an appropriate handler ("fold-function").
To understand these handlers, let's start with the simplest one: processUninterestingLeaf signals that we do not care about this node and just produces empty dataflow information (using initializeCleanDataflowInformation).
Looking at the function showcases the general structure of a processor:
-
Defined at ./src/dataflow/internal/process/process-uninteresting-leaf.ts#L5
export function processUninterestingLeaf<OtherInfo>(leaf: RNodeWithParent, info: DataflowProcessorInformation<OtherInfo>): DataflowInformation {
    return initializeCleanDataflowInformation(leaf.info.id, info);
}
Every processor has the same shape: it takes the normalized node (see the normalized AST wiki page for more information) and a DataflowProcessorInformation object which, as some kind of "backpack", carries global information to every handler.
This information is used to create a DataflowInformation:
-
DataflowInformation
The dataflow information is one of the fundamental structures we have in the dataflow analysis. It is continuously updated during the dataflow analysis and holds its current state for the respective subtree processed. Each processor during the dataflow analysis may use the information from its children to produce a new state of the dataflow information. You may initialize a new dataflow information with initializeCleanDataflowInformation. Defined at ./src/dataflow/info.ts#L89
/**
 * The dataflow information is one of the fundamental structures we have in the dataflow analysis.
 * It is continuously updated during the dataflow analysis
 * and holds its current state for the respective subtree processed.
 * Each processor during the dataflow analysis may use the information from its children
 * to produce a new state of the dataflow information.
 *
 * You may initialize a new dataflow information with {@link initializeCleanDataflowInformation}.
 *
 * @see {@link DataflowCfgInformation} - the control flow aspects
 */
export interface DataflowInformation extends DataflowCfgInformation {
    /**
     * References that have not been identified as read or write and will be so on higher processors.
     *
     * For example, when we analyze the `x` vertex in `x <- 3`, we will first create an unknown reference for `x`
     * as we have not yet seen the assignment!
     *
     * @see {@link IdentifierReference} - a reference on a variable, parameter, function call, ...
     */
    unknownReferences: readonly IdentifierReference[]
    /**
     * References which are read within the current subtree.
     *
     * @see {@link IdentifierReference} - a reference on a variable, parameter, function call, ...
     */
    in: readonly IdentifierReference[]
    /**
     * References which are written to within the current subtree.
     *
     * @see {@link IdentifierReference} - a reference on a variable, parameter, function call, ...
     */
    out: readonly IdentifierReference[]
    /** Current environments used for name resolution, probably updated on the next expression-list processing */
    environment: REnvironmentInformation
    /** The current constructed dataflow graph */
    graph: DataflowGraph
}
View more (DataflowCfgInformation)
-
DataflowCfgInformation
The control flow information for the current DataflowInformation. Defined at ./src/dataflow/info.ts#L71
/** The control flow information for the current DataflowInformation. */
export interface DataflowCfgInformation {
    /** The entry node into the subgraph */
    entryPoint: NodeId,
    /** All already identified exit points (active 'return'/'break'/'next'-likes) of the respective structure. */
    exitPoints: readonly ExitPoint[]
}
Essentially, these processors should use the dataflow information from their children, combined with their own semantics, to produce a new dataflow information to pass upwards in the fold. The DataflowInformation contains:

- the DataflowGraph of the current subtree,
- the currently active REnvironmentInformation as an abstraction of all active definitions linking to potential definition locations (see Advanced R::Environments),
- control flow information in DataflowCfgInformation, which is used to enrich the dataflow information with control flow information,
- and the sets of currently ingoing (read), outgoing (write), and unknown IdentifierReferences.
While all of them are essentially empty when processing an "uninteresting leaf", handling a constant is slightly more interesting with processValue:
-
Defined at ./src/dataflow/internal/process/process-value.ts#L8
export function processValue<OtherInfo>({ info: { id } }: RNodeWithParent, data: DataflowProcessorInformation<OtherInfo>): DataflowInformation {
    return {
        unknownReferences: [],
        in:                [{ nodeId: id, name: undefined, controlDependencies: data.controlDependencies, type: ReferenceType.Constant }],
        out:               [],
        environment:       data.environment,
        graph:             new DataflowGraph(data.completeAst.idMap).addVertex({ tag: VertexType.Value, id: id, cds: data.controlDependencies }),
        exitPoints:        [{ nodeId: id, type: ExitPointType.Default, controlDependencies: data.controlDependencies }],
        entryPoint:        id
    };
}
Please note that we add the value vertex to the newly created dataflow graph, which holds a reference to the constant. If you are confused by the use of the ParentInformation type, this stems from the AST decoration and signals that we have a decorated RNode (which may have additional information in OtherInfo).
Yet again, this is not very interesting. When looking at the processors object, you may be confused by the many lines that just map a node type to the processAsNamedCall function.
This is because, during the dataflow analysis, we actually "desugar" the AST and treat syntax constructs like binary operators (e.g., x + y) as function calls (e.g., `+`(x, y)).
We do this because R does it the same way and even allows these operators (including if, <-, etc.) to be overwritten by name.
By treating them like R does, as function calls, we get support for these overwrites for free, courtesy of flowR's call resolution.
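As a small, self-contained illustration of this desugaring idea (simplified shapes, not flowR's actual types):

// Simplified illustration of desugaring: a binary operator becomes a named call,
// so the regular call resolution also covers user overwrites of `+`, `<-`, `if`, ...
interface NamedCall { functionName: string; args: readonly string[] }

function desugarBinaryOp(operator: string, lhs: string, rhs: string): NamedCall {
    // `x + y` is treated as the call `+`(x, y)
    return { functionName: operator, args: [lhs, rhs] };
}

const plusCall   = desugarBinaryOp('+', 'x', 'y');  // `+`(x, y)
const assignCall = desugarBinaryOp('<-', 'x', '1'); // `<-`(x, 1)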
But where are all the interesting things handled, then?
For that, we want to have a look at the built-in environment, which can be freely configured using flowR's configuration system.
FlowR's heart and soul resides in the DefaultBuiltinConfig object, which is used to configure the built-in environment by mapping function names to BuiltInProcessorMapper functions.
There you can find functions like processAccess, which handles the (subset) access to a variable, or processForLoop, which handles the primitive for-loop construct (whenever it is not overwritten).
Just as an example, we want to have a look at the processRepeatLoop function, as it is one of the simplest built-in processors we have:
-
Defined at ./src/dataflow/internal/process/functions/call/built-in/built-in-repeat-loop.ts#L19
export function processRepeatLoop<OtherInfo>(
    name: RSymbol<OtherInfo & ParentInformation>,
    args: readonly RFunctionArgument<OtherInfo & ParentInformation>[],
    rootId: NodeId,
    data: DataflowProcessorInformation<OtherInfo & ParentInformation>
): DataflowInformation {
    if(args.length !== 1 || args[0] === EmptyArgument) {
        dataflowLogger.warn(`Repeat-Loop ${name.content} does not have 1 argument, skipping`);
        return processKnownFunctionCall({ name, args, rootId, data, origin: 'default' }).information;
    }

    const unpacked = unpackArgument(args[0]);
    const { information, processedArguments } = processKnownFunctionCall({
        name,
        args: unpacked ? [unpacked] : args,
        rootId,
        data,
        patchData: (d, i) => {
            if(i === 0) {
                return { ...d, controlDependencies: [...d.controlDependencies ?? [], { id: name.info.id }] };
            }
            return d;
        },
        markAsNSE: [0],
        origin:    'builtin:repeat-loop'
    });

    const body = processedArguments[0];
    guard(body !== undefined, () => `Repeat-Loop ${name.content} has no body, impossible!`);

    linkCircularRedefinitionsWithinALoop(information.graph, produceNameSharedIdMap(findNonLocalReads(information.graph, [])), body.out);
    information.exitPoints = filterOutLoopExitPoints(information.exitPoints);
    return information;
}
Similar to any other built-in processor, we get the name of the function call which caused us to land here,
as well as the passed arguments. The rootId
refers to what caused the call to happen (and is usually just the function call),
while data
is our good old backpack, carrying all the information we need to produce a dataflow graph.
After a couple of common sanity checks at the beginning, which ensure that the repeat loop is used in a way that we expect, we issue the fold continuation by processing the loop's arguments. Given we expect repeat <body>, we expect only a single argument.
During the processing we make sure to stitch in the correct control dependencies, adding the repeat loop to the mix.
For just the repeat loop this stitching is actually not necessary, but this way the handling is consistent across all looping constructs.
Afterward, we take the processedArguments, perform another round of sanity checks, and then use two special functions to apply the semantic effects of the repeat loop: we first use one of flowR's linkers, linkCircularRedefinitionsWithinALoop, and then retrieve the active exit points with filterOutLoopExitPoints.
Feel free to have a look around and explore the other handlers for now. Each of them uses the results of its children alongside the active backpack to produce a new dataflow information.
Given the dataflow graph, you can do a lot more!
You can issue queries to explore the graph, search for specific elements, or, for example, request a static backward slice.
Of course, all of these endeavors work not just with the RShell
but also with the tree-sitter
engine.
The slicing is available as an extra step, as you can see by inspecting the DEFAULT_SLICING_PIPELINE.
Besides STATIC_SLICE, it contains a NAIVE_RECONSTRUCT step to print the slice as (executable) R code.
Your main point of interest here is the staticSlicing function, which relies on a modified breadth-first search to collect all nodes that are part of the slice.
For more information on how the slicing works, please refer to the tool demonstration (Section 3.2) or the original master's thesis (Chapter 4).
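To convey the idea (a simplification that assumes we already have, per node, the set of nodes it depends on; the real staticSlicing also tracks environments and call contexts):

// Simplified backward slice: breadth-first search over "depends on" edges,
// starting from the slicing criteria and collecting everything reachable
function backwardSlice(
    dependsOn: ReadonlyMap<number, readonly number[]>,
    criteria:  readonly number[]
): Set<number> {
    const inSlice = new Set<number>(criteria);
    const queue: number[] = [...criteria];
    while (queue.length > 0) {
        const current = queue.shift() as number;
        for (const dependency of dependsOn.get(current) ?? []) {
            if (!inSlice.has(dependency)) {
                inSlice.add(dependency);
                queue.push(dependency);
            }
        }
    }
    return inSlice;
}

For the dataflow graph of x <- 1; print(x) shown above, slicing for the print call (id 6) would follow its reads edge to x (id 4) and onward to the assignment (ids 0, 2, 1), keeping the definition of x in the slice.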
You can explore the slicing using the REPL with the :slicer
command:
$ docker run -it --rm eagleoutice/flowr # or npm run flowr
flowR repl using flowR v2.2.15, R v4.5.0 (r-shell engine)
R> :slicer test/testfiles/example.R --criterion "12@product"
Output
product <- 1
N <- 10
for(i in 1:(N-1)) product <- product * i
product
Slice for the example file for the variable "product" in line 12.
When using flowR from the CLI, you can use the --verbose option to get more information about what flowR is doing.
While coding, however, you can set the minimum level of logs to be displayed (this works with the FlowrLogger abstraction).
In general, you can configure the levels of individual logs, such as the general log (obtained with getActiveLog) or the parseLog.
Please note that flowR makes no guarantees that log outputs are persistent across versions, and it is up to the implementors to provide sensible logging.
If you are an implementor and want to add logging, please make sure there are no larger runtime implications when logging is disabled.
Have a look at the expensiveTrace function, for example, which uses a function to generate the log message only when the log level is reached.
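The pattern behind expensiveTrace can be sketched like this (simplified, hypothetical API; flowR's actual logger is the FlowrLogger mentioned above):

// Lazy log messages: the closure is only evaluated when the level is active,
// so disabled logging costs almost nothing
function expensiveTraceSketch(activeLevel: number, messageLevel: number, message: () => string): void {
    if (messageLevel >= activeLevel) {
        console.log(message()); // potentially costly string construction happens only here
    }
}

// the template string (and the JSON.stringify) is never built when tracing is off
expensiveTraceSketch(2, 1, () => `graph: ${JSON.stringify({ vertices: 1000 })}`);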