DHIS2 Expression Language

What is DHIS2 expression language?

    // math
    1 + 1

    // functions
    log(100)

    // logic
    true && 2 > 1

    // dataResolver
    #{abcdefghijk}

Parsing - Grammars - Basics

Butchered (but useful) terminology:

Terminal:

something that is not composed of parts
a leaf in a tree

Examples: literals (numbers, constants, strings)

Non-Terminal:

something that is composed of parts
a node in a tree

Examples: operators

Production Rule:

NonTerminal = (NonTerminal | Terminal)+

The non-terminal on the left is defined as the sequence of non-terminal and terminals on the right.

More (usually):

a | b => a OR b (in order of precedence)
+ left block 1 or more times
* left block 0 or more times
? left block 0 or 1 times
{n} left block exactly n times
{n,m} left block at least n but at most m times
(a b) grouping to a block
[a-z] a character set of the letters a-z
or [a b] ~~could mean: same as (a b)?~~ (not here)

Parser Jargon

consume: moving current position forward in the input
gobble: consume and discard input (ignore, like WS)

CFGs (Context-Free Grammars)

https://en.wikipedia.org/wiki/Context-free_grammar

reply on precedence (order of declaration) to express intent
use a "solver" to find the "best match" given a set of rules

A toy expression language:

expr = term '*' term
     | term '+' term
term = number | constant

Example expression:

a + b * c + d

a possible AST

    +
   / \
  +   d
 / \
a   *
   / \
  b   c

or short

(a + (b * c)) + d

but could also be

a + ((b * c) + d)

Backtracking

Parsing is a left to right process...

((a + b) * c) + d

😩 that isn't right... lets go back and try again, this time trying something different

a + ((b * c) + d)

better 😌

Parsing is somewhere between linear and factorial complexity

PEG (Parsing Expression Grammar)

https://en.wikipedia.org/wiki/Parsing_expression_grammar

PEGs cannot be ambiguous (unlike CFGs)
if a string parses, it has exactly one valid parse tree
(presumably: OR not allowed in production rules)

The toy example again, defined slightly different

expr = term (op term)*
term = number | constant
op   = '+' | '*'

After a term if there is a + or * there must be another term and so forth, otherwise this is an illegal input for the language.

Parsing is linear*

* theoretically one could build backtracking into PEG parsers...

Lookahead

Usually refers to the number of "not yet consumed" characters in the input that need to be considered to make the right choice.

expr   = number | date
number = [0-9]+('.'[0-9]+)?
date   = [0-9]{4}'.'[0-9]{2}'.'[0-9]{2}

How can we decide that we are reading a number or a date?

=> A date has two .

We need to look ahead and see of if we find the second .

PEG grammar must be "decidable" purely based on lookahead.
Ideally the lookahead is 1 character only => fast, simple, unambiguous by design

How to do correct operator precedence in linear time?

a + b * c + d

Linearly parsed gives

a + (b * (c + d))

😩

What if operators are leaves? We get:

a, +, b, *, c, +, d

Everything is in a "flat" sequence of typed nodes.

Now we walk the "tree" and merge only the operator with the highest precedence into a structured operator with children:

1. a, +, (b * c), +, d
2. ((a + (b * c)) + d)

Voila. 😌

The time is still linear as there is a fixed number of operators to do.

CFG vs PEG

CFGs (ANTLR)

bad syntax choices ("collisions") are first recognised much later (solver hides them)
solver means a framework is used, which means limitations
multiple transformations because of the layers of abstraction
whitespace is hard to control as it is implicitly assumed
worst case complexity is factorial
=> bend your problem to suit the parser

PEGs

decidability problem forces to recognise and solve collisions right away
language methods are the "framework"
direct translation (as complicated as needed but not more)
just methods calling each other (possibly with a convenience layer on top)
whitespace is no different but needs explicit consideration and modelling
generally: "special" handling is not different
worst case complexity is linear*
=> write the parser to suit the problem

How do PEGs work?

Key idea:

void what(Input in, Context ctx);

what is the name of the token/block processed
Input
- is whatever is processed and "consumed" while parsing
Context
- is whatever is build, the "output"
- usually emitting the base dataResolver for creating nodes in an AST
- also might hold state like lookup by name

Example Grammar:

expr = term (op term)*
term = number | constant
op   = '+' | '*'

PEG parser:

void expr(Input in, Context ctx) {
    term(in, ctx);
    char c = in.lookahead();
    while (c == '+' || c == '*') { // isOperator(c)
        op(in, ctx);
        term(in, ctx);
        c = in.lookahead();
    }
}
void term(Input in, Context ctx) {
    in.consumeWhitespace();
    char c = in.lookahead();
    if (isDigit(c)) {
        number(in, ctx);
    } else if (isLetter(c)){
        constant(in, ctx);
    } else {
        throw in.error();    
    }
    in.consumeWhitespace();
}
void op(Input in, Context ctx) {
    char op = in.consume();
    ctx.emitOperator(op);
}
void number(Input in, Context ctx) {
    String n = in.consumeWhile(c -> isDigit(c) || c == '.');
    ctx.emitNumber(n);
}
void constant(Input in, Context ctx) {
    String c = in.consumeWhile(c -> isLetter(c));    
    ctx.emitConstant(c);
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PRESENTATION.md

PRESENTATION.md

DHIS2 Expression Language

Parsing - Grammars - Basics

Parser Jargon

CFGs (Context-Free Grammars)

Backtracking

PEG (Parsing Expression Grammar)

Lookahead

How to do correct operator precedence in linear time?

CFG vs PEG

How do PEGs work?

Files

PRESENTATION.md

Latest commit

History

PRESENTATION.md

File metadata and controls

DHIS2 Expression Language

Parsing - Grammars - Basics

Parser Jargon

CFGs (Context-Free Grammars)

Backtracking

PEG (Parsing Expression Grammar)

Lookahead

How to do correct operator precedence in linear time?

CFG vs PEG

How do PEGs work?