Gogll v3

Copyright 2021 Aaron Moss

Copyright 2019 Marius Ackerman

This document contains a documented BNF specification for gogll 3. The formal specification is contained in the markdown code blocks, which are delimited by triple backticks. Gogll treats all text outside the code blocks as whitespace.

This specification is an example of a gogll input specification. This version of gogll was generated by gogll from this specification.

Gogll treats this paragraph as whitespace. The following is the first code block of this document.

package "github.com/bruceiv/pegll"

Every gogll v3 input specification starts with a package specification. The package of this specification is "github.com/bruceiv/pegll". The packages generated by gogll will have this package prefix. For example: the gogll parser generated from this grammar has the following imports:

"github.com/bruceiv/pegll/lexer"
"github.com/bruceiv/pegll/parser/bsr"
"github.com/bruceiv/pegll/parser/slot"
"github.com/bruceiv/pegll/parser/symbols"
"github.com/bruceiv/pegll/token"
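
As a small illustration of how the package prefix carries through to the generated code, the following Go sketch joins a prefix with the sub-package names listed above. The function `generatedImports` is hypothetical and is not part of gogll or of this grammar; only the sub-package names are taken from the imports shown.

```go
package main

import "fmt"

// generatedImports returns the import paths of the sub-packages listed in
// this document for a given package prefix. The function is a hypothetical
// helper for illustration; it is not part of gogll.
func generatedImports(pkg string) []string {
	subPackages := []string{
		"lexer",
		"parser/bsr",
		"parser/slot",
		"parser/symbols",
		"token",
	}
	paths := make([]string, 0, len(subPackages))
	for _, sub := range subPackages {
		paths = append(paths, pkg+"/"+sub)
	}
	return paths
}

func main() {
	for _, p := range generatedImports("github.com/bruceiv/pegll") {
		fmt.Println(p)
	}
}
```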

The following grammar defines the input specification for gogll. GoGLL is the syntax start symbol and the root of the parse forest produced by the parser.

GoGLL : Package Rules ;

Package : "package" string_lit ;

Rules
    :   Rule            
    |   Rule Rules  
    ;

Rule : LexRule | SyntaxRule ;

The package specification is followed by one or more rules. Each rule can be a LexRule (token specification for the generated lexer) or a SyntaxRule (syntax specification for the generated parser). The first SyntaxRule is taken as the syntax start symbol.

Lexical Symbols

Gogll accepts UTF-8 input. In the following, "character" refers to a Unicode character. Gogll has the following built-in lexical symbols, which are used to specify tokens:

| Symbol | Matches |
|---|---|
| `.` | Any character |
| `any` | Any character in the string literal following `any` |
| `not` | All characters except those in the string literal following `not` |
| `char_lit` | The specified character literal. See the lexical definition for `char_lit` below. |
| `letter` | The Unicode category L |
| `upcase` | Any uppercase letter |
| `lowcase` | Any lowercase letter |
| `number` | The Unicode category N |
| bracketed expression | Grouped, optional or repeated alternates |

any, not, letter, upcase, lowcase and number are the only reserved words of gogll.

LexSymbol : "." | "any" string_lit | char_lit | LexBracket | "not" string_lit | UnicodeClass ;
UnicodeClass : "letter" | "upcase" | "lowcase" | "number" ;
char_lit : '\'' (not "'" | '\\' any "\\'nrt") '\'' ;
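
The built-in symbols above can be read as predicates over a single Unicode character. The Go sketch below shows one possible reading using the standard `unicode` and `strings` packages; it is illustrative only, is not part of this grammar, and the mapping of `upcase` and `lowcase` to `unicode.IsUpper` and `unicode.IsLower` is an assumption rather than a statement about the generated lexer. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// matchesClass reports whether r belongs to the named built-in class.
// Unknown class names match nothing.
func matchesClass(class string, r rune) bool {
	switch class {
	case "letter":
		return unicode.IsLetter(r) // Unicode category L
	case "upcase":
		return unicode.IsUpper(r) // assumed reading of "any uppercase letter"
	case "lowcase":
		return unicode.IsLower(r) // assumed reading of "any lowercase letter"
	case "number":
		return unicode.IsNumber(r) // Unicode category N
	}
	return false
}

// matchesAny mirrors `any "set"`: r must be one of the characters in set.
func matchesAny(set string, r rune) bool {
	return strings.ContainsRune(set, r)
}

// matchesNot mirrors `not "set"`: r must not be one of the characters in set.
func matchesNot(set string, r rune) bool {
	return !strings.ContainsRune(set, r)
}

func main() {
	fmt.Println(matchesClass("letter", 'é'))
	fmt.Println(matchesClass("number", '7'))
	fmt.Println(matchesAny("\\'nrt", 'n'))
	fmt.Println(matchesNot("'", '\''))
}
```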

char_lit is a character literal enclosed in single quotes. A char literal may be an escaped character:

| char_lit | Description |
|---|---|
| `'\''` | single quote |
| `'\\'` | backslash |
| `'\n'` | newline |
| `'\r'` | carriage return |
| `'\t'` | horizontal tab |
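
To make the escape rules concrete, here is a minimal Go sketch that decodes the source text of a char_lit into the character it denotes. It follows the char_lit production literally (one character other than a single quote, or a backslash followed by one of `\`, `'`, `n`, `r`, `t`). The function `decodeCharLit` is hypothetical, is not taken from the gogll sources, and this block is not part of the grammar.

```go
package main

import "fmt"

// decodeCharLit converts the source text of a char_lit, including its
// enclosing single quotes, into the rune it denotes.
func decodeCharLit(lit string) (rune, error) {
	rs := []rune(lit)
	if len(rs) < 3 || rs[0] != '\'' || rs[len(rs)-1] != '\'' {
		return 0, fmt.Errorf("not a char_lit: %q", lit)
	}
	body := rs[1 : len(rs)-1]
	switch {
	case len(body) == 1 && body[0] != '\'':
		// First alternate: not "'"
		return body[0], nil
	case len(body) == 2 && body[0] == '\\':
		// Second alternate: '\\' any "\\'nrt"
		switch body[1] {
		case '\\':
			return '\\', nil
		case '\'':
			return '\'', nil
		case 'n':
			return '\n', nil
		case 'r':
			return '\r', nil
		case 't':
			return '\t', nil
		}
	}
	return 0, fmt.Errorf("invalid char_lit: %q", lit)
}

func main() {
	for _, lit := range []string{`'a'`, `'\n'`, `'\''`, `'\x'`} {
		r, err := decodeCharLit(lit)
		fmt.Printf("%s -> %q, err=%v\n", lit, r, err)
	}
}
```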

Lexical symbols may be grouped and groups may be optional or repeated. For example:

| Bracketed expression | Meaning |
|---|---|
| ( 𝜶 ) | 𝜶 must occur once |
| [ 𝜶 ] | 𝜶 may occur zero or one times |
| { 𝜶 } | 𝜶 may occur zero or more times |
| < 𝜶 > | 𝜶 must occur one or more times |

A bracketed expression may consist of one or more alternates, 𝜶 → β | 𝞬 | ... | δ. Each alternate is a RegExp consisting of one or more LexSymbols.

LexBracket : LexGroup | LexOptional | LexZeroOrMore | LexOneOrMore ;
LexGroup : "(" LexAlternates ")" ;
LexOptional : "[" LexAlternates "]" ;
LexZeroOrMore : "{" LexAlternates "}" ;
LexOneOrMore : "<" LexAlternates ">" ;
LexAlternates : RegExp | RegExp "|" LexAlternates ;
RegExp : LexSymbol | LexSymbol RegExp ;

Lexical Rules

Gogll generates a lexer from the set of string_lit SyntaxSymbols in the specification (see Syntax Rules below) plus the set of LexRules defined by the user.

LexRule
    : tokid ":" RegExp ";"
    | "!" tokid ":" RegExp ";"
    ;

The first alternate of LexRule is a normal token definition. The second alternate, which starts with !, defines a token that will be suppressed by the lexer. One use of suppressed tokens is to define code comments. See example

tokid : lowcase <letter|number|'_'> ; 

tokid is a token ID, which starts with a lowercase letter, followed by one or more letters, numbers or '_' characters. The following production, which defines a token string_lit, is an example of a LexRule.

string_lit : '"' {not "\\\"" | '\\' any "\\\"nrt"} '"' ;

string_lit defines a (possibly empty) string literal. Note that string_lit does not have the same set of escape characters as char_lit.
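
For readers more familiar with regular expressions, the string_lit rule can be approximated with Go's `regexp` package as sketched below. The pattern is a hand translation of the production above, not something gogll generates, and `isStringLit` is a hypothetical name. This block is illustrative and not part of the grammar.

```go
package main

import (
	"fmt"
	"regexp"
)

// stringLit is a hand-written equivalent of
//   '"' {not "\\\"" | '\\' any "\\\"nrt"} '"'
// An opening quote, then zero or more items that are either a character
// other than backslash and double quote, or a backslash followed by one of
// \ " n r t, then a closing quote.
var stringLit = regexp.MustCompile(`^"(?:[^\\"]|\\[\\"nrt])*"$`)

// isStringLit reports whether s as a whole is a valid string_lit.
func isStringLit(s string) bool {
	return stringLit.MatchString(s)
}

func main() {
	for _, s := range []string{`""`, `"package"`, `"a\"b"`, `"a\qb"`, `"open`} {
		fmt.Printf("%s -> %v\n", s, isStringLit(s))
	}
}
```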

Syntax Rules

Gogll uses the specified syntax rules to generate the parser.

SyntaxRule : nt ":" SyntaxAlternates ";"  ;
nt : upcase <letter|number|'_'> ;

nt is a SyntaxRule ID and stands for Non-terminal. An nt is distinguished from a tokid by its first character, which is upper case. SyntaxRule is an example of itself. The nt of the rule is SyntaxRule.
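
The distinction between tokid and nt can be sketched in Go as follows. The function `classifyID` is hypothetical and only mirrors the two productions above: a first character that is lowcase or upcase, followed by one or more letters, numbers or underscores (so an identifier has at least two characters). It does not check for reserved words, and this block is not part of the grammar.

```go
package main

import (
	"fmt"
	"unicode"
)

// classifyID returns "tokid", "nt" or "invalid" for an identifier, based on
// the tokid and nt productions. Reserved words are not handled here.
func classifyID(id string) string {
	rs := []rune(id)
	// <letter|number|'_'> requires at least one character after the first.
	if len(rs) < 2 {
		return "invalid"
	}
	for _, r := range rs[1:] {
		if !unicode.IsLetter(r) && !unicode.IsNumber(r) && r != '_' {
			return "invalid"
		}
	}
	switch {
	case unicode.IsLower(rs[0]):
		return "tokid"
	case unicode.IsUpper(rs[0]):
		return "nt"
	}
	return "invalid"
}

func main() {
	for _, id := range []string{"string_lit", "SyntaxRule", "x", "_tmp", "a-b"} {
		fmt.Printf("%s -> %s\n", id, classifyID(id))
	}
}
```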

SyntaxAlternates is the list of valid alternates of a syntax rule. Alternates can be either unordered (|) alternates or ordered (/) alternates. Unordered alternates may form an ambiguous parse, whereas ordered alternates are disambiguated by choosing the first matching alternate in the order provided.

SyntaxAlternates
    :   SyntaxAlternate                   
    |   SyntaxAlternate "|" UnorderedAlternates
    |   SyntaxAlternate "/" OrderedAlternates
    ;

UnorderedAlternates
    : SyntaxAlternate
    | SyntaxAlternate "|" UnorderedAlternates
    ;

OrderedAlternates
    : SyntaxAlternate
    | SyntaxAlternate "/" OrderedAlternates
    ;

Each SyntaxAlternate is a sequence of SyntaxSymbols. An optional syntax rule may have empty as one of its alternates.

SyntaxAlternate
    :   SyntaxSymbols                     
    |   "empty"                     
    ;

SyntaxSymbols
    :   SyntaxSymbol                      
    |   SyntaxSymbol SyntaxSymbols              
    ;

A SyntaxSymbol is a single element to be matched. Syntax symbols can be nonterminals, token IDs, or literal strings.

Any of these options may be preceded by one of the two lookahead operators, & and !. & is positive lookahead, which matches if its argument matches at the given position, without consuming any input itself (i.e. the next syntax symbol is matched starting at the same input position); ! is negative lookahead, which matches only if its argument does not match at the given position, but also consumes no input.

Lookahead operators may not be nested, but the two provided operators are sufficient to produce any desired semantics. Note that in PEGLL, some common uses of lookahead operators (e.g. excluding keywords from matching identifiers) are more-efficiently implemented as lexical rules rather than using the semantic lookahead operator.

SyntaxSymbol
    : "&" SyntaxAtom
    / "!" SyntaxAtom
    / SyntaxSuffix
    / SyntaxAtom
    ;

SyntaxAtom : nt | tokid | string_lit ;
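
The behaviour of the two lookahead operators can be sketched with small matcher combinators in Go. This is a simplified model over strings, written for illustration: the names `matcher`, `lit`, `and`, `not` and `seq` are hypothetical and do not come from the generated parser, and the block is not part of the grammar.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher tries to match at pos and returns the end position on success.
type matcher func(input string, pos int) (end int, ok bool)

// lit matches the fixed string s.
func lit(s string) matcher {
	return func(input string, pos int) (int, bool) {
		if strings.HasPrefix(input[pos:], s) {
			return pos + len(s), true
		}
		return pos, false
	}
}

// and models "&": it succeeds exactly when m succeeds, but the reported end
// is always pos, so no input is consumed.
func and(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		_, ok := m(input, pos)
		return pos, ok
	}
}

// not models "!": it succeeds exactly when m fails, and consumes no input.
func not(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		_, ok := m(input, pos)
		return pos, !ok
	}
}

// seq matches each matcher in turn, each starting where the previous ended.
func seq(ms ...matcher) matcher {
	return func(input string, pos int) (int, bool) {
		cur := pos
		for _, m := range ms {
			end, ok := m(input, cur)
			if !ok {
				return pos, false
			}
			cur = end
		}
		return cur, true
	}
}

func main() {
	notAB := seq(not(lit("ab")), lit("a")) // !"ab" "a"
	andAB := seq(and(lit("ab")), lit("a")) // &"ab" "a"
	fmt.Println(notAB("ac", 0))
	fmt.Println(notAB("ab", 0))
	fmt.Println(andAB("ab", 0))
}
```

In both sequences the symbol after the lookahead starts at the same position as the lookahead itself, which is the sense in which lookahead consumes no input.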

A SyntaxSuffix is a SyntaxAtom followed by a suffix operator, and is itself a SyntaxSymbol. Any SyntaxAtom may be followed by one of the three syntax suffixes, ?, *, or +, which are syntactic sugar for making it optional, repeatable zero or more times, or repeatable one or more times, respectively.

? is the optional suffix: its argument is matched if possible, and otherwise the match still succeeds without consuming any input. The expression is rewritten as a rule with an ordered empty alternate, which makes it optional, and has the following form: Optional : Expr / empty ;

* is the zero-or-more suffix: its argument is matched as many times in succession as possible, including not at all. The expression is rewritten as a recursive rule with an ordered empty alternate, which makes it repeatable zero or more times, and has the following form: Rep0x : Expr Rep0x / empty ;

+ is the one-or-more suffix: its argument must match once and is then matched as many further times as possible. The expression is rewritten as one required occurrence followed by a zero-or-more repetition (Rep0x : Expr Rep0x / empty ;). Because at least one occurrence is required, the rule itself has no empty alternate, and has the following form: Rep1x : Expr Rep0x ;

SyntaxSuffix : SyntaxAtom "?" 
             | SyntaxAtom "*" 
             | SyntaxAtom "+" ;
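
The intended meaning of the three suffixes can be sketched with toy matchers in Go. This is a simplified model rather than the rewriting gogll performs: the names `matcher`, `lit`, `optional`, `zeroOrMore` and `oneOrMore` are hypothetical, repetition is written as a loop instead of a recursive rule, and one-or-more is read here as one required occurrence followed by zero or more. The block is not part of the grammar.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher tries to match at pos and returns the end position on success.
type matcher func(input string, pos int) (end int, ok bool)

// lit matches the fixed string s.
func lit(s string) matcher {
	return func(input string, pos int) (int, bool) {
		if strings.HasPrefix(input[pos:], s) {
			return pos + len(s), true
		}
		return pos, false
	}
}

// optional models "?": try m first, otherwise succeed on the empty match.
func optional(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		if end, ok := m(input, pos); ok {
			return end, true
		}
		return pos, true
	}
}

// zeroOrMore models "*": apply m repeatedly until it fails. The end == pos
// check stops the loop if m succeeds without consuming input.
func zeroOrMore(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		for {
			end, ok := m(input, pos)
			if !ok || end == pos {
				return pos, true
			}
			pos = end
		}
	}
}

// oneOrMore models "+": one required match of m, then zero or more.
func oneOrMore(m matcher) matcher {
	rest := zeroOrMore(m)
	return func(input string, pos int) (int, bool) {
		end, ok := m(input, pos)
		if !ok {
			return pos, false
		}
		return rest(input, end)
	}
}

func main() {
	ab := lit("ab")
	fmt.Println(optional(ab)("x", 0))
	fmt.Println(zeroOrMore(ab)("ababx", 0))
	fmt.Println(oneOrMore(ab)("ababx", 0))
	fmt.Println(oneOrMore(ab)("x", 0))
}
```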