Copyright 2021 Aaron Moss

Copyright 2019 Marius Ackerman
This document contains a documented BNF specification for gogll 3. The formal specification is contained in the markdown code blocks, which are delimited by triple backticks. Gogll treats all text outside the code blocks as whitespace.
This specification is an example of a gogll input specification. This version of gogll was generated by gogll from this specification.
Gogll treats this paragraph as whitespace. The following is the first code block of this document.
package "github.com/bruceiv/pegll"
Every gogll v3 input specification starts with a package specification. The package of this specification is `"github.com/bruceiv/pegll"`. The packages generated by gogll will have this package prefix.
For example: the gogll parser generated from this grammar has the following
imports:
"github.com/bruceiv/pegll/lexer"
"github.com/bruceiv/pegll/parser/bsr"
"github.com/bruceiv/pegll/parser/slot"
"github.com/bruceiv/pegll/parser/symbols"
"github.com/bruceiv/pegll/token"
The following grammar defines the input specification for gogll. `GoGLL` is the syntax start symbol and the root of the parse forest produced by the parser.
GoGLL : Package Rules ;
Package : "package" string_lit ;
Rules
: Rule
| Rule Rules
;
Rule : LexRule | SyntaxRule ;
The package specification is followed by one or more rules. Each rule can be a `LexRule` (token specification for the generated lexer) or a `SyntaxRule` (syntax specification for the generated parser).
The first `SyntaxRule` is taken as the syntax start symbol.
Gogll accepts UTF-8 input. In the following, "character" refers to a Unicode character. Gogll has the following built-in lexical symbols, which are used to specify tokens:
Symbol | Matches |
---|---|
. | Any character |
any | Any character in the string literal following any |
not | All characters except those in the string literal following not |
char_lit | The specified character literal. See the lexical definition for char_lit below. |
letter | The Unicode category L |
upcase | Any uppercase letter |
lowcase | Any lowercase letter |
number | The Unicode category N |
bracketed expression | Grouped, optional or repeated alternates |
`any`, `not`, `letter`, `upcase`, `lowcase` and `number` are the only reserved words of gogll.
LexSymbol : "." | "any" string_lit | char_lit | LexBracket | "not" string_lit | UnicodeClass ;
UnicodeClass : "letter" | "upcase" | "lowcase" | "number" ;
char_lit : '\'' (not "'" | '\\' any "\\'nrt") '\'' ;
`char_lit` is a character literal enclosed in single quotes. A char literal may be an escaped character:
char_lit | Description |
---|---|
`'\''` | single quote |
`'\\'` | backslash |
`'\n'` | newline |
`'\r'` | carriage return |
`'\t'` | horizontal tab |
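As an illustration only, the `char_lit` rule can be checked by hand with a few lines of Go. This is a minimal sketch, not the lexer that gogll generates; the function name `isCharLit` is hypothetical, and it tests whether a whole string matches the rule rather than scanning a token out of a longer input.

```go
package main

import (
	"fmt"
	"strings"
)

// isCharLit (hypothetical helper) reports whether s, taken as a whole, matches
//
//	char_lit : '\'' (not "'" | '\\' any "\\'nrt") '\'' ;
func isCharLit(s string) bool {
	r := []rune(s)
	switch len(r) {
	case 3:
		// First alternate: any single character except a single quote.
		return r[0] == '\'' && r[1] != '\'' && r[2] == '\''
	case 4:
		// Second alternate: a backslash followed by one of \ ' n r t.
		return r[0] == '\'' && r[1] == '\\' &&
			strings.ContainsRune(`\'nrt`, r[2]) && r[3] == '\''
	}
	return false
}

func main() {
	for _, s := range []string{`'a'`, `'\n'`, `'\''`, `'\x'`, `''`} {
		fmt.Printf("%-5s %v\n", s, isCharLit(s))
	}
}
```

The literal `'\x'` is rejected because `x` is not in the escape set, and `''` is rejected because the rule requires exactly one character or escape between the quotes.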
Lexical symbols may be grouped and groups may be optional or repeated. For example:
Bracketed expression | Meaning |
---|---|
`( 𝜶 )` | 𝜶 must occur once |
`[ 𝜶 ]` | 𝜶 may occur zero or one times |
`{ 𝜶 }` | 𝜶 may occur zero or more times |
`< 𝜶 >` | 𝜶 must occur one or more times |
A bracketed expression may consist of one or more alternates:
𝜶 → β | 𝞬 | ... | δ
Each alternate is a `RegExp` consisting of one or more `LexSymbol`s.
LexBracket : LexGroup | LexOptional | LexZeroOrMore | LexOneOrMore ;
LexGroup : "(" LexAlternates ")" ;
LexOptional : "[" LexAlternates "]" ;
LexZeroOrMore : "{" LexAlternates "}" ;
LexOneOrMore : "<" LexAlternates ">" ;
LexAlternates : RegExp | RegExp "|" LexAlternates ;
RegExp : LexSymbol | LexSymbol RegExp ;
Gogll generates a lexer from the set of `string_lit` `SyntaxSymbol`s in the specification (see Syntax Rules below) plus the set of `LexRule`s defined by the user.
LexRule
: tokid ":" RegExp ";"
| "!" tokid ":" RegExp ";"
;
The first alternate of `LexRule` is a normal token definition. The second alternate, which starts with `!`, defines a token that will be suppressed by the lexer. An example of the use of suppressed tokens is to define code comments.
See example
tokid : lowcase <letter|number|'_'> ;
`tokid` is a token ID, which starts with a lower case letter, followed by one or more `letter`, `number` or `'_'`.
The following production, which defines a token `string_lit`, is an example of a `LexRule`.
string_lit : '"' {not "\\\"" | '\\' any "\\\"nrt"} '"' ;
`string_lit` defines a (possibly empty) string literal. Note that `string_lit` does not have the same set of escape characters as `char_lit`: it allows an escaped double quote where `char_lit` allows an escaped single quote.
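The `string_lit` rule can be read as a small loop over the characters between the quotes. The following Go sketch shows that reading; it is an illustration and not the generated lexer, and the name `isStringLit` is hypothetical. As with any whole-string check, it assumes the candidate literal has already been isolated from the surrounding input.

```go
package main

import (
	"fmt"
	"strings"
)

// isStringLit (hypothetical helper) reports whether s, taken as a whole, matches
//
//	string_lit : '"' {not "\\\"" | '\\' any "\\\"nrt"} '"' ;
func isStringLit(s string) bool {
	r := []rune(s)
	if len(r) < 2 || r[0] != '"' || r[len(r)-1] != '"' {
		return false
	}
	body := r[1 : len(r)-1]
	for i := 0; i < len(body); i++ {
		switch body[i] {
		case '"':
			// An unescaped double quote would end the literal early.
			return false
		case '\\':
			// A backslash must be followed by one of \ " n r t.
			if i+1 >= len(body) || !strings.ContainsRune(`\"nrt`, body[i+1]) {
				return false
			}
			i++ // skip the escaped character
		}
	}
	return true
}

func main() {
	for _, s := range []string{`""`, `"a\"b"`, `"tab\t"`, `"\'"`, `"\"`} {
		fmt.Printf("%-8s %v\n", s, isStringLit(s))
	}
}
```

Note that `"\'"` is rejected here although `'\''` is a valid `char_lit`, which is the difference in escape sets mentioned above.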
Gogll uses the specified syntax rules to generate the parser.
SyntaxRule : nt ":" SyntaxAlternates ";" ;
nt : upcase <letter|number|'_'> ;
`nt` is a `SyntaxRule` ID and stands for non-terminal. An `nt` is distinguished from a `tokid` by its first character, which is upper case. `SyntaxRule` is an example of itself: the `nt` of the rule is `SyntaxRule`.
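The distinction between `nt` and `tokid` can be sketched in Go with the standard `unicode` package. This is an illustrative sketch under stated assumptions: `hasIDShape` and `classify` are hypothetical names, `unicode.IsLetter` and `unicode.IsNumber` stand in for the `letter` and `number` classes, and reserved words are not checked.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasIDShape (hypothetical helper) reports whether r starts with a character
// accepted by first, followed by one or more of letter, number or '_'.
func hasIDShape(r []rune, first func(rune) bool) bool {
	if len(r) < 2 || !first(r[0]) {
		return false
	}
	for _, c := range r[1:] {
		if !unicode.IsLetter(c) && !unicode.IsNumber(c) && c != '_' {
			return false
		}
	}
	return true
}

// classify (hypothetical helper) returns "nt", "tokid" or "invalid",
// deciding between nt and tokid by the case of the first character.
func classify(id string) string {
	r := []rune(id)
	switch {
	case hasIDShape(r, unicode.IsUpper):
		return "nt"
	case hasIDShape(r, unicode.IsLower):
		return "tokid"
	}
	return "invalid"
}

func main() {
	for _, id := range []string{"SyntaxRule", "string_lit", "x", "_id"} {
		fmt.Printf("%-12s %s\n", id, classify(id))
	}
}
```

Because both rules use the one-or-more bracket `< >`, a single-character name such as `x` does not match either rule in this sketch.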
`SyntaxAlternates` is the list of valid alternates of a syntax rule. Alternates can be either unordered (`|`) or ordered (`/`). Unordered alternates may form an ambiguous parse; ordered alternates are disambiguated by choosing the first matching alternate in the order provided.
SyntaxAlternates
: SyntaxAlternate
| SyntaxAlternate "|" UnorderedAlternates
| SyntaxAlternate "/" OrderedAlternates
;
UnorderedAlternates
: SyntaxAlternate
| SyntaxAlternate "|" UnorderedAlternates
;
OrderedAlternates
: SyntaxAlternate
| SyntaxAlternate "/" OrderedAlternates
;
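The difference between unordered and ordered alternates can be seen in a small Go sketch. It is a deliberate simplification: each alternate is a plain string literal, and the result is a list of end positions rather than a parse forest. The names `matchAlt`, `unordered` and `ordered` are hypothetical and are not part of any generated package.

```go
package main

import (
	"fmt"
	"strings"
)

// matchAlt returns the end position after matching the literal alt at pos,
// or -1 if alt does not match there.
func matchAlt(input string, pos int, alt string) int {
	if strings.HasPrefix(input[pos:], alt) {
		return pos + len(alt)
	}
	return -1
}

// unordered tries every alternate and keeps every successful end position,
// so more than one result means the choice is ambiguous.
func unordered(input string, pos int, alts []string) []int {
	var ends []int
	for _, a := range alts {
		if e := matchAlt(input, pos, a); e >= 0 {
			ends = append(ends, e)
		}
	}
	return ends
}

// ordered commits to the first alternate that matches, in the order given.
func ordered(input string, pos int, alts []string) []int {
	for _, a := range alts {
		if e := matchAlt(input, pos, a); e >= 0 {
			return []int{e}
		}
	}
	return nil
}

func main() {
	alts := []string{"a", "ab"}
	fmt.Println(unordered("abc", 0, alts)) // [1 2]
	fmt.Println(ordered("abc", 0, alts))   // [1]
}
```

With the alternates `"a"` and `"ab"`, the unordered choice keeps both matches, while the ordered choice stops at `"a"` and never tries `"ab"`.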
Each `SyntaxAlternate` is a sequence of `SyntaxSymbol`s. An optional syntax rule may have an alternate `empty`.
SyntaxAlternate
: SyntaxSymbols
| "empty"
;
SyntaxSymbols
: SyntaxSymbol
| SyntaxSymbol SyntaxSymbols
;
A `SyntaxSymbol` is a single element to be matched. Syntax symbols can be nonterminals, token IDs, or literal strings.
Any of these options may be preceded by one of the two lookahead operators, `&` and `!`. `&` is positive lookahead, which matches if its argument matches at the given position, without consuming any input itself (i.e. the next syntax symbol is matched starting at the same input position); `!` is negative lookahead, which matches only if its argument does not match at the given position, and also consumes no input.
Lookahead operators may not be nested, but the two provided operators are sufficient to produce any desired semantics. Note that in PEGLL, some common uses of lookahead operators (e.g. excluding keywords from matching identifiers) are more efficiently implemented as lexical rules than with a lookahead operator.
SyntaxSymbol
: "&" SyntaxAtom
/ "!" SyntaxAtom
/ SyntaxSuffix
/ SyntaxAtom
;
SyntaxAtom : nt | tokid | string_lit ;
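The behaviour of the two lookahead operators can be sketched with small matcher functions in Go. This is a character-level illustration under simplifying assumptions, not the generated parser: the `matcher` type and the helpers `lit`, `and`, `not` and `seq` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher tries to match input at pos and returns the position after the match.
type matcher func(input string, pos int) (int, bool)

// lit matches the literal string s and consumes it.
func lit(s string) matcher {
	return func(input string, pos int) (int, bool) {
		if strings.HasPrefix(input[pos:], s) {
			return pos + len(s), true
		}
		return pos, false
	}
}

// and is positive lookahead (&): it succeeds if m matches, consuming nothing.
func and(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		_, ok := m(input, pos)
		return pos, ok
	}
}

// not is negative lookahead (!): it succeeds if m fails, consuming nothing.
func not(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		_, ok := m(input, pos)
		return pos, !ok
	}
}

// seq matches each matcher in turn, each starting where the previous one ended.
func seq(ms ...matcher) matcher {
	return func(input string, pos int) (int, bool) {
		p := pos
		for _, m := range ms {
			next, ok := m(input, p)
			if !ok {
				return pos, false
			}
			p = next
		}
		return p, true
	}
}

func main() {
	positive := seq(and(lit("ab")), lit("a")) // like: &"ab" "a"
	negative := seq(not(lit("ab")), lit("a")) // like: !"ab" "a"
	for _, in := range []string{"ab", "ac"} {
		pEnd, pOK := positive(in, 0)
		nEnd, nOK := negative(in, 0)
		fmt.Println(in, pEnd, pOK, nEnd, nOK)
	}
}
```

In both rules the literal `"a"` is matched from position 0, because the lookahead in front of it leaves the input position unchanged.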
A `SyntaxSuffix` is a `SyntaxAtom` followed by a suffix operator, and is itself a `SyntaxSymbol`. The three syntax suffixes, `?`, `*` and `+`, are syntactic sugar for making the atom optional, repeatable zero or more times, or repeatable one or more times, respectively.
`?` is the optional suffix, written directly after a `SyntaxAtom`. The atom may match once or not at all. The expression becomes the first alternate of an ordered choice whose second alternate is `empty`, which makes it optional, and has the following form:
Optional : Expr / empty ;
`*` is the zero-or-more suffix, written directly after a `SyntaxAtom`. The atom may match any number of times, including none. The expression is followed by a recursive reference to the generated rule, with an ordered `empty` alternate that ends the repetition, and has the following form:
Rep0x : Expr Rep0x / empty ;
`+` is the one-or-more suffix, written directly after a `SyntaxAtom`. The atom must match once and may then repeat. The expression is followed by a reference to the zero-or-more rule for the same expression, and has the following form:
Rep1x : Expr Rep0x ;
SyntaxSuffix : SyntaxAtom "?"
| SyntaxAtom "*"
| SyntaxAtom "+" ;
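The meaning of the three suffixes can be sketched in Go with the same kind of matcher functions used for hand-written recognisers. This is an illustration of the stated semantics (optional, zero or more, one or more) and not the rule rewriting that gogll performs; the `matcher` type and the helpers `lit`, `opt`, `star` and `plus` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher tries to match input at pos and returns the position after the match.
type matcher func(input string, pos int) (int, bool)

// lit matches the literal string s and consumes it.
func lit(s string) matcher {
	return func(input string, pos int) (int, bool) {
		if strings.HasPrefix(input[pos:], s) {
			return pos + len(s), true
		}
		return pos, false
	}
}

// opt mirrors "Optional : Expr / empty": try Expr first, otherwise match empty.
func opt(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		if next, ok := m(input, pos); ok {
			return next, true
		}
		return pos, true
	}
}

// star mirrors "Rep0x : Expr Rep0x / empty", with the recursion written as a loop.
func star(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		for {
			next, ok := m(input, pos)
			if !ok || next == pos { // stop on failure or on an empty match
				return pos, true
			}
			pos = next
		}
	}
}

// plus requires one match of Expr and then behaves like star.
func plus(m matcher) matcher {
	return func(input string, pos int) (int, bool) {
		next, ok := m(input, pos)
		if !ok {
			return pos, false
		}
		return star(m)(input, next)
	}
}

func main() {
	a := lit("a")
	for _, in := range []string{"aaab", "b"} {
		oEnd, oOK := opt(a)(in, 0)
		sEnd, sOK := star(a)(in, 0)
		pEnd, pOK := plus(a)(in, 0)
		fmt.Println(in, oEnd, oOK, sEnd, sOK, pEnd, pOK)
	}
}
```

On the input `b`, `opt` and `star` both succeed without consuming anything, while `plus` fails because its argument never matches.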