Doc #7

Open · wants to merge 2 commits into base `dev`
25 changes: 15 additions & 10 deletions doc/src/README.md
# Presentation

This book will introduce you to parsing and transliteration using
Beans. Beans is written in [Rust](https://www.rust-lang.org), and this
book therefore assumes familiarity with that language. However, it
assumes no prior knowledge of parsing techniques. The end goal is to
allow someone who has never written or used a parser to quickly become
productive at writing and using parsers.

Beans aims to be a general-purpose parser and lexer library, providing
both enough performance that you should never *need* something faster
(even though faster options exist), and enough expressiveness that you
never get stuck while using your parser. See the
[tradeoffs](details/tradeoff.md) section for more details.

Beans is free and open source, dual-licensed under MIT or GPL3+, at
your choice.
22 changes: 13 additions & 9 deletions doc/src/concepts/README.md
# Common Concepts

When parsing with Beans, as with most similar tools, three steps are
performed, in this order:
* [Lexing](lexer.md)
* [Parsing](parser.md)
* [Syntax tree building](ast.md)

The first step, lexing, operates directly on plain-text input, while
the last is in charge of producing the abstract syntax tree. For more
details on the operations that can be performed on the latter, please
refer to the chapter [Rewriting the AST](ast/README.md).

# Simple arithmetic expression

Throughout the explanation of the core concepts of parsing, we will
write simple grammars to parse a language of simple arithmetic
expressions, consisting of numbers and binary operations (addition,
multiplication, subtraction and division) on expressions. All the
grammars are available at https://github.com/jthulhu/beans, in the
directory `doc/examples/arith`.
1 change: 1 addition & 0 deletions doc/src/concepts/grammars.md
# Grammars

113 changes: 65 additions & 48 deletions doc/src/concepts/lexer.md

## What does a lexer do?

A lexer performs the initial, important step of grouping together
characters that belong to a single unit of meaning, while discarding
useless ones. For instance, in most programming languages, spaces are
only useful to separate words; they have no intrinsic meaning of their
own. Therefore, they should be dropped by the lexer, whereas all the
characters that form an identifier or a keyword should be grouped
together to form a single *token*.

> Note: a *token*, also called a *terminal symbol* or, more shortly, a
> *terminal*, is a minimal span of the input text with an identified
> meaning. For instance, any identifier, keyword or operator would be
> considered a token.
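To make the grouping-and-discarding idea concrete, here is a minimal sketch of a lexer for the arithmetic language defined later in this chapter. It is plain Python with the standard `re` module, and the token names merely mirror the ones used below; none of this is Beans' actual implementation.

```python
import re

# Illustrative sketch only, not Beans' implementation.
# Each pair is (token name, regular expression); order matters.
TOKEN_SPEC = [
    ("INTEGER", r"\d+"),
    ("ADD", r"\+"),
    ("MULTIPLY", r"\*"),
    ("SUBTRACT", r"-"),
    ("DIVIDE", r"/"),
    ("SPACE", r"\s+"),  # matched, then thrown away
]

def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "SPACE":  # whitespace has no meaning: discard it
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"could not lex at character {pos}")
    return tokens
```

With this sketch, `lex("1+2*3")` and `lex("1 + 2*3")` produce the same token stream, which is exactly the behaviour we want from the real lexer.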

Both the parser and the lexer in Beans use *online* algorithms,
meaning that they consume their input incrementally, as they process
it, rather than reading it whole beforehand. Beans' lexer consumes the
input string one Unicode character at a time. The lexer might
backtrack, but in practice this is very rare; non-degenerate grammars
never trigger such backtracking.

As the lexer reads the input, it produces tokens. Sometimes (as with
whitespace) it discards them. Other times, it may forget what the
exact characters were and remember only which token has been read.
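The *online* consumption described above can be sketched with a Python generator (an illustration of the idea, not Beans' code): characters are pulled from the stream one at a time, so tokens can be produced before the whole input is available.

```python
import io

def chars(stream):
    """Yield the input one character at a time, as an online lexer would."""
    while True:
        c = stream.read(1)  # consume lazily: nothing is read upfront
        if not c:           # an empty read signals end of input
            return
        yield c

# Characters become available one by one as the stream is consumed.
source = chars(io.StringIO("1+2"))
assert next(source) == "1"
```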

## Regular expression

Each terminal in Beans is recognized by matching its associated
regular expression. Prior knowledge of regular expressions is
assumed. Since regular expressions have many different specifications,
here is an exhaustive list of features allowed in Beans regular
expressions, besides the usual disjunction operator `|`, character
classes `[...]` and `[^...]`, and repetition with `+`, `*` and `?`.

In the table below, `ϵ` denotes the empty string: an escape that
matches `ϵ` matches at a position in the input without consuming any
character.

| Escaped character | Name           | Meaning                                                                   |
|-------------------|----------------|---------------------------------------------------------------------------|
| `\b`              | Word boundary  | matches `ϵ` if the previous or the next character is not a word character |
| `\w`              | Word character | equivalent to `[a-zA-Z0-9]`                                               |
| `\t`              | Tabulation     | matches a tabulation                                                      |
| `\Z` or `\z`      | End of file    | matches `ϵ` at the end of the input                                       |
| `\d`              | Digit          | equivalent to `[0-9]`                                                     |
| `\n`              | Newline        | matches an end of line                                                    |
| `\s`              | Whitespace     | matches whatever Unicode considers whitespace                             |
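These escapes can be experimented with in most regex engines. The snippet below uses Python's standard `re` module as a stand-in; note that Python's dialect is close to but not identical with Beans' (for instance, Python's `\w` also matches `_` and non-ASCII letters, whereas Beans' is exactly `[a-zA-Z0-9]`).

```python
import re

# \d groups digits, exactly like [0-9]
assert re.findall(r"\d+", "1+23*456") == ["1", "23", "456"]

# \b matches the empty string at a word boundary, consuming nothing
assert re.search(r"\bword\b", "a word here") is not None
assert re.search(r"\bword\b", "password") is None

# \s matches whitespace, including tabs and newlines
assert re.fullmatch(r"\s+", " \t\n") is not None
```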

# Simple arithmetic lexer

Let's try to write a lexer grammar for the simple arithmetic
expression language. Ideally, we would like to parse expressions such
as `1+2*3`. So let's start by defining an integer token. In
`arith.lx`, write
```beans-lx
INTEGER ::= \d+
```
Let's go through this first definition. `INTEGER` is the name of the
terminal, whereas what is on the right side of `::=` is the regular
expression used to match it.

> Note: spaces between `::=` and the start of the regular expression
> are ignored, but every other space is taken into account, including
> trailing ones, which are easy to overlook. If the regular expression
> starts with a space, you can always wrap it in a singleton class
> `[ ]`.

> Note: terminals are always SCREAMING CASED. While this is neither
> very readable nor practical to type, it is consistent with the
> literature, and will later allow you to distinguish between
> variables (which will be snake_cased), non-terminals (which will be
> Pascal Cased) and terminals.

We can also add the terminals for the four other operators
```beans-lx
ADD ::= \+
MULTIPLY ::= \*
SUBTRACT ::= -
DIVIDE ::= /
```
If we were to try to lex a file `input` containing the expression
`1+2*3`, we would get
```bash
$ beans lex --lexer arith.lx input
INTEGER
ADD
INTEGER
MULTIPLY
INTEGER
Error: Could not lex anything in file input, at character 5 of line 1.
$
```
This is bad for two reasons. The first is, of course, that we get an
error. This happens because our file ends with a newline `\n`, and no
terminal matches it. In fact, we would also have a problem if we tried
to lex `1 + 2*3`, because no terminal can read spaces. However, we
also *don't* want to produce any token for such spaces: `1+2*3` and
`1 + 2*3` should be lexed identically. Thus we will introduce a
`SPACE` token with the `ignore` flag, telling the lexer not to output
it, and similarly a `NEWLINE` token.
```beans-lx
ignore SPACE ::= \s+
ignore NEWLINE ::= \n+
```
Lexing the same file again now succeeds:
```bash
$ beans lex --lexer arith.lx input
INTEGER
ADD
INTEGER
MULTIPLY
INTEGER
$
```
Nice!

However, we now face the second issue: it was probably wise to forget
the specific character that was lexed to `ADD` or `MULTIPLY`, because
we don't care; but we don't want to forget the actual integer we
lexed. To fix this, we will use regex groups. In `arith.lx`, replace
the definition of `INTEGER` with
```beans-lx
INTEGER ::= (\d+)
```
This creates a group containing everything that `\d+` matches, and
this information is passed along with the created token.
```bash
$ beans lex --lexer arith.lx input
INTEGER {0: 1}
ADD
INTEGER {0: 2}
MULTIPLY
INTEGER {0: 3}
$
```
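The same group mechanism exists in most regex engines. As a sketch using Python's `re` (not Beans' API; also note that Beans numbers groups from 0, as in `{0: 1}` above, while Python numbers them from 1): the parenthesized group captures the matched digits, which is how the token can carry its value.

```python
import re

# (\d+) creates a capture group holding exactly the digits matched
m = re.match(r"(\d+)", "42+1")
assert m is not None
assert m.group(1) == "42"  # the captured text travels with the match
# Without a group, only the fact that *some* integer matched remains;
# with it, the lexer can attach "42" to the INTEGER token it emits.
```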
We will see in the next section how to manipulate a stream of tokens.
