Recipe: indentation-sensitive languages #246

Open
wants to merge 9 commits into
base: main
from
135 changes: 135 additions & 0 deletions hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md
@@ -0,0 +1,135 @@
---
Contributor

Question about indentation in the implementation: How do you distinguish between spaces and tabs?
This could be an interesting configuration point to show here, maybe in its own section or as an appendix.
How would this align with an editor config (see https://editorconfig.org) or other approaches?

I guess you choose only spaces or tabs for the WS token, right?

Member Author

That's a good question!
I thought a lot about the best approach here, and in the end decided not to discriminate between them, which is the simpler way. Alternatives included allowing only one or the other through a config parameter, or treating a tab as n spaces (again, for a configurable n). I thought these 2 alternatives were a bit too strict (though that's how Python behaves, for example, by prohibiting mixing them), and I thought that ideally I could issue a warning, but I couldn't find a way to accept a token and still issue a warning/error.

Member Author

Actually, now that I think about it, I could add some payload to the returned token and then in the lexer check for the payload and add to the errors array, but then there would still be no way of making it a warning rather than an error. Perhaps LexerResult should be augmented to allow warnings?

Contributor

I think resolution of my question would block this recipe. I mean we can still change something afterwards. Extending the LexerResult sounds too much for this change.

Another question could be: how would one write an indentation-aware formatter? Is it even applicable or doable? How is it done for Python?
We do not have to answer this now. I was just interested in some consequences or follow-up tasks.

Member Author

@aabounegm aabounegm Aug 30, 2024

I do not yet have experience writing formatters in Langium, but I don't see why it would be difficult to do. Generally, there are 2 approaches: formatting and pretty printing. One way to implement a formatter is to search for some (anti-)patterns in the code and issue TextEdits just for them. Pretty printers normally take the AST/CST (or some other intermediate representation) and transform it back into code, regardless of how the code looked before parsing (or at least that's how I understand the difference between them).
Both approaches seem possible with the indentation-aware tokens, though the second one (pretty printer) is probably easier to implement, assuming we want the formatter to ensure consistent indentation characters and sizes.

For Python, one of the most popular formatters is black, and it uses the pretty printing approach. Not sure how other formatters handle inconsistent indentation, tbh.

title: Indentation-sensitive languages
weight: 300
---

Some programming languages (such as Python, Haskell, and YAML) use indentation to denote nesting, as opposed to special non-whitespace tokens (such as `{` and `}` in C++/JavaScript).
This can be difficult to express in the EBNF notation used for defining a language grammar in Langium, which is context-free.
To achieve that, you can make use of synthetic tokens in the grammar which you would then redefine using Chevrotain in a custom token builder.

Starting with Langium v3.2, such a token builder (and an accompanying lexer) is provided for easy plugging into your language.
They work by replacing the underlying Chevrotain token generated for each of your indentation terminals with one that uses a custom matcher function. Unlike a plain regular expression, the matcher has access to more context, which allows it to store state and detect _changes_ in indentation levels. This is why you should provide the token builder with the names of the tokens you used to denote indentation: so that it can override the correct tokens for your grammar.
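To illustrate the idea of a stateful matcher, here is a minimal standalone sketch of indentation tracking with a stack. It is not the Langium implementation: the `SketchToken` type and the `tokenizeIndentation` function are hypothetical names made up for this example, whole lines are treated as single tokens, and tabs and spaces are assumed to count as one column each.

```typescript
// Hypothetical token shape for this sketch; not a Langium or Chevrotain type.
type SketchToken = { type: 'INDENT' | 'DEDENT' | 'LINE'; text: string };

function tokenizeIndentation(source: string): SketchToken[] {
    const tokens: SketchToken[] = [];
    // Indentation widths of the blocks that are currently open.
    const indentStack: number[] = [0];

    for (const line of source.split(/\r?\n/)) {
        // Blank lines do not change the indentation level.
        if (line.trim() === '') continue;

        // Assumption: a tab and a space both count as one column.
        const width = line.match(/^[\t ]*/)![0].length;

        if (width > indentStack[indentStack.length - 1]) {
            // Deeper than the innermost open block: open a new one.
            indentStack.push(width);
            tokens.push({ type: 'INDENT', text: '' });
        } else {
            // Close every open block that is deeper than this line.
            while (width < indentStack[indentStack.length - 1]) {
                indentStack.pop();
                tokens.push({ type: 'DEDENT', text: '' });
            }
        }
        tokens.push({ type: 'LINE', text: line.trim() });
    }

    // Close the blocks that are still open at the end of the input.
    while (indentStack.length > 1) {
        indentStack.pop();
        tokens.push({ type: 'DEDENT', text: '' });
    }
    return tokens;
}
```

The important part is that `indentStack` persists between lines, which a single regular expression cannot do. The sketch does not report a dedent to a width that was never opened; a real lexer would likely treat that as an error.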

## Configuring the token builder and lexer

To be able to use the indentation tokens in your grammar, you first have to import and register the [`IndentationAwareTokenBuilder`](https://github.com/eclipse-langium/langium/blob/bfca81f9e2411dd25a73f6b2711470e2c33788ed/packages/langium/src/parser/indentation-aware.ts#L78)
and [`IndentationAwareLexer`](https://github.com/eclipse-langium/langium/blob/bfca81f9e2411dd25a73f6b2711470e2c33788ed/packages/langium/src/parser/indentation-aware.ts#L358)
services in your module as follows:

```ts
import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium';

// ...
export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = {
    // ...
    parser: {
        TokenBuilder: () => new IndentationAwareTokenBuilder(),
        Lexer: (services) => new IndentationAwareLexer(services),
    },
};
// ...
```

The `IndentationAwareTokenBuilder` constructor optionally accepts an object defining the names of the tokens you used to denote indentation and whitespace in your `.langium` grammar file, as well as a list of delimiter tokens inside of which indentation should be ignored. It defaults to:
```ts
{
    indentTokenName: 'INDENT',
    dedentTokenName: 'DEDENT',
    whitespaceTokenName: 'WS',
    ignoreIndentationDelimiters: [],
}
```

### Ignoring indentation between specific tokens

Sometimes, it is necessary to ignore any indentation token inside some expressions, such as with tuples and lists in Python. For example, in the following statement:
```python
x = [
    1,
    2
]
```
any indentation between `[` and `]` should be ignored.

To achieve similar behavior with the `IndentationAwareTokenBuilder`, the `ignoreIndentationDelimiters` option can be used.
It accepts a list of pairs of token names (terminal or keyword) and turns off indentation token detection between each pair.

For example, if you construct the `IndentationAwareTokenBuilder` with the following options:
```ts
new IndentationAwareTokenBuilder({
    ignoreIndentationDelimiters: [
        ['[', ']'],
        ['(', ')'],
    ],
})
```
then no indentation tokens will be emitted between either of those pairs of tokens.
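
One way to picture this behavior is a nesting counter that is incremented on an opening delimiter and decremented on a closing one, with indentation tokens suppressed while the counter is positive. The following sketch only illustrates that idea on a list of token names. `dropIndentationInside` is a hypothetical helper, not part of Langium, and it filters an existing token list, whereas the actual token builder avoids emitting the tokens in the first place.

```typescript
type DelimiterPair = [string, string];

// Hypothetical helper: removes INDENT/DEDENT names found between delimiters.
function dropIndentationInside(tokenNames: string[], pairs: DelimiterPair[]): string[] {
    const openers = new Set(pairs.map(([open]) => open));
    const closers = new Set(pairs.map(([, close]) => close));
    const result: string[] = [];
    // Number of delimiter pairs currently open.
    let depth = 0;

    for (const name of tokenNames) {
        if (openers.has(name)) {
            depth++;
        } else if (closers.has(name)) {
            depth = Math.max(0, depth - 1);
        }
        const isIndentation = name === 'INDENT' || name === 'DEDENT';
        // Inside at least one pair, indentation tokens are skipped.
        if (depth > 0 && isIndentation) continue;
        result.push(name);
    }
    return result;
}
```

For simplicity, the sketch does not check that a closing delimiter matches the kind of the most recent opening one.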

### Configuration options type safety

The `IndentationAwareTokenBuilder` supports generic type parameters to improve type-safety and IntelliSense of its options.
This helps detect when a token name has been mistyped or changed in the grammar.
The first generic parameter corresponds to the names of terminal tokens, while the second one corresponds to the names of keyword tokens.
Both parameters are optional and can be imported from `./generated/ast.js` and used as follows:

```ts
import { MyLanguageTerminalNames, MyLanguageKeywordNames } from './generated/ast.js';
import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium';

// ...
export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = {
    parser: {
        TokenBuilder: () => new IndentationAwareTokenBuilder<MyLanguageTerminalNames, MyLanguageKeywordNames>({
            ignoreIndentationDelimiters: [
                ['L_BRAC', 'R_BARC'], // <-- This typo will now cause a TypeScript error
            ]
        }),
        Lexer: (services) => new IndentationAwareLexer(services),
    },
};
```

## Writing the grammar

In your `.langium` grammar file, you have to define terminals with the same names you passed to `IndentationAwareTokenBuilder` (or the defaults shown above if you did not override them).
For example, let's define the grammar for a simple version of Python with support for only `if` and `return` statements, and only booleans as expressions:

```langium
grammar PythonIf

entry Statement: If | Return;

If:
    'if' condition=BOOLEAN ':'
    INDENT thenBlock+=Statement+
    DEDENT
    ('else' ':'
        INDENT elseBlock+=Statement+
        DEDENT)?;

Return: 'return' value=BOOLEAN;

terminal BOOLEAN returns boolean: /true|false/;
terminal INDENT: 'synthetic:indent';
terminal DEDENT: 'synthetic:dedent';
hidden terminal WS: /[\t ]+/;
hidden terminal NL: /[\r\n]+/;
```

The important terminals here are `INDENT`, `DEDENT`, and `WS`.
`INDENT` and `DEDENT` are used to delimit a nested block, similar to `{` and `}` (respectively) in C-like languages.
Note that `INDENT` indicates an **increase** in indentation, not just the existence of leading whitespace, which is why in the example above we used it only at the beginning of the block, not before every `Statement`.
Additionally, whitespace has to be split into two terminals, `[\t ]+` for `WS` and `[\r\n]+` for `NL`, instead of a single `\s+`, because `\s+` would match the new line character together with any indentation that follows it. To ensure correct behavior, the token builder modifies the pattern of the `whitespaceTokenName` token to be `[\t ]+`, so a separate hidden token for new lines needs to be explicitly defined.
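
The difference between the patterns can be checked with plain JavaScript regular expressions, independently of Langium. In this small sketch, `matchAt` is a hypothetical helper that mimics a lexer matching a pattern at a fixed offset by using the sticky flag.

```typescript
// Hypothetical helper: match `pattern` exactly at `offset`, as a lexer would.
function matchAt(pattern: RegExp, text: string, offset: number): string | undefined {
    const sticky = new RegExp(pattern.source, 'y');
    sticky.lastIndex = offset;
    return sticky.exec(text)?.[0];
}

const sample = 'if true:\n    return false';
const lineEnd = sample.indexOf('\n');

// `\s+` swallows the new line and the indentation of the next line in one token.
const greedy = matchAt(/\s+/, sample, lineEnd);
// With separate patterns, the new line and the indentation are matched on their own.
const newLine = matchAt(/[\r\n]+/, sample, lineEnd);
const indentation = matchAt(/[\t ]+/, sample, lineEnd + 1);
```

Here `greedy` contains the new line followed by the four spaces, so no leading whitespace would be left for an indentation matcher to inspect, while `newLine` and `indentation` keep the two apart.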

The content you choose for these 3 terminals doesn't matter since it will be overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose patterns that don't overlap with other terminals for easier use in the playground.

With the default configuration and the grammar above, for the following code sample:
```
if true:
    return false
else:
    if true:
        return true
```

the lexer will output the following sequence of tokens (hidden `WS` and `NL` tokens omitted): `if`, `BOOLEAN`, `:`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `:`, `INDENT`, `if`, `BOOLEAN`, `:`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`.
Contributor

The with capital T

Member Author

It was intended as a continuation of the sentence before it, only interrupted by the code snippet. Not sure if it makes sense or if it counts as a separate sentence 🤔

Contributor

Then I would suggest adding 3 dots at the end of the first phrase and at the beginning of the second phrase.
