Recipe: indentation-sensitive languages #246

aabounegm · 2024-07-18T12:20:06Z

This is a guide on how to use the new IndentationAwareTokenBuilder, added to Langium in eclipse-langium/langium#1578 (published in v3.2.0).

github-actions · 2024-07-18T12:22:18Z

PR Preview Action v1.4.6
🚀 Deployed preview to https://eclipse-langium.github.io/langium-previews/pr-previews/pr-246/
on branch `previews` at 2024-08-29 14:46 UTC

Lotes

Really nice recipe. I have some comments about the insights into the implementation. I think we should add links to, or more detailed information on, how this token builder/lexer works.

hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md

aabounegm · 2024-08-07T11:32:22Z

Thanks for your comments! I will address them as soon as a decision is reached on eclipse-langium/langium#1608, so that I can edit the last section as well.

aabounegm · 2024-08-25T20:13:07Z

@Lotes Thanks for waiting so long; all comments should be addressed now.
Note that the last commit adds a subsection that depends on eclipse-langium/langium#1647 getting merged first.

Lotes

Just some minor things and one big question. Most of my concerns have already been resolved. Thanks.

hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md

Lotes · 2024-08-29T12:06:28Z

hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md

+        return true
+```
+
+the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `INDENT`, `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`.


"The" with a capital T.

It was intended as a continuation of the sentence before it, only interrupted by the code snippet. Not sure if it makes sense or if it counts as a separate sentence 🤔

Then I would suggest adding three dots at the end of the first phrase and at the beginning of the second phrase.

hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md

Lotes · 2024-08-29T12:14:14Z

hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md

@@ -0,0 +1,135 @@
+---


Question about indentation in the implementation: how do you distinguish between spaces and tabs?
This could be an interesting configuration point to show here, maybe in its own section or as an appendix.
How can this be aligned with an EditorConfig setup (see https://editorconfig.org) or with other approaches?

I guess you choose only spaces or only tabs for the WS token, right?

That's a good question!
I thought a lot about the best approach here, and in the end decided not to discriminate between them, which is the simpler way. The alternatives were to allow only one or the other through a config parameter, or to treat a tab as n spaces (again, for a configurable n). I found these two alternatives a bit too strict (though that's how Python behaves, for example, by prohibiting mixing them). Ideally I would issue a warning, but I couldn't find a way to accept a token and still issue a warning/error.
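To make the three options concrete, here is a minimal, self-contained sketch of how the indentation width of a line could be measured under each of them. It is not the code from `IndentationAwareTokenBuilder`; `IndentPolicy`, `IndentMeasure`, and `measureIndent` are hypothetical names invented for this illustration.

```typescript
// Hypothetical sketch, not Langium API: three ways to measure leading whitespace.
type IndentPolicy =
    | { kind: 'count-chars' }                      // tabs and spaces both count as 1
    | { kind: 'only'; char: ' ' | '\t' }           // the other character is rejected
    | { kind: 'tab-as-spaces'; tabSize: number };  // a tab counts as `tabSize` spaces

interface IndentMeasure {
    width: number;   // indentation level used for INDENT/DEDENT comparisons
    mixed: boolean;  // true if the leading whitespace contains both tabs and spaces
    error?: string;  // set only when the policy rejects the line
}

function measureIndent(line: string, policy: IndentPolicy): IndentMeasure {
    // The pattern also matches the empty string, so exec never returns null here.
    const leading = /^[ \t]*/.exec(line)![0];
    const mixed = leading.includes('\t') && leading.includes(' ');

    if (policy.kind === 'count-chars') {
        // No discrimination: every whitespace character contributes 1.
        return { width: leading.length, mixed };
    }
    if (policy.kind === 'only') {
        const forbidden = policy.char === ' ' ? '\t' : ' ';
        if (leading.includes(forbidden)) {
            const name = forbidden === '\t' ? 'tabs' : 'spaces';
            return { width: leading.length, mixed, error: `Indentation must not contain ${name}` };
        }
        return { width: leading.length, mixed };
    }
    // 'tab-as-spaces': expand each tab to a fixed number of columns.
    let width = 0;
    for (const ch of leading) {
        width += ch === '\t' ? policy.tabSize : 1;
    }
    return { width, mixed };
}
```

With the first policy, a line indented by one tab and a line indented by one space end up at the same level, which is the simplicity trade-off described above. The `mixed` flag is the piece of information a warning would need.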

Actually, now that I think about it, I could add some payload to the returned token and then in the lexer check for the payload and add to the errors array, but then there would still be no way of making it a warning rather than an error. Perhaps LexerResult should be augmented to allow warnings?
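As a rough illustration of that idea, the sketch below attaches a payload to tokens and turns it into severity-tagged diagnostics in a post-processing step. All types and names here (`SketchToken`, `SketchDiagnostic`, `SketchLexerResult`, `collectIndentationWarnings`) are hypothetical and deliberately do not mirror Langium's actual `LexerResult`; they only show what a result carrying warnings next to errors could look like.

```typescript
// Hypothetical shapes for illustration only; these are not Langium types.
type Severity = 'error' | 'warning';

interface SketchToken {
    image: string;
    line: number;
    // Payload a token builder could attach when it sees tabs and spaces mixed.
    payload?: { mixedIndentation: boolean };
}

interface SketchDiagnostic {
    severity: Severity;
    message: string;
    line: number;
}

interface SketchLexerResult {
    tokens: SketchToken[];
    diagnostics: SketchDiagnostic[];
}

// Accept every token, but report a warning for each flagged one.
function collectIndentationWarnings(tokens: SketchToken[]): SketchLexerResult {
    const diagnostics: SketchDiagnostic[] = [];
    for (const token of tokens) {
        if (token.payload?.mixedIndentation) {
            diagnostics.push({
                severity: 'warning',
                message: 'Indentation mixes tabs and spaces',
                line: token.line
            });
        }
    }
    return { tokens, diagnostics };
}
```

The point of the separate `severity` field is that the token stream stays intact while the consumer decides how to present each diagnostic.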

I think resolution of my question would block this recipe. I mean we can still change something afterwards. Extending the LexerResult sounds like too much for this change.

Another question could be: how would one write an indentation-aware formatter? Is it even applicable or doable? How is it done for Python?
We do not have to answer this now. I was just interested in possible consequences or follow-up tasks.

I do not yet have experience writing formatters in Langium, but I don't see why it would be difficult to do. Generally, there are two approaches: formatting and pretty printing. One way to implement a formatter is to search for some (anti-)patterns in the code and issue TextEdits just for them. Pretty printers normally take the AST/CST (or some other intermediate representation) and transform it back into code, regardless of how the code looked before parsing (or at least that's how I understand the difference between them).
Both approaches seem possible with the indentation-aware tokens, though the second one (pretty printer) is probably easier to implement, assuming we want the formatter to ensure consistent indentation characters and sizes.
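A minimal sketch of the pretty-printing approach, assuming a toy tree in which a statement is either a plain line or a block with a header and a nested body. The `Stmt` and `Block` types and the `prettyPrint` function are hypothetical and unrelated to Langium's AST or formatter API; the sketch only shows why the original indentation characters stop mattering once the code is regenerated from the tree.

```typescript
// Hypothetical toy tree, not a Langium AST.
type Stmt = string | Block;

interface Block {
    header: string;  // e.g. "if true" or "else"
    body: Stmt[];    // statements nested one level deeper
}

// Regenerate source text from the tree with one consistent indentation unit.
function prettyPrint(stmts: Stmt[], indentUnit: string = '    ', depth: number = 0): string {
    const lines: string[] = [];
    const prefix = indentUnit.repeat(depth);
    for (const stmt of stmts) {
        if (typeof stmt === 'string') {
            lines.push(prefix + stmt);
        } else {
            lines.push(prefix + stmt.header);
            if (stmt.body.length > 0) {
                lines.push(prettyPrint(stmt.body, indentUnit, depth + 1));
            }
        }
    }
    return lines.join('\n');
}

// Example tree; the source it was parsed from may have mixed tabs and spaces.
const program: Stmt[] = [
    { header: 'if true', body: ['return false'] },
    { header: 'else', body: [{ header: 'if true', body: ['return true'] }] }
];

console.log(prettyPrint(program));
```

Because the output depends only on nesting depth, choosing tabs or a different width is a matter of passing another `indentUnit`.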

For Python, one of the most popular formatters is Black, and it uses the pretty-printing approach. Not sure how other formatters handle inconsistent indentation, tbh.

Lotes

I have nothing more to add right now. Let's wait for a second opinion on your recipe :-) .


aabounegm · 2024-09-23T13:17:00Z

@Lotes Langium v3.2 has already been published with the indentation-aware token builder, and questions about its usage have already started coming in (eclipse-langium/langium#1696). Are we waiting for another review, or can this recipe be merged?

Add a guide for indentation-sensitive language

e7f56cc

aabounegm temporarily deployed to pull-request-preview July 18, 2024 12:21 — with GitHub Actions Inactive

Lotes reviewed Aug 7, 2024


Merge branch 'eclipse-langium:main' into main

f002a86

aabounegm temporarily deployed to pull-request-preview August 22, 2024 14:28 — with GitHub Actions Inactive

aabounegm added 6 commits August 22, 2024 14:36

Add links to the TokenBuilder & Lexer

a293fc4

Add a short explanation on how the solution works

2395760

Remove playground compatibility section

566f5b3

Clarify why WS is split into 2 tokens

92b72d3

Add an example snippet

518844f

Document the ignoreIndentationDelimiters option

b6cf6e2

aabounegm temporarily deployed to pull-request-preview August 25, 2024 20:12 — with GitHub Actions Inactive

Lotes reviewed Aug 29, 2024


Remove extranneous "is"

d51e2e0

aabounegm requested a review from Lotes August 29, 2024 14:45

aabounegm deployed to pull-request-preview August 29, 2024 14:46 — with GitHub Actions View deployment

Lotes reviewed Aug 30, 2024


ym-han mentioned this pull request Sep 14, 2024

[Optional for first demo] Langium grammar: Convert braces to indentation; adjust rest of grammar accordingly to account for indentation sensitivity smucclaw/lam4#89

Open

Lotes requested a review from msujew October 1, 2024 12:00
