Add a new grammar renderer #1787

Open
wants to merge 31 commits into base: master


ehuss
Contributor

@ehuss ehuss commented Apr 10, 2025

This introduces a new grammar renderer. Instead of trying to write the grammar in markdown/html hybrid, this introduces a new syntax that is parsed by the mdbook-spec plugin. This grammar is then converted into markdown/html hybrid, and also to railroad diagrams.

There are a lot of changes here (and some can be split into separate PRs if desired). A general overview of what to see here:

  • Grammar rules are now written inside a code block (instead of a blockquote). The syntax is pretty similar to the old syntax with various small changes. See the docs/grammar.md file for a complete description.
  • There is now a summary chapter which shows the entire grammar all on one page.
  • The grammar is parsed by mdbook-spec/src/grammar/parser.rs into an internal representation.
  • The internal representation is converted to markdown in mdbook-spec/src/grammar/render_markdown.rs, and railroad diagrams in mdbook-spec/src/grammar/render_railroad.rs.
    • The railroad diagrams are generated using the railroad crate.
    • There is a toggle button to show/hide the railroad diagrams. It uses localStorage to keep that state sticky.
  • The basic definitions and driver for the mdbook plugin are in mdbook-spec/src/grammar.rs. There are several pieces here:
    • The internal representation.
    • Code to load the grammar from the code blocks inside the chapters.
    • Some validation.
    • Code that will replace the code block with the rendered output.
    • Code for handling the summary chapter.
  • All nonterminals are now linked to the rule definition.
  • The text may now link to grammar rules by just putting them in brackets like [FunctionParameters]. Link definitions are automatically added to every page.
  • Some rules were added or changed to accommodate the new renderer. I think all changes are put into separate commits to help with reviewing.
  • Various misc fixes, see the individual commits.
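
The parse, represent, render pipeline described in the list above can be sketched in a few lines. This is a minimal illustration only, not the code in mdbook-spec: the `Expr` and `Production` types, the `render_production` function, and the `#grammar-Name` anchor scheme are all hypothetical names chosen for the example. It shows an internal representation being rendered to markdown, with every nonterminal turned into a `[Name]` link and a matching link definition appended.

```rust
use std::collections::BTreeSet;

/// Hypothetical internal representation of a grammar expression.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Terminal(String),
    Nonterminal(String),
    Sequence(Vec<Expr>),
    Alt(Vec<Expr>),
    Optional(Box<Expr>),
    ZeroOrMore(Box<Expr>),
    OneOrMore(Box<Expr>),
}

/// A named rule such as `StructFields`.
pub struct Production {
    pub name: String,
    pub expr: Expr,
}

/// Renders an expression as markdown and records each nonterminal it links to.
fn render_expr(expr: &Expr, links: &mut BTreeSet<String>) -> String {
    match expr {
        Expr::Terminal(text) => format!("`{text}`"),
        Expr::Nonterminal(name) => {
            links.insert(name.clone());
            format!("[{name}]")
        }
        Expr::Sequence(items) => items
            .iter()
            .map(|e| render_atom(e, links))
            .collect::<Vec<_>>()
            .join(" "),
        Expr::Alt(items) => items
            .iter()
            .map(|e| render_expr(e, links))
            .collect::<Vec<_>>()
            .join(" | "),
        Expr::Optional(inner) => format!("{}?", render_atom(inner, links)),
        Expr::ZeroOrMore(inner) => format!("{}*", render_atom(inner, links)),
        Expr::OneOrMore(inner) => format!("{}+", render_atom(inner, links)),
    }
}

/// Like `render_expr`, but parenthesizes compound expressions.
fn render_atom(expr: &Expr, links: &mut BTreeSet<String>) -> String {
    match expr {
        Expr::Sequence(_) | Expr::Alt(_) => format!("({})", render_expr(expr, links)),
        _ => render_expr(expr, links),
    }
}

/// Renders one rule, followed by link definitions for the nonterminals it uses.
/// The `#grammar-Name` anchor format is an assumption made for this sketch.
pub fn render_production(p: &Production) -> String {
    let mut links = BTreeSet::new();
    let body = render_expr(&p.expr, &mut links);
    let mut out = format!("{} → {}\n", p.name, body);
    for name in &links {
        out.push_str(&format!("\n[{name}]: #grammar-{name}"));
    }
    out
}
```

Collecting the link targets while rendering keeps the two in sync: a definition is emitted only for names that actually appear in the rule.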

I'd like to thank @lukaslueg for creating the railroad library which made this possible.

Closes #221
Closes #398
Closes #596
Closes #1513
Closes #1677

ehuss added 18 commits April 10, 2025 14:31
Just fixing some small consistency and spacing mistakes.
This rule was misnamed, colliding with the existing CfgAttrAttribute.
This renames IsolatedCR to CR. I felt like it wasn't exactly necessary
since we have rewritten things so that it is clear that there is an
input transformation which resolves this (`input.crlf`). We also never
really defined what it meant.

I also felt like there was room for confusion. For example, an input
containing `CR CR LF LF` would get normalized to `CR LF`. The `CR` there
is not isolated.
This removes all backslash escaped characters. This helps to avoid
confusing similarities with a literal backslash followed by a character
versus the interpreted escaped character.
I don't exactly know why this was placed there, but we operate under the
assumption that all lexical characters immediately follow one another.
This introduces a new terminal kind that I'm calling a "prose" which
describes what the terminal is. This is inspired by the IETF format
which uses angle brackets to describe terminals in English.
The grammar almost always uses lowercase, so let's standardize on that.
This helps to standardize how suffixes are written. Normally they do not
use parentheses, and visually I don't think they are entirely necessary.
These two nonterminals were using the wrong name for the productions for
BlockExpression and LiteralExpression.
This changes the keyword listings so that they are just lists instead of
lexer rules. We never used the named rules, and I don't foresee us ever
doing that. Also, the `IDENTIFIER_OR_KEYWORD` rule meant that we never
needed to explicitly identify these keywords as lexer tokens.

This helps avoid problems when building the grammar graph for missing
connections.
Per our style, edition differences are supposed to be separated out into
an edition block.
These were defined in prose below, but defining them here allows us to
easily refer and link to them.
This is intended to help define what a "token" is via the grammar (and
to fill a missing hole in our token definition).

I waffled on how to define delimiters, whether they should be separate
somehow. In practice I think it should be fine to clump them all
together. This mainly only matters for TokenTree which already excludes
the delimiters.
This adds a grammar rule that collects all the reserved token forms into
a single production rule so that we can define what a "token" is by
referring to this.
This defines a Token in the grammar so that we can easily refer to it
(and to make it easier to see what all the tokens are).
We no longer represent characters via escape sequences. These can be
confused with the literal two bytes of backslash followed by a
character. See the "common productions" list for how these are now
referred to.
@rustbot rustbot added the S-waiting-on-review Status: The marked PR is awaiting review from a maintainer label Apr 10, 2025
@ehuss ehuss force-pushed the railroad-grammar branch from a8be867 to 4a8e44f Compare April 11, 2025 01:23
@lukaslueg

railroad upstream here.

The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.

AFAICS there are cases where the grammar diverges from its graphical representation with respect to repeated elements. In the two examples below, the diagram only allows for at least two consecutive Statement (... two consecutive Expression), while the grammar requires "at least one".

[Screenshot 2025-04-11 at 14:15:55]
[Screenshot 2025-04-11 at 14:16:43]
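
The "at least two" versus "at least one" mismatch can be checked with a toy model of diagram semantics. This is a sketch under stated assumptions, not the railroad crate's API: the `Node` enum, `ends`, `accepts`, `at_least_two`, and `one_or_more` are hypothetical, and the model assumes that a repeat passes through `inner` once and may then loop via `back` before passing `inner` again. The shape built by `at_least_two` is only one plausible way such a diagram could arise.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for diagram primitives (not types from the `railroad` crate).
enum Node {
    Empty,
    Term(&'static str),
    Seq(Vec<Node>),
    /// Pass `inner` once; each extra lap passes `back`, then `inner` again.
    Repeat { inner: Box<Node>, back: Box<Node> },
}

/// Every input position reachable after walking `node` starting at `pos`.
fn ends(node: &Node, input: &[&str], pos: usize) -> BTreeSet<usize> {
    match node {
        Node::Empty => BTreeSet::from([pos]),
        Node::Term(t) => match input.get(pos) {
            Some(tok) if tok == t => BTreeSet::from([pos + 1]),
            _ => BTreeSet::new(),
        },
        Node::Seq(items) => {
            let mut current = BTreeSet::from([pos]);
            for item in items {
                current = current.iter().flat_map(|&p| ends(item, input, p)).collect();
            }
            current
        }
        Node::Repeat { inner, back } => {
            // Worklist over reachable positions; `seen` guarantees termination.
            let mut seen = ends(inner, input, pos);
            let mut work: Vec<usize> = seen.iter().copied().collect();
            while let Some(p) = work.pop() {
                for q in ends(back, input, p) {
                    for r in ends(inner, input, q) {
                        if seen.insert(r) {
                            work.push(r);
                        }
                    }
                }
            }
            seen
        }
    }
}

/// True if the diagram can consume the whole input.
fn accepts(node: &Node, input: &[&str]) -> bool {
    ends(node, input, 0).contains(&input.len())
}

/// An element placed in front of a loop that already requires one pass.
fn at_least_two(item: &'static str) -> Node {
    Node::Seq(vec![Node::Term(item), one_or_more(item)])
}

/// The loop alone already encodes "one or more".
fn one_or_more(item: &'static str) -> Node {
    Node::Repeat { inner: Box::new(Node::Term(item)), back: Box::new(Node::Empty) }
}
```

In this model `at_least_two("Statement")` rejects a single `Statement`, while `one_or_more("Statement")` accepts it, which matches the divergence described above.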

@ia0

ia0 commented Apr 11, 2025

> The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.

From the live demo I can see that *-repeated elements are in theory printed like this:

   .------->-------.
   |               |
->-+-+--[ foo ]--+-+--->-
     |           |
     '-----<-----'

What about doing it like this?

->-+------>------+->-
   |             |
   '-<-[ foo ]-<-'

This uses one less path and can be concatenated with a previous foo for +-repeat:

->-[ foo ]->-+------>------+->-
             |             |
             '-<-[ foo ]-<-'

The main problem is that foo is somehow to be read backwards, which may confuse people at first.

@lukaslueg

> The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.
>
> From the live demo I can see that *-repeated elements are in theory printed like this:
>
>    .------->-------.
>    |               |
> ->-+-+--[ foo ]--+-+--->-
>      |           |
>      '-----<-----'
>
> What about doing it like this?
>
> ->-+------>------+->-
>    |             |
>    '-<-[ foo ]-<-'
>
> This uses one less path and can be concatenated with a previous foo for +-repeat:
>
> ->-[ foo ]->-+------>------+->-
>              |             |
>              '-<-[ foo ]-<-'
>
> The main problem is that foo is somehow to be read backwards, which may confuse people at first.

With respect to *-Elements, both examples (1 and 2) are technically valid. See the Zero or more table-constraints block in the create-table-stmt example, which demonstrates the second case.
Also see the One or more column-definitions-example, which should cover the +-Element case in example 3.

It's possible to implement dyn Node downstream to build more specialized primitives for certain situations. For instance, one might want to cook up a graphical representation for the "any character except ..."-case. Upstream might also provide them, if the need arises.

@ia0

ia0 commented Apr 11, 2025

I see, so that's already supported and just a matter of generating the proper diagram downstream.
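
The loop shapes discussed above can be made concrete by listing the tokens read when the return path is taken a given number of times. This is a toy model with hypothetical names (`Loop`, `walk`, `zero_or_more`, `one_or_more`, `separated`), not the railroad crate's types. It assumes a loop has a forward path and a return path, and that each lap reads the return path followed by the forward path again.

```rust
/// Toy model of a diagram loop (not a type from the `railroad` crate).
struct Loop {
    /// Tokens on the forward path.
    forward: Vec<&'static str>,
    /// Tokens on the return path, in the order they are read while going back.
    back: Vec<&'static str>,
}

/// `item*`: nothing on the forward path, the element on the return path.
fn zero_or_more(item: &'static str) -> Loop {
    Loop { forward: vec![], back: vec![item] }
}

/// `item+`: the element on the forward path, an empty return path.
fn one_or_more(item: &'static str) -> Loop {
    Loop { forward: vec![item], back: vec![] }
}

/// `item (sep item)*`: the separator sits on the return path.
fn separated(item: &'static str, sep: &'static str) -> Loop {
    Loop { forward: vec![item], back: vec![sep] }
}

/// Tokens read when the return path is taken `laps` times.
fn walk(l: &Loop, laps: usize) -> Vec<&'static str> {
    let mut out = l.forward.clone();
    for _ in 0..laps {
        out.extend(l.back.iter().copied());
        out.extend(l.forward.iter().copied());
    }
    out
}
```

For example, `walk(&separated("Param", ","), 2)` reads `Param , Param , Param`, and `walk(&zero_or_more("foo"), 0)` reads nothing, so the same primitive covers `*`, `+`, and separated lists depending on which path holds the element.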

ehuss added 6 commits April 11, 2025 08:53
This adds an extension to mdbook-spec that will parse code blocks in a
BNF-style grammar into a rendered format, both as markdown and as
railroad diagrams.
This adds the hooks to toggle the visibility of the railroad grammar.
The status is stored in localstorage to keep it sticky.
This fixes it so that rule links work correctly if there is more than
one space in a reference definition.
Not sure how this got missed.
@ehuss ehuss force-pushed the railroad-grammar branch from 4a8e44f to b2c1d6a Compare April 11, 2025 15:53
@ehuss
Contributor Author

ehuss commented Apr 11, 2025

Oops, thanks! I messed up all the repeats.

I'm a bit uncertain on how to handle the non-greedy ones, but I think a label works ok for now. Same with the a..b repeat range, and I went with a comment for that, which I think works ok?

@traviscross
Contributor

traviscross commented Apr 11, 2025

> For instance, one might want to cook up a graphical representation for the "any character except ..."-case. Upstream might also provide them, if the need arises.

@lukaslueg: I'm curious if you have thoughts about what would make a good graphical representation for this. As I'm sure you saw, we use this pattern a lot, and I had found myself wondering about whether there might be a good visual way to encode this.

@lukaslueg

Some comments and suggestions, which I hope you find constructive:

  • A LabeledBox can use any primitive as the label, including a Sequence, not just plain text. See example 12_select-stmt_diagram.txt in railroad_dsl. This might be useful in situations where a Terminal or NonTerminal is referenced, as in "except \0 or \x00" and "except $ and delimiters" (where "delimiters" is a NonTerminal); yes, the NonTerminal in the Comment can have a clickable Link :-) railroad_dsl's syntax allows access to that, and the syntax here might as well.
    [Screenshot 2025-04-11 at 18:22:41]
    [Screenshot 2025-04-11 at 18:24:17]
  • Common pre- and suffixes in diverging arms (Choice) can be combined, shortening the Choice. One can use mathematical rigor to compute that, yet I found that real-world scenarios benefit from less strictness; there is a rather ridiculously complex implementation for this in macro_railroad here (unless you can control this manually here?!).
    [Screenshot 2025-04-11 at 18:28:25]
  • There are some mistakes with respect to repeating elements in b2c1d6ada. Notice that reading direction reverses for the combining element (right to left, since we are going backwards within the syntax). For instance, the syntax for StructFields allows for "StructField StructField , ," (struct foobar { foo: i32 bar: i32, , }). Also notice that WhereClause allows for where , T: std::fmt::Debug U: std::fmt::Debug (both commas wrong). Unless you already did, you might want to have a look at the examples in railroad_dsl, which (afaics) cover those cases here (*/+/?).
    [Screenshot 2025-04-11 at 18:30:45]
    [Screenshot 2025-04-11 at 18:32:37]
  • One can create a dyn Node that wraps a LabeledBox, where the wrapping type pre-sets the label and adds another CSS class in order to visualize non-greedy regions (think "green box").
  • With respect to "Anything except" (@traviscross) my first inclination would be a LabeledBox with custom CSS (think "red box"). Remember that the label can be a dyn Node itself, so the exceptions to the rule can at least have some unicode label (⚠) automatically attached to them.
  • "Anything except" is actually just a case of "early parse error". There is no primitive for that, as we would also have to disambiguate diverging arms if anything in the diagram could lead to a parse error. AFAICS this is not a good solution.

@traviscross

This comment was marked as resolved.

@lukaslueg

lukaslueg commented Apr 11, 2025

The problem with "except" is that it's not immediately clear whether the except-element is an immediate parse error or a delayed parse error, in case both are possible. E.g. if Foo X is invalid but Foo XY is not, the graphical (stateless!) representation becomes murky and hard to follow. Following the diagram does not require knowing where we are coming from // no memory.

As suggested above, imho a comment with NonTerminal/Terminal inside them - spiced up with some css - probably suffices, as there probably are cases which are either complex or verbose, and require clear text anyway ("except \u{0}, \u{00}, …, \u{000000}").

These should definitely get flipped around: Whatever "any character" is should be a NonTerminal/Terminal, and the exception (Terminal LF) should be part of the comment. If we had a dyn Node for "except" ("yellow box" - red is error) it would clearly indicate that there is an exception to what the NonTerminal can match.

[Screenshot 2025-04-11 at 19:46:23]

@traviscross
Contributor

Yes, makes sense.

@lukaslueg

See here for a quick and dirty example of what an except node might look like.

[Screenshot 2025-04-11 at 21:00:53]

@traviscross
Contributor

That's quite nice, yes. That way, the "except for" nodes aren't in the main railroad, which is what makes the other ways awkward.

ehuss added 6 commits April 13, 2025 10:36
I didn't want to try to add an unused grammar rule here, so just point
to the notation chapter which has an example.
This updates some of the description for the new grammar renderer.
This just ends up with lots of duplicates in the search results which I
don't find particularly helpful.
Railroad renders these in reverse order, but our grammar isn't written
that way.
These are the suggestions from lukaslueg.
@ehuss ehuss force-pushed the railroad-grammar branch from b2c1d6a to 1214d68 Compare April 13, 2025 17:37
@ehuss
Contributor Author

ehuss commented Apr 13, 2025

Thanks @lukaslueg! I went ahead and added your suggestion for the Except clauses.

To fix the repeat expressions, I had to essentially reverse the Sequence blocks inside a repeat. I couldn't find any other way to get around the HDir invert.
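
A rough sketch of that reversal, under the assumption that the return path of a repeat is laid out left to right but read right to left. The `Node` enum and both functions are hypothetical and much simpler than the actual renderer, which has more node kinds; those would recurse in the same way.

```rust
/// Toy node type for this sketch (not a type from the `railroad` crate).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Term(&'static str),
    Seq(Vec<Node>),
}

/// Reverses the order of every sequence in `node`, recursively.
fn reversed(node: &Node) -> Node {
    match node {
        Node::Term(t) => Node::Term(*t),
        Node::Seq(items) => Node::Seq(items.iter().rev().map(reversed).collect()),
    }
}

/// Terminals in the order a reader meets them when moving right to left
/// across a node that was laid out left to right.
fn read_right_to_left(node: &Node) -> Vec<&'static str> {
    match node {
        Node::Term(t) => vec![*t],
        Node::Seq(items) => items.iter().rev().flat_map(read_right_to_left).collect(),
    }
}
```

For a rule shaped like `A (B C A)*`, the return path holds the sequence `B C`. Stored as written, a right-to-left reader meets `C` first; stored as `reversed(..)`, the reader meets `B` then `C`, which matches the grammar.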

As for some of the other suggestions here, such as rendering nodes inside comments, those sound like good things to try in the future. That in particular might be challenging because the content is written in markdown.

Labels
S-waiting-on-review Status: The marked PR is awaiting review from a maintainer
Projects
None yet
5 participants