Add a new grammar renderer #1787

ehuss · 2025-04-10T21:51:01Z

This introduces a new grammar renderer. Instead of trying to write the grammar in markdown/html hybrid, this introduces a new syntax that is parsed by the mdbook-spec plugin. This grammar is then converted into markdown/html hybrid, and also to railroad diagrams.

There are a lot of changes here (and some can be split into separate PRs if desired). A general overview of what to see here:

- Grammar rules are now written inside a code block (instead of a blockquote). The syntax is pretty similar to the old syntax with various small changes. See the `docs/grammar.md` file for a complete description.
- There is now a summary chapter which shows the entire grammar on one page.
- The grammar is parsed by `mdbook-spec/src/grammar/parser.rs` into an internal representation.
- The internal representation is converted to markdown in `mdbook-spec/src/grammar/render_markdown.rs`, and to railroad diagrams in `mdbook-spec/src/grammar/render_railroad.rs`.
  - The railroad diagrams are generated using the railroad crate.
  - There is a toggle button to show/hide the railroad diagrams. It uses localStorage to keep that state sticky.
- The basic definitions and driver in the mdbook plugin are in `mdbook-spec/src/grammar.rs`. There are several pieces here:
  - The internal representation.
  - Code to load the grammar from the code blocks inside the chapters.
  - Some validation.
  - Code that will replace the code block with the rendered output.
  - Code for handling the summary chapter.
- All nonterminals are now linked to the rule definition.
- The text may now link to grammar rules by just putting them in brackets like `[FunctionParameters]`. Link definitions are automatically added to every page.
- Some rules were added or changed to accommodate the new renderer. I think all changes are put into separate commits to help with reviewing.
- Various misc fixes; see the individual commits.
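
To make the parse-then-render split above concrete, here is a minimal sketch of what an internal representation and a markdown renderer could look like. The `Expr` enum and `render_markdown` function are hypothetical stand-ins for illustration, not the actual types in `mdbook-spec`; the sketch only assumes that nonterminals are emitted as bracketed names so that the automatically added link definitions can resolve them.

```rust
/// Hypothetical, simplified stand-in for the internal representation.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// A reference to another rule, e.g. `FunctionParameters`.
    Nonterminal(String),
    /// A literal terminal, e.g. `fn`.
    Terminal(String),
    /// `a b c`
    Sequence(Vec<Expr>),
    /// `a | b`
    Alternation(Vec<Expr>),
    /// `e?`
    Optional(Box<Expr>),
    /// `e*`
    ZeroOrMore(Box<Expr>),
}

/// Renders an expression as a markdown fragment in which every
/// nonterminal is a bracketed reference to its rule.
pub fn render_markdown(e: &Expr) -> String {
    match e {
        Expr::Nonterminal(name) => format!("[{name}]"),
        Expr::Terminal(text) => format!("`{text}`"),
        Expr::Sequence(items) => items
            .iter()
            // An alternation inside a sequence needs parentheses.
            .map(|i| match i {
                Expr::Alternation(_) => group(i),
                _ => render_markdown(i),
            })
            .collect::<Vec<_>>()
            .join(" "),
        Expr::Alternation(items) => items
            .iter()
            .map(render_markdown)
            .collect::<Vec<_>>()
            .join(" | "),
        Expr::Optional(inner) => format!("{}?", suffix_operand(inner)),
        Expr::ZeroOrMore(inner) => format!("{}*", suffix_operand(inner)),
    }
}

/// Wraps an expression in parentheses.
fn group(e: &Expr) -> String {
    format!("( {} )", render_markdown(e))
}

/// Parenthesizes compound expressions so a suffix applies to the whole group.
fn suffix_operand(e: &Expr) -> String {
    match e {
        Expr::Sequence(_) | Expr::Alternation(_) => group(e),
        _ => render_markdown(e),
    }
}
```

A railroad renderer would walk the same tree and build diagram nodes instead of strings, which is why keeping one shared representation is convenient.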

I'd like to thank @lukaslueg for creating the railroad library which made this possible.

Closes #221
Closes #398
Closes #596
Closes #1513
Closes #1677

This renames IsolatedCR to CR. I felt like it wasn't exactly necessary since we have rewritten things so that it is clear that there is an input transformation which resolves this (`input.crlf`). We also never really defined what it meant. I also felt like there was room for confusion. For example, an input containing `CR CR LF LF` would get normalized to `CR LF`. The `CR` there is not isolated.
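
As a small illustration of why "isolated" is a slippery term here, the sketch below assumes that the `input.crlf` transformation replaces each `CR LF` pair with a single `LF` in one left-to-right pass (that reading of the rule, and the function name `normalize_crlf`, are assumptions for illustration).

```rust
/// Replaces each CR LF pair with a single LF in one left-to-right pass.
/// Hypothetical helper; it only models the assumed `input.crlf` behavior.
pub fn normalize_crlf(input: &str) -> String {
    // `str::replace` scans once and does not re-examine its own output,
    // so a CR that becomes adjacent to an LF afterwards is left alone.
    input.replace("\r\n", "\n")
}
```

Under that assumption, feeding `CR CR LF LF` through `normalize_crlf` yields a string that still contains a `CR` immediately followed by an `LF`, so the surviving `CR` is not isolated.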

This changes the keyword listings so that they are just lists instead of lexer rules. We never used the named rules, and I don't foresee us ever doing that. Also, the `IDENTIFIER_OR_KEYWORD` rule meant that we never needed to explicitly identify these keywords as lexer tokens. This helps avoid problems when building the grammar graph for missing connections.

This is intended to help define what a "token" is via the grammar (and to fill a hole in our token definition). I waffled on how to define delimiters and whether they should be separate somehow. In practice I think it should be fine to clump them all together. This mainly matters only for TokenTree, which already excludes the delimiters.

lukaslueg · 2025-04-11T12:30:10Z

railroad upstream here.

The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.

AFAICS there are cases where the grammar diverges from its graphical representation with respect to repeated elements. In the two examples below, the diagram only allows for at least two consecutive Statement elements (or two consecutive Expression elements), while the grammar requires "at least one".

ia0 · 2025-04-11T13:02:02Z

> The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.

From the live demo I can see that *-repeated elements are in theory printed like this:

   .------->-------.
   |               |
->-+-+--[ foo ]--+-+--->-
     |           |
     '-----<-----'

What about doing it like this?

->-+------>------+->-
   |             |
   '-<-[ foo ]-<-'

This uses one less path and can be concatenated with a previous foo for +-repeat:

->-[ foo ]->-+------>------+->-
             |             |
             '-<-[ foo ]-<-'

The main drawback is that foo then has to be read backwards, which may confuse people at first.

lukaslueg · 2025-04-11T13:20:58Z

> The railroad codebase hasn't seen a lot of love with respect to graphical layout. Suggestions are welcome.
>
> From the live demo I can see that *-repeated elements are in theory printed like this:
>
>    .------->-------.
>    |               |
> ->-+-+--[ foo ]--+-+--->-
>      |           |
>      '-----<-----'
>
> What about doing it like this?
>
> ->-+------>------+->-
>    |             |
>    '-<-[ foo ]-<-'
>
> This uses one less path and can be concatenated with a previous foo for +-repeat:
>
> ->-[ foo ]->-+------>------+->-
>              |             |
>              '-<-[ foo ]-<-'
>
> The main problem is that foo is somehow to be read backwards, which may confuse people at first.

With respect to *-Elements, both examples (1 and 2) are technically valid. See the Zero or more table-constraints block in the create-table-stmt example, which demonstrates the second case.
Also see the One or more column-definitions-example, which should cover the +-Element case in example 3.

It's possible to implement the `Node` trait downstream to build more specialized primitives for certain situations. For instance, one might want to cook up a graphical representation for the "any character except ..." case. Upstream might also provide them if the need arises.

ia0 · 2025-04-11T13:34:16Z

I see, so that's already supported and just a matter of generating the proper diagram downstream.

traviscross · 2025-04-14T11:33:39Z

The conflicting directions one would be resolved by #1787 (comment). The UNICODE_ESCAPE one is actually the correct grammar. E.g.:

fn main() {
    let x: &str = "\u{a____________________________________}";
    println!("_{x}_");
}

Playground link

(On INNER_LINE_DOC, were you meaning to point out some problem by highlighting it?)

lukaslueg · 2025-04-14T12:06:45Z

> The UNICODE_ESCAPE one is actually the correct grammar. E.g.:

👀 I was actually too lazy to check, sorry for the confusion. [The syntax is somewhat hilarious?!]

> (On INNER_LINE_DOC, were you meaning to point out some problem by highlighting it?)

On INNER_LINE_DOC, I was highlighting the fact that the reading direction, graphically indicated by the arrows, is correct (green), while in INNER_BLOCK_DOC the reading direction gets inverted on the [CHAR] branch (red). As a mental image: "two trains would collide head-on in the red sections".

We track the "roots" in our grammar -- those productions that aren't used in any other production. We want to report when a new root appears or when something that's expected to be a root no longer is one. However, we were reporting the latter case as the former instead of reporting it separately as intended. Let's fix that.
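
A minimal sketch of the two separate reports described above, assuming the grammar is available as a map from each production name to the names it references. The `RootDiff` type and `check_roots` function are hypothetical names for illustration, not the code in `mdbook-spec`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Differences between the expected and the actual set of grammar roots.
#[derive(Debug, Default, PartialEq)]
pub struct RootDiff {
    /// Productions that are roots but were not expected to be.
    pub unexpected: BTreeSet<String>,
    /// Productions expected to be roots that no longer are.
    pub missing: BTreeSet<String>,
}

/// `grammar` maps each production name to the names it references.
pub fn check_roots(
    grammar: &BTreeMap<String, Vec<String>>,
    expected: &BTreeSet<String>,
) -> RootDiff {
    // Every name used by some *other* production (self-references are ignored).
    let used: BTreeSet<&String> = grammar
        .iter()
        .flat_map(|(name, refs)| refs.iter().filter(move |r| *r != name))
        .collect();
    // A root is a production that no other production refers to.
    let roots: BTreeSet<String> = grammar
        .keys()
        .filter(|name| !used.contains(name))
        .cloned()
        .collect();
    RootDiff {
        unexpected: roots.difference(expected).cloned().collect(),
        missing: expected.difference(&roots).cloned().collect(),
    }
}
```

Keeping the two sets apart lets a caller emit one message for a newly appearing root and a different one for an expected root that has disappeared.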

There are two ways to render a "zero or more" (i.e. `*`) repeat. One is to put nothing on the main forward line and to put the pattern on the recurrent edge, and the other is to put the pattern on the main forward line and to have an empty recurrent edge and an empty bypass edge. That is, for the latter, we can think of `thing*` as `(thing+)?`.

Doing it that latter way means an additional edge, but it buys us something big in return, which is that it keeps all the patterns going in the forward direction.

Doing it the other way means the patterns have to be reversed so as to put them underneath on that recurrent edge, and it means that readers then have to read them right to left.

Reversing the elements also causes a bug in some diagrams where the lines end up running in opposing directions and so the trains crash into each other. See:

- rust-lang#1787 (comment)

Keeping things in the forward direction avoids this problem.

In this commit, we'll leave in place all the infrastructure for reversing the elements though it is no longer used. We can of course pull this out later.
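
The `thing*` as `(thing+)?` rewrite can be sketched as a small lowering step. The `Node` enum below is a hypothetical, minimal diagram model and is not the `railroad` crate's API; `item_bounds` is only there to check that the lowered form still means "zero or more".

```rust
/// Hypothetical, minimal diagram node; not the `railroad` crate's API.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Item(String),
    /// Pattern on the forward line with an empty recurrent edge: `e+`.
    OneOrMore(Box<Node>),
    /// Pattern on the forward line with an empty bypass edge: `e?`.
    Optional(Box<Node>),
}

/// Lowers `e*` to `(e+)?` so the pattern stays on the forward line.
pub fn zero_or_more(e: Node) -> Node {
    Node::Optional(Box::new(Node::OneOrMore(Box::new(e))))
}

/// Smallest and largest number of items a node can consume
/// (`None` means unbounded).
pub fn item_bounds(n: &Node) -> (u32, Option<u32>) {
    match n {
        Node::Item(_) => (1, Some(1)),
        Node::OneOrMore(inner) => (item_bounds(inner).0, None),
        Node::Optional(inner) => (0, item_bounds(inner).1),
    }
}
```

The extra bypass edge is the cost; in exchange nothing ever needs to be drawn right to left.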

traviscross · 2025-04-14T20:52:30Z

I've pushed up a set of commits. I had originally planned to merge this first and do these separately, but they're somewhat intertwined with fixing issues that we probably should fix here, so perhaps it's best to look at these now.

We check that the list of grammar "roots" -- that is, productions that are not used in any other production -- is what we expect it to be. We had hard coded this list of roots in `mdbook-spec`. Let's instead add a way to specify this in our syntax for productions by prefixing the production with `@root`.
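
As a rough sketch of how such a marker could be recognized, the function below splits the first line of a production into a root flag and a name. It assumes a production header of the form `@root Name -> ...`; the `->` separator and the `parse_header` name are assumptions for illustration, and `docs/grammar.md` remains the authority on the real syntax.

```rust
/// Splits a production's first line into `(is_root, name)`.
/// Assumes the hypothetical form `[@root] Name -> ...`.
pub fn parse_header(line: &str) -> Option<(bool, &str)> {
    let line = line.trim_start();
    let (is_root, rest) = match line.strip_prefix("@root") {
        // Require whitespace after the marker so `@rooted` is not accepted.
        Some(rest) if rest.starts_with(char::is_whitespace) => (true, rest.trim_start()),
        _ => (false, line),
    };
    // Everything before the arrow is the production name.
    let (name, _rhs) = rest.split_once("->")?;
    let name = name.trim();
    let valid = !name.is_empty()
        && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if valid {
        Some((is_root, name))
    } else {
        None
    }
}
```

The names collected with `is_root == true` would then form the expected set that the root check compares against.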

When reviewing a production in the grammar, one often wants to quickly find the corresponding railroad diagram, and when reviewing a railroad diagram, one often wants to quickly find the corresponding production in the grammar. Let's make this easy by linking each production in the grammar to the corresponding railroad diagram, and from the name of each railroad diagram to the corresponding production in the grammar. When clicking on a production in the grammar, we'll automatically display the railroad diagrams if those are not already displayed.

For `RepeatRange(e, a, b)`, we were rendering `e` on the main line then rendering under it a message about how many times it may or must repeat based on `a` and `b`.

The trouble is that if we say that something "repeats once" on the recurrent edge -- after we've already consumed a thing -- that reads reasonably as though we're saying that two things can be consumed when that's not what we mean.

Similarly, it's a bit odd to say, on the recurrent edge, that something must "repeat twice" when that edge (and presumably then that rule) may not be taken at all.

Let's solve all this by doing the following:

- For `e{1..1}`, simply render the node.
- For `e{0..1}`, treat this as simply `e?`.
- For `e{0..}`, treat this as simply `e*`.
- For `e{1..}`, treat this as simply `e+`.
- For `e{a..0}`, render an empty node.
- For `e{0..b} b > 1`, treat this as `(e{1..b})?`.
- For `e{1..b} b > 1`, render the node on the main line, then on the recurrent line render "at most {b - 1} more times".
- For `e{a..b} a > 1`, make a sequence of length `a` where the final node repeats `{1..b - (a - 1)}` times (or `{1..}` times if `b` is unbounded).

(We'll also add a check in parsing to ensure that for the range to be well formed `a <= b`.)

As it turns out, the most straightforward way to implement this isn't by recursing. Doing that means we end up needing to take special care to handle the suffix and the footnote, we have to build up an extra `Expression` we don't need, and we have to `unwrap` the call. Instead, it works better to treat this lowering in the manner of a transitioning state machine in the spirit of `loop match` as proposed in RFC 3720.
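
The case list above can be sketched as a small state machine. This is an illustrative reconstruction under stated assumptions, not the code from the commit: the `Node` enum, the `AtMostMore` variant, and `lower_repeat_range` are hypothetical names, and an ordinary `loop` with a `match` stands in for the `loop match` construct proposed in RFC 3720.

```rust
/// Hypothetical, minimal diagram node; not the `railroad` crate's API.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Empty,
    Item(String),
    Optional(Box<Node>),
    ZeroOrMore(Box<Node>),
    OneOrMore(Box<Node>),
    /// Node on the main line; the recurrent edge is labeled
    /// "at most {n} more times".
    AtMostMore(Box<Node>, u32),
    Sequence(Vec<Node>),
}

/// Lowers `e{min..max}` (`max == None` means unbounded).
pub fn lower_repeat_range(e: &Node, min: u32, max: Option<u32>) -> Result<Node, String> {
    // Well-formedness check: the range must satisfy `a <= b`.
    if let Some(b) = max {
        if min > b {
            return Err(format!("malformed range {{{min}..{b}}}"));
        }
    }
    let (mut a, mut b) = (min, max);
    let mut prefix: Vec<Node> = Vec::new(); // leading copies when a > 1
    let mut optional = false; // set when a range starting at 0 is rewritten
    let boxed = || Box::new(e.clone());
    // Each iteration either finishes with a node or rewrites (a, b)
    // into a simpler state and goes around again.
    let last = loop {
        match (a, b) {
            (_, Some(0)) => break Node::Empty,
            (1, Some(1)) => break e.clone(),
            (0, Some(1)) => break Node::Optional(boxed()),
            (0, None) => break Node::ZeroOrMore(boxed()),
            (1, None) => break Node::OneOrMore(boxed()),
            // e{0..b}, b > 1  =>  (e{1..b})?
            (0, Some(_)) => {
                optional = true;
                a = 1;
            }
            // e{1..b}, b > 1  =>  "at most b - 1 more times"
            (1, Some(n)) => break Node::AtMostMore(boxed(), n - 1),
            // a > 1: emit a - 1 plain copies, then lower e{1..b - (a - 1)}.
            _ => {
                prefix = vec![e.clone(); (a - 1) as usize];
                b = b.map(|n| n - (a - 1));
                a = 1;
            }
        }
    };
    let node = if prefix.is_empty() {
        last
    } else {
        prefix.push(last);
        Node::Sequence(prefix)
    };
    Ok(if optional { Node::Optional(Box::new(node)) } else { node })
}
```

For example, under these assumptions `e{2..4}` lowers to a sequence of `e` followed by `e` with "at most 2 more times" on its recurrent edge, which consumes between two and four items in total.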

Rollup merge of rust-lang#139884 - rustbot:docs-update, r=ehuss

Update books

## rust-lang/book

1 commits in 45f05367360f033f89235eacbbb54e8d73ce6b70..d33916341d480caede1d0ae57cbeae23aab23e88
2025-04-08 18:24:27 UTC to 2025-04-08 18:24:27 UTC

- Ch01+ch02 after tech review (rust-lang/book#4329)

## rust-lang/edition-guide

2 commits in 1e27e5e6d5133ae4612f5cc195c15fc8d51b1c9c..467f45637b73ec6aa70fb36bc3054bb50b8967ea
2025-04-15 19:49:59 UTC to 2025-04-11 15:27:31 UTC

- fix grammar errors (rust-lang/edition-guide#374)
- remove the unused and deprecated `multilingual` field from `book.toml` (rust-lang/edition-guide#375)

## rust-lang/nomicon

2 commits in b4448fa406a6dccde62d1e2f34f70fc51814cdcc..0c10c30cc54736c5c194ce98c50e2de84eeb6e79
2025-04-09 01:54:42 UTC to 2025-04-07 20:22:31 UTC

- Remove double wording in opaque type chapter (rust-lang/nomicon#487)
- remove `rust-intrinsic` ABI (rust-lang/nomicon#485)

## rust-lang/reference

6 commits in 46435cd4eba11b66acaa42c01da5c80ad88aee4b..3340922df189bddcbaad17dc3927d51a76bcd5ed
2025-04-15 19:03:24 UTC to 2025-04-10 01:56:25 UTC

- Add a new grammar renderer (rust-lang/reference#1787)
- Misc. spelling fixes (rust-lang/reference#1785)
- Fix std::ops links in range-expr (rust-lang/reference#1786)
- traits.md: remove unusual formatting (rust-lang/reference#1784)
- doc: add missing space (rust-lang/reference#1782)
- spelling fix, Discrimants -> Discriminants (rust-lang/reference#1783)

ehuss added 18 commits April 10, 2025 14:31

Add some missing Syntax markers

8d1e7bd

Add missing rules for syntax blocks

acd95af

Fix some minor grammar formatting issues

f208204

Just fixing some small consistency and spacing mistakes.

Fix CfgAttribute name

b53a9ee

This rule was misnamed, colliding with the existing CfgAttrAttribute.

Name common ascii control characters

c1faa76

This removes all backslash escaped characters. This helps to avoid confusing similarities with a literal backslash followed by a character versus the interpreted escaped character.

Remove "followed by" in STRING_CONTINUE

94acc5e

I don't exactly know why this was placed there, but we operate under the assumption that all lexical characters immediately follow one another.

Introduce a new "prose" terminal

86a49fc

This introduces a new terminal kind that I'm calling a "prose" which describes what the terminal is. This is inspired by the IETF format which uses angle brackets to describe terminals in English.

Normalize suffix capitalization

a547e37

The grammar almost always uses lowercase, so let's standardize on that.

Remove parentheses around suffixes

36c8d99

This helps to standardize how suffixes are written. Normally they do not use parentheses, and visually I don't think they are entirely necessary.

Fix nonterminals of ConstParam

35c098a

These two nonterminals were using the wrong name for the productions for BlockExpression and LiteralExpression.

Fix dyn edition presentation

963339e

Per our style, edition differences are supposed to be separated out into an edition block.

Add grammar rule for XID_Start and XID_Continue

1c65870

These were defined in prose below, but defining them here allows us to easily refer and link to them.

Add a grammar rule for reserved tokens

a8e1afb

This adds a grammar rule that collects all the reserved token forms into a single production rule so that we can define what a "token" is by referring to this.

Define the Token rule

13996e6

This defines a Token in the grammar so that we can easily refer to it (and to make it easier to see what all the tokens are).

Remove escape rule

65febd6

We no longer represent characters via escape sequences. These can be confused with the literal two bytes of backslash followed by a character. See the "common productions" list for how these are now referred to.

rustbot added the S-waiting-on-review Status: The marked PR is awaiting review from a maintainer label Apr 10, 2025

ehuss force-pushed the railroad-grammar branch from a8be867 to 4a8e44f Compare April 11, 2025 01:23

ehuss added 6 commits April 11, 2025 08:53

Introduce a new grammar renderer

2baaa05

This adds an extension to mdbook-spec that will parse code blocks written in a BNF-style grammar and render them both as markdown and as railroad diagrams.

Add the javascript hooks for handling the new railroad grammar

216bd24

This adds the hooks to toggle the visibility of the railroad grammar. The status is stored in localstorage to keep it sticky.

Add styling for the new grammar and railroad diagrams

ab8d215

Add a summary chapter that shows all of the grammar productions on one page

6c55e50

Add some documentation for how to write grammar rules

ea629b4

Fix rule reference links with multiple spaces

a954c17

This fixes it so that rule links work correctly if there is more than one space in a reference definition.

traviscross reviewed Apr 14, 2025

mdbook-spec/src/grammar/render_railroad.rs (outdated)

traviscross reviewed Apr 14, 2025

mdbook-spec/src/grammar.rs (outdated)

traviscross force-pushed the railroad-grammar branch from c907258 to a2515e4 Compare April 14, 2025 20:57

ehuss force-pushed the railroad-grammar branch from a2515e4 to bb5862e Compare April 14, 2025 21:18

traviscross added 5 commits April 14, 2025 21:27

Replace a match with let-else

4d69e5a

We can save a line by replacing this `match` with a `let-else`, so let's do that.
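
For readers unfamiliar with the construct, here is a generic before-and-after sketch of that kind of change. The two functions are hypothetical examples for illustration, not the code touched by the commit; both parse the upper bound out of a string such as `2..5`.

```rust
/// Early return written with `match`.
pub fn upper_bound_match(text: &str) -> Option<u32> {
    let (_, hi) = match text.split_once("..") {
        Some(parts) => parts,
        None => return None,
    };
    hi.parse().ok()
}

/// The same logic with `let`-`else`, one line shorter.
pub fn upper_bound_let_else(text: &str) -> Option<u32> {
    let Some((_, hi)) = text.split_once("..") else {
        return None;
    };
    hi.parse().ok()
}
```

The `else` block must diverge (here with `return`), which is what lets the bindings from the pattern stay in scope for the rest of the function.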

Remove support for reversing railroad elements

8a37649

We no longer need to reverse the elements anywhere in our railroad diagrams, so let's remove the supporting infrastructure for doing this.

traviscross force-pushed the railroad-grammar branch from 3778580 to 8a37649 Compare April 14, 2025 21:28

traviscross approved these changes Apr 15, 2025

ehuss added this pull request to the merge queue Apr 15, 2025

Merged via the queue into rust-lang:master with commit 3340922 Apr 15, 2025
5 checks passed

rustbot mentioned this pull request Apr 15, 2025

Update books rust-lang/rust#139884

Merged

est31 mentioned this pull request Apr 23, 2025

Document let_chains again #1740

Merged

ehuss mentioned this pull request Mar 6, 2024

Clean up and consolidate the lexical specification. #567

Open

6 tasks

ehuss added a commit to ehuss/reference that referenced this pull request Jun 11, 2025

Convert remaining grammar rule references to links

1eb544a

In rust-lang#1787 I missed linkify-ing references to grammar rules that weren't links. This makes sure that they are linked and validated.
