Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TypeScript] Add methods for BufferedTokenStream #4569

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 21 additions & 2 deletions runtime/JavaScript/src/antlr4/BufferedTokenStream.d.ts
Original file line number Diff line number Diff line change
@@ -1,8 +1,27 @@
import { TokenStream } from './TokenStream';
import { Lexer } from "./Lexer";
import { TokenSource } from './TokenSource';
import { Token } from './Token';

export declare class BufferedTokenStream extends TokenStream {

tokenSource: Lexer;
tokenSource: TokenSource;
tokens: Token[];
index: number;
fetchedEof: boolean;

constructor(source: TokenSource);
setTokenSource(tokenSource: TokenSource): void;
mark(): number;
release(marker: number): void;
reset(): void;
seek(index: number): void;
consume(): void;
sync(i: number): boolean;
fetch(n: number): number;
LB(k: number): Token;
nextTokenOnChannel(i: number, channel: number): number;
previousTokenOnChannel(i: number, channel: number): number;

protected lazyInit(): void;
Copy link
Contributor

@ericvergnaud ericvergnaud Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need these protected members ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjustSeekIndex is required for subclassing BufferedTokenStream, and it only exists for subclasses to use. See CommonTokenStream::adjustSeekIndex for an example.

lazyInit is needed if the subclass tries to access any of the fields/properties safely, possibly before one of the BufferedTokenStream methods is called. Technically more protected functions are available, like setup, but these two were the minimum required to correctly subclass the stream. Technically all the fields should be marked protected but that felt too restrictive, because the missing underscore convention (_index, _tokens) indicates the fields are public in JS.

I've actually made a mistake, there should be an additional index: number line. I'll push that to the PR branch tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do you need to subclass BufferedTokenStream ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I needed a token stream that did extra processing/rewriting, between the lexer and parser steps. It's a normal/supported thing to do, the Parser accepts any TokenStream as input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that normally done by a TokenStreamRewriter, designed just for that purpose ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TokenStreamRewriter is made with a C-style preprocessor in mind. Its output is text after rewrites have happened. It's not able to e.g. transform tokens, merge them together or split them apart, add context to tokens (technically subclass, but JS makes that very blurry), and generally fails to act as a stream transformer between the lexer and parser. You would need to run a file through a lexer, then the token stream rewriter, then through a different lexer, then through a parser. I'm not dealing with a preprocessor, I need a token stream transformer that outputs a stream of tokens.

I think this is a supported use case; otherwise there would be no need for a distinction between CommonTokenStream and BufferedTokenStream.

Copy link
Contributor Author

@Phlosioneer Phlosioneer Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ground this discussion, we're discussing two protected annotations in a typescript definition file as a companion to the already-implemented, already-publicly-accessible javascript functions. The ship has already sailed about whether these functions were a "good" interface or if the buffer is a "good" extension point. These functions are already part of the public API on every other language runtime except typescript.

Also, it should be noted that other runtimes put lazyInit as protected, never private.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.

Copy link
Contributor Author

@Phlosioneer Phlosioneer Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't know antlr was getting another rewrite. I'll detail my use case, perhaps it will help. However, yesterday I rewrote the whole parser & lexer from scratch and achieved much better results than antlr was able to provide. This is for a personal project. Read only if you're interested, nothing here directly bears on the current protected member discussion.

The DSL I made has identifiers / names of the form foo.bar.baz, dot-separated paths. It also allows a shorthand starting with a dot, .bar.baz, that meant the identifier was relative to some other identifier. There are block statements like with foo { ... .bar.baz ... } that should result in foo.bar.baz at the end of parsing.

OK, all fairly normal so far. The issue is that with blocks could be anywhere in the file. Inside any block or at the top level or nested within other with statements. I tried to spec this out in the parser's .g4 but my simple ~60 line parser ballooned into a few hundred lines, to account for all the possible places where with could happen; and the parser-listener became similarly complex. And that was still with a subset of the intended DSL.

What I really needed was something like with_generic<inner_parser>: 'with' (IDENT | REL_IDENT) '{' inner_parser* '}'; but that didn't seem possible. So I tried to move the handling of with into its own parser / transformer. The rule is extremely simple if stated as its own grammar; it's nearly find-and-replace. I tried to use the provided stream-rewriter, but the translation back to text destroyed metadata that I was tracking in the tokens, particularly the IDENT / REL_IDENT tokens, about the path they represent. I can see the utility of that rewriter for large complex transformations, but this task is extremely simple.

So I wrote a class that took in a token stream, located with blocks, and emitted corrected IDENT tokens. No new tokens were added to the stream, so the line, column, and character indexes of each token were still valid. The text was also still valid; the relative-to information was stored as a separate field. But then I ran into the above issue, where the Parser would not accept just any input stream, only a buffered input stream; and the buffered input stream wouldn't accept an arbitrary token stream, just a Lexer. So I made adjustments until it worked.

The DSL is something I made and could adjust, but it's being used as a game/puzzle element directly, so I couldn't just remove or dramatically change how the with blocks worked. What eventually drove me to rewrite the lexer & parser was frustration over the lexer not tokenizing some IDENT tokens (and not understanding how to properly handle error messages). Turns out, the grammar was actually ambiguous to lex, but I only understood that after rewriting. The lexer and parser ended up being less code, though; the full power and flexibility of antlr wasn't needed. The DSL is LL(1) with finite depth, so the state machine is extremely simple. And with access to generic functions, the with situation could be handled simply in the parser with a single rule-function (parseWith<T, U>(innerParser: (data: U) => T, data: U): void) invoked within the generic block rule-function (parseBlock<T, U>(statementParser: (data: U) => T, data: U): void which corresponds to antlr pseudocode parse_block<inner>: '{' (with_block<inner> | inner)* '}';). The most complex thing I had to do was implementing a prefix tree for lexing keywords. I probably could have just done a hash map with string keys but it was a fun challenge.


I took a look at the antlr5 repo. I like the concept. Though webassembly seems to be splitting in two with the WASI stuff, even two very similar runtimes is far better than 10 unrelated ones. I'll poke around a bit, maybe try to get involved.

protected adjustSeekIndex(i: number): number;
}
6 changes: 5 additions & 1 deletion runtime/JavaScript/src/antlr4/TokenSource.d.ts
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
export declare class TokenSource {
import {Token} from "./Token";

export declare class TokenSource {
_factory: any;
nextToken(): Token;
getSourceName(): string;
}
Loading