Lexer interface mega-PR #220
Yay! Initial thoughts, in no particular order:
In response to your last two points (and this might be a silly idea), how about making the @lexer directive cause
Something like 4x, on this one tiny example:
I don't really want to change Moo's API at this point.
As well as restoring the line/col info, it's resetting the Lexer's internal buffer. It's plausible that the Lexer and Parser could become out-of-sync, for example if the parser throws an error before reaching the end of the input. (Thinking about it, the built-in ChunkLexer might have a problem with that. I shall test this.)
It could be, but I thought the idea was that Moo should be built-in to Nearley? Or does having it as a dependency not allow users to
I think that's a very silly idea :P Matching string literals against token
What does nearley do if you feed() some more after it throws an error? Is nearley's behaviour well-defined in that case?
Note that
\o/
Right, but in this case, you wouldn't ever have to call
Why not? It's just renaming a method. I think the name
Right, I don't think you can require modules recursively just like that (though you never know with npm…). In any case, it's less confusing if the user explicitly installs the lexer they want to use.
:-( Okay.
This, and my confusion around character indexes vs. token indexes, has convinced me that the rewind() API is broken and we should remove it if at all possible.
Isn't
I could probably be persuaded to add a
If the interface bothers you particularly, then we could instead write a light built-in wrapper around moo, and require people to use that.
If you really are convinced, we should do that before people realize it exists/start using it!
Sounds good to me.
Oh well :) What would be the new incremental parsing solution? Because using nearley on like 100 kb files and suboptimal grammars is already kinda painful. And, indeed,
this is really bad when rewinding/feeding is used in a text editor, even at line granularity: the editor does not know where tokens start or end at all.
My use case for the rewinding feature will be a custom CodeMirror mode for your Nearley grammar. CodeMirror gives you a line at a time, so you just have to design your lexer not to use multi-line tokens. :-)
Well, one obvious such token is "whitespace" allowing for multiple line breaks between statements, isn't it? Or am I missing something here?
You can always rewrite your tokenizer to have a separate newline token that matches a single newline at a time. Makes your grammar a bit more complicated, but it's still better than having no tokenizer at all! Anyway; there will be some solution for resuming parsers, but it will probably look more like save()/restore() than it does rewind(). :-)
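A minimal sketch of the single-newline-token idea, written as a tiny standalone RegExp tokenizer rather than with Moo. The rule names and the `tokenize`/`tokenizeByLine` helpers are assumptions made up for illustration. Because no rule can match across a line break, feeding the input one line at a time gives the same tokens as feeding it whole.

```javascript
// Hypothetical token rules, not taken from this PR. Each rule is a sticky
// RegExp, so a match must start exactly at `lastIndex`.
const rules = [
  ["nl", /\n/y],               // exactly one newline per token
  ["ws", /[ \t]+/y],           // horizontal whitespace only, never newlines
  ["word", /[A-Za-z_]\w*/y],
  ["number", /\d+/y],
];

// Tokenize one chunk; throws if no rule matches at the current offset.
function tokenize(chunk) {
  const tokens = [];
  let pos = 0;
  while (pos < chunk.length) {
    let matched = false;
    for (const [type, re] of rules) {
      re.lastIndex = pos;
      const m = re.exec(chunk);
      if (m) {
        tokens.push({ type, value: m[0] });
        pos = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unexpected character at offset " + pos);
  }
  return tokens;
}

// Split the input into lines (each keeps its trailing "\n") and tokenize
// them one at a time, the way an editor would hand them over.
function tokenizeByLine(source) {
  const lines = source.match(/[^\n]*\n|[^\n]+/g) || [];
  return lines.flatMap(tokenize);
}

const sample = "a 1\n\n\nb";
const whole = tokenize(sample);
const perLine = tokenizeByLine(sample);
console.log(whole.length === perLine.length);
```

A whitespace rule such as `/\s+/` would swallow the three consecutive newlines as one token, which is exactly the kind of multi-line token that breaks line-at-a-time feeding.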
Ah okay, save/restore seems fine for me. Of course it smells like O(n^2) memory per file (because each line remembers the preceding lines) but will still allow for reasonable interactivity.
Why? It's linear in the number of lines. It's actually less memory usage than
See #221 for further discussion on this. :-)
Modulo some polish and testing, I think this now does everything I want. :)
You can now:

    parser.feed(line)
    var thing = parser.save()
    parser.feed(anotherLine)
    // hold on, let's revert that
    parser.restore(thing)

This lets you control memory usage more precisely; it doesn't require

You still can't have tokens which overlap chunk boundaries, I'm afraid; it turns out to be way too complicated to handle in a RegExp-based tokenizer (see no-context/moo#43 if you're curious). In practice, this means you'll probably end up feed()-ing one line at a time, and having a single token in your lexer to handle newlines.
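To make the editor use case concrete, here is a rough sketch of keeping one snapshot per line and re-parsing only from the edited line onward. `ToyParser` is a made-up stand-in that only tracks bracket depth; it is not nearley's implementation, and `parseLines` and `checkpoints` are hypothetical names. Only the `feed()`/`save()`/`restore()` method names come from this thread.

```javascript
// Toy stand-in for a parser exposing feed()/save()/restore(). It only
// tracks bracket depth and a line count; it is NOT nearley's parser.
class ToyParser {
  constructor() {
    this.state = { depth: 0, lines: 0 };
  }
  feed(line) {
    let depth = this.state.depth;
    for (const ch of line) {
      if (ch === "(") depth++;
      else if (ch === ")") depth--;
      if (depth < 0) {
        throw new Error("Unbalanced ')' on line " + (this.state.lines + 1));
      }
    }
    // Only commit the new state once the whole line has been accepted.
    this.state = { depth, lines: this.state.lines + 1 };
    return this;
  }
  save() {
    return { ...this.state }; // snapshot is a copy, not a live reference
  }
  restore(snapshot) {
    this.state = { ...snapshot };
  }
}

// Hypothetical editor helper. checkpoints[i] holds the state *before*
// line i, so an edit on line `from` only re-feeds lines from..end.
function parseLines(parser, lines, checkpoints, from) {
  parser.restore(checkpoints[from]);
  checkpoints.length = from + 1; // drop snapshots made stale by the edit
  for (let i = from; i < lines.length; i++) {
    parser.feed(lines[i]);
    checkpoints.push(parser.save());
  }
  return parser.save();
}

const parser = new ToyParser();
const checkpoints = [parser.save()];
const lines = ["(a", "(b)", "c)"];
const first = parseLines(parser, lines, checkpoints, 0);

lines[1] = "((b)"; // the user edits the second line
const second = parseLines(parser, lines, checkpoints, 1);
console.log(first, second);
```

In this sketch the number of stored snapshots is one per line plus the initial one, so memory grows with the line count times the size of a single snapshot.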
I just tested again on a 1k JSON sample: ~7 -> 115 ops/sec. 🙃
This avoids the need for periods in %-specifiers. If we see a @lexer directive, use Lexer#has() to look up token names. Only if that returns falsy do we look up a local variable of the same name. :-)
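The lookup order in that commit message could be sketched roughly as follows. `resolveTokenName`, the `locals` object and the error for unknown names are assumptions for illustration and are not taken from nearley's compiler; only the `has()` method name comes from the text above.

```javascript
// Hypothetical compile-time helper: resolve a %name specifier.
// 1. If a lexer is present and lexer.has(name) is truthy, match by type.
// 2. Otherwise fall back to a local variable of the same name.
function resolveTokenName(name, lexer, locals) {
  if (lexer && lexer.has(name)) {
    return { type: name };
  }
  if (Object.prototype.hasOwnProperty.call(locals, name)) {
    return locals[name];
  }
  // Assumption: an unknown name is an error. The thread does not say.
  throw new Error("Unknown token: %" + name);
}

// Stand-in lexer exposing only has(); a real lexer would have more.
const lexer = { has: (name) => ["number", "ws"].includes(name) };
// Stand-in for variables defined in the grammar file's own scope.
const locals = { vowel: { test: (ch) => "aeiou".includes(ch) } };

console.log(resolveTokenName("number", lexer, locals));
console.log(resolveTokenName("vowel", lexer, locals) === locals.vowel);
```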
sample1k.json benchmark: 4 ops/s -> 61 ops/s
Closes #188.
* Add Lexer#formatError() for this purpose.
I re-did this PR to be more comprehensible (rebased the commits to remove my false starts!)
Let's merge this and see if anyone complains! :D
- @lexer directive to export a Lexer from a grammar file
- @lexer directive disables string expansion, so "false" matches a token with the value "false". (Allow turning off string expansion #212)

See README updates for some documentation on the lexer interface. (I fully expect @Hardmath123 to rewrite the readme once I'm done with this 😉)
As a replacement for #198 (which I no longer want to do), I've changed the semantics of token matching, and added a fourth kind of specifier, {type:}.

- String: this specifier matches a non-terminal.
- {literal: String}: this matches the value of the token returned by your Lexer.
- {type: String}: this matches the type of the token. If you're not using a custom lexer, this doesn't make any sense.
- {test: Token -> Boolean}: if you're using a custom Lexer, we pass the whole token object to test(), so you can inspect any of its attributes. However, if you're using the built-in ChunkLexer, then you get just the value of the token, to be consistent with the existing behaviour.

Performance: the Lexer interface adds a tiny overhead to scannerless parsers, but this is probably vastly outweighed by the benefit of pushing everyone toward using tokenizers. :-)
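As an illustration of the four specifier kinds, a matcher along these lines would behave as described. It is a sketch, not nearley's actual code: `matchesSpecifier` and the `usingCustomLexer` flag are hypothetical names, and tokens are assumed to be plain objects with `type` and `value` fields.

```javascript
// Sketch of the matching rules for the four specifier kinds.
// `usingCustomLexer` is a hypothetical flag distinguishing a custom Lexer
// from the built-in ChunkLexer.
function matchesSpecifier(spec, token, usingCustomLexer) {
  if (typeof spec === "string") {
    return false; // a String names a non-terminal; it never matches a token
  }
  if ("literal" in spec) {
    return token.value === spec.literal; // compare against the token's value
  }
  if ("type" in spec) {
    // Only meaningful with a custom lexer, which assigns token types.
    return Boolean(usingCustomLexer) && token.type === spec.type;
  }
  if ("test" in spec) {
    // Custom lexer: test() sees the whole token. Built-in: just the value.
    return Boolean(spec.test(usingCustomLexer ? token : token.value));
  }
  throw new TypeError("Unknown specifier kind");
}

const keyword = { type: "keyword", value: "false" };
const isKeyword = { test: (tok) => tok.type === "keyword" };
const isLetterF = { test: (value) => value === "f" };

console.log(matchesSpecifier({ literal: "false" }, keyword, true));
console.log(matchesSpecifier(isKeyword, keyword, true));
console.log(matchesSpecifier(isLetterF, { value: "f" }, false));
```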
@lexer disables EBNF string expansion (#212). This means generating literal specifiers is very natural and convenient: