Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexer interface mega-PR #220

Merged
merged 8 commits into from
Apr 24, 2017
Merged

Lexer interface mega-PR #220

merged 8 commits into from
Apr 24, 2017

Conversation

tjvr
Copy link
Collaborator

@tjvr tjvr commented Mar 28, 2017


See README updates for some documentation on the lexer interface. (I fully expect @Hardmath123 to rewrite the readme once I'm done with this 😉)

As a replacement for #198—which I no longer want to do—I've changed the semantics of token matching, and added a fourth kind of specifier, {type:}.

  • String: this specifier matches a non-terminal.
  • {literal: String}: this matches the value of the token returned by your Lexer.
  • {type: String}: this matches the type of the token. If you're not using a custom lexer, this doesn't make any sense.
  • {test: Token -> Boolean}: if you're using a custom Lexer, we pass the whole token object to test(), so you can inspect any of its attributes. However, if you're using the built-in ChunkLexer, then you get just the value of the token, to be consistent with the existing behaviour.

Performance: the Lexer interface adds a tiny overhead to scannerless parsers, but is probably vastly outweighed by the benefit of pushing everyone toward using tokenizers. :-)


@lexer disables EBNF string expansion (#212). This means generating literal specifiers is very natural and convenient:

value ->
      object
    | array
    | %NUMBER
    | %STRING
    | "true" {% () => true %}
    | "false" {% () => false %}
    | "null" {% () => null %}

@kach
Copy link
Owner

kach commented Mar 28, 2017

Yay!

Initial thoughts, in no particular order:

  • I think moo should be a dev dependency
  • Let's name it lexer.new() instead of lexer.clone().
  • Do you by any chance have benchmarks on how much faster JSON is with a lexer?
  • In reset("", Info), it's unclear why it takes an empty string as the first argument. We might want to rename it restore(Info) to match Canvas.

In response to your last two points — and this might be a silly idea — but how about making the @lexer directive cause "foo" to turn into {type: "foo"}. This solves both your problems. :P

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 28, 2017

Do you by any chance have benchmarks on how much faster JSON is with a lexer?

Something like 4x, on this one tiny example:

  ✔ json test2.json x 3,455 ops/sec ±1.37% (90 runs sampled)
  ✔ json test2.json x 13,172 ops/sec ±0.89% (88 runs sampled)

Let's name it lexer.new() instead of lexer.clone().

I don't really want to change Moo's API at this point.

In reset("", Info), it's unclear why it takes an empty string as the first argument.

As well as restoring the line/col info, it's resetting the Lexer's internal buffer. It's plausible that the Lexer and Parser could become out-of-sync, for example if the parser throws an error before reaching the end of the input. (Thinking about it, the built-in ChunkLexer might have a problem with that. I shall test this.)

I think moo should be a dev dependency

It could be, but I thought the idea was that Moo should be built-in to Nearley? Or does having it as a dependency not allow users to npm install nearley and then require('moo')?

how about making the @lexer directive cause "foo" to turn into {type: "foo"}

I think that's a very silly idea :P Matching string literals against token value is IMHO very natural; and it greatly simplifies converting a scannerless parser into one with a lexer. It means that more things Just Work: example.

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 28, 2017

What does nearley do if you feed() some more after it throws an error? Is nearley's behaviour well-defined in that case?

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 28, 2017

Note that Lexer#feed() has the same problems as no-context/moo#36 — notably:

@nathan:

It is up to the user to push data at token boundaries.

@kach
Copy link
Owner

kach commented Mar 28, 2017

Something like 4x

\o/

As well as restoring the line/col info, it's resetting the Lexer's internal buffer.

Right, but in this case, you wouldn't ever have to call .reset() with something besides "" as the first argument, right? Because the parser always rewinds to token boundaries.

I don't really want to change Moo's API at this point.

Why not? It's just renaming a method. I think the name .new() makes much more sense than .clone(), especially since this is the generic nearley <=> lexer interface.

Or does having it as a dependency not allow users to npm install nearley and then require('moo')?

Right, I don't think you can require modules recursively just like that (though you never know with npm…). In any case, it's less confusing if the user explicitly installs the lexer they want to use.

I think that's a very silly idea :P

:-(

Okay.

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 28, 2017

Because the parser always rewinds to token boundaries.

This, and my confusion around character indexes vs. token indexes, has convinced me that the rewind() API is broken and we should remove it if at all possible.


I think the name .new() makes much more sense than .clone(), especially since this is the generic nearley <=> lexer interface.

Isn't new a keyword? Regardless, in the context of moo I'm not going to rename it, since cloning the Lexer object is exactly what it does (it's a method on Lexer which returns a separate Lexer, so you can tokenise more than one thing at one using the same lexer definition).

I could probably be persuaded to add a restore() method to Moo :-)
(My reservation is that it's not clear whether or not restore() will reset the Lexer's internal buffer.)

If the interface bothers you particularly, then we could instead write a light built-in wrapper around moo, and require people to use that.

@kach
Copy link
Owner

kach commented Mar 29, 2017

the rewind() API is broken and we should remove it if at all possible

If you really are convinced, we should do that before people realize it exists/start using it!

we could instead write a light built-in wrapper around moo

Sounds good to me.

@wizzard0
Copy link

If you really are convinced, we should do that before people realize it exists/start using it!

Oh well :) What would be the new incremental parsing solution? Because using nearley on like 100 kb files and suboptimal grammars is already kinda painful.

And, indeed,

It is up to the user to push data at token boundaries.

this is really bad when the rewinding/feeding is used in the text editor, even at line granularity -- editor does not know where tokens start/end at all.

tjvr added a commit that referenced this pull request Mar 29, 2017
As discussed on #220.
@tjvr
Copy link
Collaborator Author

tjvr commented Mar 29, 2017

@Hardmath123: 8c33475

this is really bad when the rewinding/feeding is used in the text editor, even at line granularity

My use case for the rewinding feature will be a custom CodeMirror mode for your Nearley grammar. CodeMirror gives you a line at a time, so you just have to design your lexer not to use multi-line tokens. :-)

@wizzard0
Copy link

Well, one obvious such token is "whitespace" allowing for multiple line breaks between statements isnt it? Or am I missing something here?

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 29, 2017

You can always rewrite your tokenizer to have a separate newline token that matches a single newline at a time. Makes your grammar a bit more complicated, but it's still better than having no tokenizer at all!

Anyway; there will be some solution for resuming parsers, but it will probably look more like save()/restore() than it does rewind(). :-)

@wizzard0
Copy link

Ah okay, save/restore seems fine for me. Of course it smells like O(n^2) memory per file (because each line remembers the preceding lines) but will still allow for reasonable interactivity.

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 29, 2017

it smells like O(n^2) memory per file

Why? It's linear in the number of lines. It's actually less memory usage than rewind() was, since that required keepHistory, which has memory usage linear in the number of tokens...

See #221 for further discussion on this. :-)

@tjvr tjvr mentioned this pull request Mar 29, 2017
@tjvr tjvr changed the title Add Lexer interface Lexer mega-PR Mar 30, 2017
@tjvr tjvr changed the title Lexer mega-PR Lexer interface mega-PR Mar 30, 2017
@tjvr tjvr requested a review from kach March 30, 2017 11:01
@tjvr
Copy link
Collaborator Author

tjvr commented Mar 30, 2017

Modulo some polish and testing, I think this now does everything I want. :)

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 30, 2017

What would be the new incremental parsing solution?

You can now save() and restore():

parser.feed(line)
var thing = parser.save()

parser.feed(anotherLine)
// hold on, let's revert that
parser.restore(thing)

This lets you control memory usage more precisely—it doesn't require keepHistory—and it plays nicely with lexer integration. I've added back rewind(), too, so we don't have to violate semver :-)

You still can't have tokens which overlap chunk boundaries, I'm afraid; it turns out to be way too complicated to handle in a RegExp-based tokenizer (see no-context/moo#43 if you're curious). In practice, this means you'll probably end up feed()-ing one line at a time, and having a single token in your lexer to handle newlines.

@tjvr
Copy link
Collaborator Author

tjvr commented Mar 30, 2017

\o/

I just tested again on a 1k JSON sample: ~7 -> 115 ops/sec. 🙃

@tjvr tjvr requested review from kach and removed request for kach April 5, 2017 11:38
tjvr added 5 commits April 17, 2017 15:39
This avoids the need for periods in %-specifiers.

If we see a @lexer directive, use Lexer#has() to look up token names. Only if that returns falsy do we look up a local variable of the same name. :-)
sample1k.json benchmark: 4 ops/s -> 61 ops/s
Closes #188.

* Add Lexer#formatError() for this purpose.
@tjvr
Copy link
Collaborator Author

tjvr commented Apr 17, 2017

I re-did this PR to be more comprehensible (rebased the commits to remove my false starts!)

@tjvr tjvr mentioned this pull request Apr 19, 2017
@tjvr
Copy link
Collaborator Author

tjvr commented Apr 24, 2017

Let's merge this and see if anyone complains! :D

@tjvr tjvr merged commit 5f37e85 into master Apr 24, 2017
@tjvr tjvr deleted the moo branch June 22, 2017 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants