
Built-in Tokenizer #166

Closed
tjvr opened this issue Feb 16, 2017 · 29 comments · May be fixed by GerHobbelt/nearley#15

Comments

@tjvr
Collaborator

tjvr commented Feb 16, 2017

We should consider including a tokenizer in nearley.

@bd82
Contributor

bd82 commented Feb 17, 2017

Hello.

I've created a feature-rich and fast tokenizer as part of another parsing toolkit. I can investigate the effort required to separate/publish it as an independent module (tokenizer only), if it is of interest for reuse/integration with nearley.

Example JSON lexer definition (only look at Lines 9-23)

@tjvr
Collaborator Author

tjvr commented Mar 1, 2017

Ideally our tokenizer, or a wrapper for it, should support the same API as Parser (e.g. feed(), rewind()), so that you can use it as a drop-in replacement for a scannerless Parser.

This is primarily motivated by the nearley-codemirror project I'm planning... :-)
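
For illustration, a rough sketch of what such a wrapper could look like (entirely hypothetical, not the eventual implementation; the underlying lexer is assumed to expose reset() and next(), and to emit tokens carrying an offset):

function TokenizerWrapper(lexer) {
  this.lexer = lexer;
  this.buffer = '';   // all source fed so far
  this.tokens = [];   // all tokens produced so far
}

// feed() accepts a chunk of source and tokenizes it eagerly.
TokenizerWrapper.prototype.feed = function (chunk) {
  this.buffer += chunk;
  this.lexer.reset(this.buffer); // naive: re-tokenize from scratch
  this.tokens = [];
  var tok;
  while ((tok = this.lexer.next())) this.tokens.push(tok);
};

// rewind() discards every token from the given index onwards.
TokenizerWrapper.prototype.rewind = function (index) {
  var tok = this.tokens[index];
  if (tok) this.buffer = this.buffer.slice(0, tok.offset);
  this.tokens.length = index;
};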

@bd82
Contributor

bd82 commented Mar 1, 2017

I'm assuming feed(x) means to tokenize the input x.
So that's an easy API wrapper to create.

What does rewind do? What use cases is it for, particularly in the context of a lexer?

@tjvr
Collaborator Author

tjvr commented Mar 1, 2017

@bd82 See #165.

feed() gives the parser a chunk of source code. rewind() resets the parser back to a previous position. These are both seriously useful for applications such as syntax highlighting, because you can cache previous parses.

(note: I'm talking about syntax highlighting that requires the use of a parser, which is fairly rare — see tosh for an example language that does this)
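
To illustrate the caching idea (a hypothetical sketch; the exact shape of the saved position and of rewind()'s argument were still being designed in #165 — lines, parser, and editedLine are assumed to exist):

// Highlighters re-parse on every keystroke, so cache a parser position
// after each line; an edit to line k then only re-parses from line k on.
var positions = [];
lines.forEach(function (line, i) {
  parser.feed(line + '\n');
  positions[i] = parser.save(); // assumed: snapshot the current parse state
});

// After the user edits line k:
parser.rewind(positions[k - 1]); // assumed: restore the cached snapshot
parser.feed(editedLine + '\n');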

@bd82
Contributor

bd82 commented Mar 1, 2017

All right.

For a lexer, isn't rewinding to a previous position much simpler than for a parser, given that it is normally a flat structure?

If the lexer input is a token vector, can't the previous positions be computed by using the offset of the tokens and calling substring on the original input?
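
That computation might look something like this (a sketch of the suggestion above; it assumes each token records its offset into the original input, and a lexer that can be reset on a string):

// Rewind the lexer to just before token i by slicing the original input.
function rewindLexer(lexer, input, tokens, i) {
  var offset = tokens[i].offset;    // where token i started
  lexer.reset(input.slice(offset)); // re-tokenize only the tail
  tokens.length = i;                // drop the discarded tokens
}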

@tjvr
Collaborator Author

tjvr commented Mar 11, 2017

@Hardmath123 I've been working on adding our new standalone tokenizer library, moo. I've found some issues with your proposed scheme, and I want your thoughts...

So your idea was to define the lexer inside the .ne file.

@{%

var Lexer = (typeof module === 'object' && module.exports) ?  require('./lexer') : nearley.Lexer;
var l = Lexer([
    ['string', /"((?:[^\\"\n]|\\["\\\/bfnrt]|\\u[0-9a-fA-F]{4})*?)"/],
    ['whitraw', /(?:\s|\n)+/],
    ['comment', /#[^\n]*?$/],
    ['word', /[\w\?\+]+/],
])
%}
  • How can we then get it out of the grammar file? I can't see a way to export it.
  • requiring nearley.Lexer here seems icky, but doable.
  • I'm writing nearley.Lexer as a wrapper around moo, to add the nice things you wanted (like .test() and line number tracking). :-)
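
For context, such a wrapper might look roughly like this (a sketch assuming moo's object-of-rules compile() form rather than the array-of-pairs shown above; the test() helper is the proposed addition, and the real wrapper was never released in this form):

var moo = require('moo');

function Lexer(rules) {
  if (!(this instanceof Lexer)) return new Lexer(rules);
  this.moo = moo.compile(rules); // moo tracks line/col for us
}
Lexer.prototype.reset = function (data, info) { return this.moo.reset(data, info); };
Lexer.prototype.next = function () { return this.moo.next(); };
// The proposed convenience: check a token against a named type.
Lexer.prototype.test = function (token, type) { return !!token && token.type === type; };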

@kach
Owner

kach commented Mar 11, 2017

I think nearley.Lexer could just live within nearley.js, if it's not too big. And the lexer could either (1) be defined within the grammar file, or (2) be defined elsewhere and imported — same story as postprocessors.

@tjvr
Collaborator Author

tjvr commented Mar 11, 2017

How do you import postprocessors?

@kach
Owner

kach commented Mar 11, 2017

Just like how you'd import anything else: require() in node, or add the code in a <script> tag on the browser. At least, that's how people do it right now.

@bates64
Contributor

bates64 commented Mar 19, 2017

Just published nearley-moo - a simple-ish wrapper for moo and nearley that automagically feeds tokens to the grammar. Here's an example.

@Hardmath123 is it possible to define variables within the grammar.js scope from another file, without polluting global? It would be a great help if %ident token identifiers (which don't support property references) fell back to referencing tokens.ident, provided ident itself is not defined: const tokens = nm(require('./tokens')). Currently nm pollutes global || window, which isn't great!

@tjvr
Collaborator Author

tjvr commented Mar 20, 2017

@NanaLan It's not clear what you're asking. But I think what you want is to allow periods in %specs.

Hardmath—we should definitely allow periods in specs. I also think we should merge #198. Then we can think about adding a built-in Moo wrapper to Nearley.

@kach
Owner

kach commented Mar 20, 2017

Yeah, periods in %token names is probably unavoidable.

I'll go remind myself what #198 was all about.

@tjvr
Collaborator Author

tjvr commented Mar 25, 2017

We need to be able to define a lexer inside a .ne file, and export it from the compiled grammar so it's available for the parser wrapper to use.

I propose we add some sort of @export directive for this. Then you can define a moo lexer in the usual way, and make it available for the parser. Though the parser needs to know whether or not to use a custom tokeniser... perhaps a @lexer directive?

@kach
Owner

kach commented Mar 25, 2017

How important do you think it is for lexers to be defined within the .ne file?

@tjvr
Collaborator Author

tjvr commented Mar 25, 2017

Very. I want to keep all the language-related stuff in one place.

@kach
Owner

kach commented Mar 25, 2017

Okay, and you think it should be included as raw JS instead of some sort of moo-optimized lexer DSL? I guess that makes sense if we want to support non-moo lexers as well.

In that case, an @lexer directive might work well: the user calling .feed() might not even need to worry about the lexer at all because if nearleyc knows about the lexer, then .feed() can do the right thing automagically. No need to export anything.
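
From the user's side, that might look something like this (a sketch; nearley.Grammar.fromCompiled is today's API and was not yet settled at the time of this discussion):

var nearley = require('nearley');
var grammar = require('./grammar'); // compiled from a .ne file containing @lexer

var parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
parser.feed('some source text'); // tokenized internally by the declared lexer
console.log(parser.results);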

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

an @lexer directive might work well

What would this look like? Would the argument to @lexer be expected to follow some sort of nearley-lexer interface?

@kach
Owner

kach commented Mar 27, 2017

Yeah, I'm thinking you do

{%
var foolexer = moo.compile(…);
%}

@lexer foolexer

The argument to @lexer would be a variable pointing to an object with properties to (1) feed a string, (2) get the next token, (3) do what "matcher" does in #198. Everything else can then be wired up in nearley.js so that it Just Works for users.
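
Spelled out, the expected shape would be something along these lines (names illustrative, mirroring points (1)-(3) above):

var foolexer = {
  feed: function (chunk) { /* (1) buffer a string of source */ },
  next: function () { /* (2) return the next token, or undefined at the end */ },
  match: function (spec, token) { /* (3) the "matcher" from #198 */ }
};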

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

var foolexer = moo.compile(…);

Surely nearley.lexer(...) or some such? The interface moo exposes isn't the same as the one you describe.

@kach
Owner

kach commented Mar 27, 2017

Err, isn't it? lexer.feed() feeds a string, lexer.next() is the next token, and we can add a lexer.match() pretty easily. I guess we need lexer.reset(), too.

I was hoping we could standardize the interface around nearley<=>moo, and then other lexers can write simple wrappers to comply with that interface.
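
The wrapper idea might look like this (a sketch; the inner tokenizer and all of its methods are invented stand-ins for some third-party lexer):

function MooShim(inner) {
  this.inner = inner; // some non-moo tokenizer
}
MooShim.prototype.reset = function (data, info) {
  this.inner.init(data, info); // hypothetical underlying call
  return this;
};
MooShim.prototype.next = function () {
  var t = this.inner.nextToken(); // hypothetical underlying call
  if (!t) return undefined;
  return { type: t.kind, value: t.text, line: t.line, col: t.col }; // moo-shaped token
};
MooShim.prototype.save = function () {
  return this.inner.state(); // hypothetical underlying call
};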

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

moo also supports stateful lexers, so the interface needs to be a bit more complicated than that if we want rewind() to keep working. :-)

@kach
Owner

kach commented Mar 27, 2017

Uhhh… yes. Yikes. How would that even… work? :-/

So, another option is to define the lexer in a standard .js file and pass it in as an option when you create a new parser. We might be able to work out some magic to make token names available in the .ne file.

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

Have a look at https://github.com/tjvr/moo#states and https://github.com/tjvr/moo#reset — calling save() on a lexer gives you back an object describing the lexer state. We can do this after every token and store the result on Column; then rewind() can use that state to reset() the lexer.
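
In standalone form, using moo's actual save()/reset() signatures:

var moo = require('moo');

var input = 'ab cd ef';
var lexer = moo.compile({
  word: /[a-z]+/,
  ws: { match: /\s+/, lineBreaks: true },
});
lexer.reset(input);

// Record the lexer state and end offset after every token (the proposal is
// to store these on nearley's Column objects).
var states = [], ends = [];
var tok;
while ((tok = lexer.next())) {
  states.push(lexer.save());             // {line, col, state, ...}
  ends.push(tok.offset + tok.text.length);
}

// Rewind to just after token i: re-feed the remaining text along with the
// saved line/col/state info.
var i = 1;
lexer.reset(input.slice(ends[i]), states[i]);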

Standardising on some interface does actually seem reasonable; there has to be some interface, and it doesn't really matter if it's moo's interface or a special nearley-specific one. So I think that plan of yours is best.

(Ideally the rewind() API might look a bit more like moo's save/reset API; I might do that sometime in the future. I think I could re-implement rewind() in terms of Parser#reset().)

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

FWIW, I don't think Lexer#match() is something we should add to moo, but that's kind of an irrelevant detail.

BTW, would it be okay to rewrite nearley-bootstrapped to use moo? It's probably the easiest way for me to test this, and would conveniently solve #188 :-)

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

Here are the things I think we need:

  • Add new option match: (spec, token) -> bool (i.e. Make token matching semantics user-definable #198)
  • Add @require directive (so we can access moo, or any other tokeniser, from inside the .ne JavaScript)
  • Add @lexer directive: its argument is an object with these methods (sketched after this list):
    • clone() -> new Lexer
    • save() -> Info
    • reset("", Info) -> restore line/col/state info (for rewind)
    • feed(chunk: String)
    • next() -> {type, value, line, col, …}
  • Allow dots . in token specifiers %lex.string
  • Have some way of auto-generating custom specifiers (some nearley-specific helper?) Or can we change the meaning, so that %string compiles to {type: "string"}?
  • (Optional, but nice) Support disabling string expansion, and instead matching spec.literal against token.value.
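
Sketched as code, the directive's argument would satisfy an interface like this (a skeleton only; for the record, the @lexer support nearley eventually shipped settled on reset/next/save plus formatError/has):

var lexer = {
  clone: function () { /* return a fresh, independent Lexer */ },
  save: function () { /* return an Info object (line/col/state) */ },
  reset: function (chunk, info) { /* restore line/col/state from Info, take new input */ },
  feed: function (chunk) { /* accept a chunk of source */ },
  next: function () { /* return {type, value, line, col, ...} or undefined */ }
};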

The necessity of the @require & @lexer directives makes me feel like we're trying to re-invent JS/Node inside of our .ne files. I've mentioned before the idea of using ES6 template literals instead, and I still think it's a good one; however I'm guessing you're not yet ready to drop nearleyc entirely @Hardmath123? :-)

@kach
Owner

kach commented Mar 28, 2017

Have a look at https://github.com/tjvr/moo#states and https://github.com/tjvr/moo#reset — calling save() on a lexer gives you back an object describing the lexer state. We can do this after every token and store the result on Column; then rewind() can use that state to reset() the lexer.

That sounds expensive memory-wise. How important is rewinding-with-lexers, anyway?

FWIW, I don't think Lexer#match() is something we should add to moo, but that's kind of an irrelevant detail.

Where else would that functionality be? Someone needs to know how to check whether a token is of a particular type; it makes sense for the lexer to be the one to know.

The necessity of the @require & @lexer directives makes me feel like we're trying to re-invent JS/Node inside of our .ne files.

I think the conclusion here isn't "let's replace nearleyc", but rather "let's be more deliberate about the scope of nearleyc".

@tjvr
Collaborator Author

tjvr commented Mar 28, 2017

Where else would that functionality be?

The responsibility for match() should lie with whatever is responsible for generating the token specifiers.

@tjvr
Collaborator Author

tjvr commented Mar 28, 2017

That sounds expensive memory-wise.

Nah. Particularly compared to all the Column/Item data keepHistory is already, um, keeping.

@tjvr tjvr self-assigned this Mar 29, 2017
@tjvr
Collaborator Author

tjvr commented Aug 9, 2017

This is totally resolved by now!
