
Built-in Tokenizer #166

Closed
tjvr opened this issue Feb 16, 2017 · 29 comments · May be fixed by GerHobbelt/nearley#15

Comments

@tjvr
Collaborator

tjvr commented Feb 16, 2017

We should consider including a tokenizer in nearley.

@bd82
Contributor

bd82 commented Feb 17, 2017

Hello.

I've created a feature-rich and fast tokenizer as part of another parsing toolkit. I can investigate the effort required to separate/publish it as an independent module (tokenizer only), if it is of interest for reuse/integration with nearley.

Example JSON lexer definition (only look at Lines 9-23)

@tjvr
Collaborator Author

tjvr commented Mar 1, 2017

Ideally our tokenizer, or a wrapper for it, should support the same API as Parser (e.g. feed(), rewind()), so that you can use it as a drop-in replacement for a scannerless Parser.

This is primarily motivated by the nearley-codemirror project I'm planning... :-)
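
For illustration, a rough sketch of what such a wrapper could look like (entirely hypothetical, not the eventual implementation; the underlying lexer is assumed to expose reset() and next(), and to emit tokens carrying an offset):

function TokenizerWrapper(lexer) {
  this.lexer = lexer;
  this.buffer = '';   // all source fed so far
  this.tokens = [];   // all tokens produced so far
}

// feed() accepts a chunk of source and tokenizes it eagerly.
TokenizerWrapper.prototype.feed = function (chunk) {
  this.buffer += chunk;
  this.lexer.reset(this.buffer); // naive: re-tokenize from scratch
  this.tokens = [];
  var tok;
  while ((tok = this.lexer.next())) this.tokens.push(tok);
};

// rewind() discards every token from the given index onwards.
TokenizerWrapper.prototype.rewind = function (index) {
  var tok = this.tokens[index];
  if (tok) this.buffer = this.buffer.slice(0, tok.offset);
  this.tokens.length = index;
};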

@bd82
Contributor

bd82 commented Mar 1, 2017

I'm assuming feed(x) means to tokenize the input x.
So that's an easy API wrapper to create.

What does rewind do? What use cases is it for, particularly in the context of a lexer?

@tjvr
Collaborator Author

tjvr commented Mar 1, 2017

@bd82 See #165.

feed() gives the parser a chunk of source code. rewind() resets the parser back to a previous position. These are both seriously useful for applications such as syntax highlighting, because you can cache previous parses.

(note: I'm talking about syntax highlighting that requires the use of a parser, which is fairly rare — see tosh for an example language that does this)
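
To illustrate the caching idea (a hypothetical sketch; the exact shape of the saved position and of rewind()'s argument were still being designed in #165 — lines, parser, and editedLine are assumed to exist):

// Highlighters re-parse on every keystroke, so cache a parser position
// after each line; an edit to line k then only re-parses from line k on.
var positions = [];
lines.forEach(function (line, i) {
  parser.feed(line + '\n');
  positions[i] = parser.save(); // assumed: snapshot the current parse state
});

// After the user edits line k:
parser.rewind(positions[k - 1]); // assumed: restore the cached snapshot
parser.feed(editedLine + '\n');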

@bd82
Contributor

bd82 commented Mar 1, 2017

All right.

For a lexer, isn't rewinding to a previous position much simpler than for a parser, given that it is normally a flat structure?

If the lexer input is a token vector, can't the previous positions be computed by using the offset of the tokens and calling substring on the original input?
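
That computation might look something like this (a sketch of the suggestion above; it assumes each token records its offset into the original input, and a lexer that can be reset on a string):

// Rewind the lexer to just before token i by slicing the original input.
function rewindLexer(lexer, input, tokens, i) {
  var offset = tokens[i].offset;    // where token i started
  lexer.reset(input.slice(offset)); // re-tokenize only the tail
  tokens.length = i;                // drop the discarded tokens
}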

@tjvr
Collaborator Author

tjvr commented Mar 11, 2017

@Hardmath123 I've been working on adding our new standalone tokenizer library, moo. I've found some issues with your proposed scheme, and I want your thoughts...

So your idea was to define the lexer inside the .ne file.

@{%

var Lexer = (typeof module === 'object' && module.exports) ?  require('./lexer') : nearley.Lexer;
var l = Lexer([
    ['string', /"((?:[^\\"\n]|\\["\\\/bfnrt]|\\u[0-9a-fA-F]{4})*?)"/],
    ['whitraw', /(?:\s|\n)+/],
    ['comment', /#[^\n]*?$/],
    ['word', /[\w\?\+]+/],
])
%}
  • How can we then get it out of the grammar file? I can't see a way to export it.
  • requiring nearley.Lexer here seems icky, but doable.
  • I'm writing nearley.Lexer as a wrapper around moo, to add the nice things you wanted (like .test() and line number tracking). :-)
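
For context, such a wrapper might look roughly like this (a sketch assuming moo's object-of-rules compile() form rather than the array-of-pairs shown above; the test() helper is the proposed addition, and the real wrapper was never released in this form):

var moo = require('moo');

function Lexer(rules) {
  if (!(this instanceof Lexer)) return new Lexer(rules);
  this.moo = moo.compile(rules); // moo tracks line/col for us
}
Lexer.prototype.reset = function (data, info) { return this.moo.reset(data, info); };
Lexer.prototype.next = function () { return this.moo.next(); };
// The proposed convenience: check a token against a named type.
Lexer.prototype.test = function (token, type) { return !!token && token.type === type; };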

@kach
Owner

kach commented Mar 11, 2017

I think nearley.Lexer could just live within nearley.js, if it's not too big. And the lexer could either (1) be defined within the grammar file, or (2) be defined elsewhere and imported — same story as postprocessors.

@tjvr
Collaborator Author

tjvr commented Mar 11, 2017

How do you import postprocessors?

@kach
Owner

kach commented Mar 11, 2017

Just like how you'd import anything else: require() in node, or add the code in a <script> tag on the browser. At least, that's how people do it right now.

@bates64
Contributor

bates64 commented Mar 19, 2017

Just published nearley-moo - a simple-ish wrapper for moo and nearley that automagically feeds tokens to the grammar. Here's an example.

@Hardmath123 is it possible to define variables within the grammar.js scope from another file, without polluting global? It would be a great help if %ident token identifiers (which don't support property references) fell back to referencing tokens.ident, provided ident itself is not defined: const tokens = nm(require('./tokens')). Currently nm pollutes global || window, which isn't great!

@tjvr
Collaborator Author

tjvr commented Mar 20, 2017

@NanaLan It's not clear what you're asking. But I think what you want is to allow periods in %specs.

Hardmath—we should definitely allow periods in specs. I also think we should merge #198. Then we can think about adding a built-in Moo wrapper to Nearley.

@kach
Owner

kach commented Mar 20, 2017

Yeah, periods in %token names is probably unavoidable.

I'll go remind myself what #198 was all about.

@tjvr
Collaborator Author

tjvr commented Mar 25, 2017

We need to be able to define a lexer inside a .ne file, and export it from the compiled grammar so it's available for the parser wrapper to use.

I propose we add some sort of @export directive for this. Then you can define a moo lexer in the usual way, and make it available for the parser. Though the parser needs to know whether or not to use a custom tokeniser... perhaps a @lexer directive?

@kach
Owner

kach commented Mar 25, 2017

How important do you think it is for lexers to be defined within the .ne file?

@tjvr
Collaborator Author

tjvr commented Mar 25, 2017

Very. I want to keep all the language-related stuff in one place.

@kach
Owner

kach commented Mar 25, 2017

Okay, and you think it should be included as raw JS instead of some sort of moo-optimized lexer DSL? I guess that makes sense if we want to support non-moo lexers as well.

In that case, an @lexer directive might work well: the user calling .feed() might not even need to worry about the lexer at all because if nearleyc knows about the lexer, then .feed() can do the right thing automagically. No need to export anything.
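
From the user's side, that might look something like this (a sketch; nearley.Grammar.fromCompiled is today's API and was not yet settled at the time of this discussion):

var nearley = require('nearley');
var grammar = require('./grammar'); // compiled from a .ne file containing @lexer

var parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
parser.feed('some source text'); // tokenized internally by the declared lexer
console.log(parser.results);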

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

an @lexer directive might work well

What would this look like? Would the argument to @lexer be expected to follow some sort of nearley-lexer interface?

@kach
Owner

kach commented Mar 27, 2017

Yeah, I'm thinking you do

{%
var foolexer = moo.compile(…);
%}

@lexer foolexer

The argument to @lexer would be a variable pointing to an object with properties to (1) feed a string, (2) get the next token, (3) do what "matcher" does in #198. Everything else can then be wired up in nearley.js so that it Just Works for users.
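
Spelled out, the expected shape would be something along these lines (names illustrative, mirroring points (1)-(3) above):

var foolexer = {
  feed: function (chunk) { /* (1) buffer a string of source */ },
  next: function () { /* (2) return the next token, or undefined at the end */ },
  match: function (spec, token) { /* (3) the "matcher" from #198 */ }
};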

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

var foolexer = moo.compile(…);

Surely nearley.lexer(...) or some such? The interface moo exposes isn't the same as the one you describe.

@kach
Owner

kach commented Mar 27, 2017

Err, isn't it? lexer.feed() feeds a string, lexer.next() is the next token, and we can add a lexer.match() pretty easily. I guess we need lexer.reset(), too.

I was hoping we could standardize the interface around nearley<=>moo, and then other lexers can write simple wrappers to comply with that interface.
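
The wrapper idea might look like this (a sketch; the inner tokenizer and all of its methods are invented stand-ins for some third-party lexer):

function MooShim(inner) {
  this.inner = inner; // some non-moo tokenizer
}
MooShim.prototype.reset = function (data, info) {
  this.inner.init(data, info); // hypothetical underlying call
  return this;
};
MooShim.prototype.next = function () {
  var t = this.inner.nextToken(); // hypothetical underlying call
  if (!t) return undefined;
  return { type: t.kind, value: t.text, line: t.line, col: t.col }; // moo-shaped token
};
MooShim.prototype.save = function () {
  return this.inner.state(); // hypothetical underlying call
};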

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

moo also supports stateful lexers, so the interface needs to be a bit more complicated than that if we want rewind() to keep working. :-)

@kach
Owner

kach commented Mar 27, 2017

Uhhh… yes. Yikes. How would that even… work? :-/

So, another option is to define the lexer in a standard .js file and pass it in as an option when you create a new parser. We might be able to work out some magic to make token names available in the .ne file.

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

Have a look at https://github.com/tjvr/moo#states and https://github.com/tjvr/moo#reset — calling save() on a lexer gives you back an object describing the lexer state. We can do this after every token and store the result on Column; then rewind() can use that state to reset() the lexer.
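
In standalone form, using moo's actual save()/reset() signatures:

var moo = require('moo');

var input = 'ab cd ef';
var lexer = moo.compile({
  word: /[a-z]+/,
  ws: { match: /\s+/, lineBreaks: true },
});
lexer.reset(input);

// Record the lexer state and end offset after every token (the proposal is
// to store these on nearley's Column objects).
var states = [], ends = [];
var tok;
while ((tok = lexer.next())) {
  states.push(lexer.save());             // {line, col, state, ...}
  ends.push(tok.offset + tok.text.length);
}

// Rewind to just after token i: re-feed the remaining text along with the
// saved line/col/state info.
var i = 1;
lexer.reset(input.slice(ends[i]), states[i]);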

Standardising on some interface does actually seem reasonable; there has to be some interface, and it doesn't really matter if it's moo's interface or a special nearley-specific one. So I think that plan of yours is best.

(Ideally the rewind() API might look a bit more like moo's save/reset API; I might do that sometime in the future. I think I could re-implement rewind() in terms of Parser#reset().)

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

FWIW, I don't think Lexer#match() is something we should add to moo, but that's kind of an irrelevant detail.

BTW, would it be okay to rewrite nearley-bootstrapped to use moo? It's probably the easiest way for me to test this, and would conveniently solve #188 :-)

@tjvr
Collaborator Author

tjvr commented Mar 27, 2017

Here are the things I think we need:

  • Add new option match: (spec, token) -> bool (i.e. Make token matching semantics user-definable #198)
  • Add @require directive (so we can access moo, or any other tokeniser, from inside the .ne JavaScript)
  • Add @lexer directive: its argument is an object with these methods (sketched after this list):
    • clone() -> new Lexer
    • save() -> Info
    • reset("", Info) -> restore line/col/state info (for rewind)
    • feed(chunk: String)
    • next() -> {type, value, line, col, …}
  • Allow dots . in token specifiers %lex.string
  • Have some way of auto-generating custom specifiers (some nearley-specific helper?) Or can we change the meaning, so that %string compiles to {type: "string"}?
  • (Optional, but nice) Support disabling string expansion, and instead matching spec.literal against token.value.
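
Sketched as code, the directive's argument would satisfy an interface like this (a skeleton only; for the record, the @lexer support nearley eventually shipped settled on reset/next/save plus formatError/has):

var lexer = {
  clone: function () { /* return a fresh, independent Lexer */ },
  save: function () { /* return an Info object (line/col/state) */ },
  reset: function (chunk, info) { /* restore line/col/state from Info, take new input */ },
  feed: function (chunk) { /* accept a chunk of source */ },
  next: function () { /* return {type, value, line, col, ...} or undefined */ }
};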

The necessity of the @require & @lexer directives makes me feel like we're trying to re-invent JS/Node inside of our .ne files. I've mentioned before the idea of using ES6 template literals instead, and I still think it's a good one; however I'm guessing you're not yet ready to drop nearleyc entirely @Hardmath123? :-)

@kach
Owner

kach commented Mar 28, 2017

Have a look at https://github.com/tjvr/moo#states and https://github.com/tjvr/moo#reset — calling save() on a lexer gives you back an object describing the lexer state. We can do this after every token and store the result on Column; then rewind() can use that state to reset() the lexer.

That sounds expensive memory-wise. How important is rewinding-with-lexers, anyway?

FWIW, I don't think Lexer#match() is something we should add to moo, but that's kind of an irrelevant detail.

Where else would that functionality be? Someone needs to know how to check whether a token is of a particular type; it makes sense for the lexer to be the one to know.

The necessity of the @require & @lexer directives makes me feel like we're trying to re-invent JS/Node inside of our .ne files.

I think the conclusion here isn't "let's replace nearleyc", but rather "let's be more deliberate about the scope of nearleyc".

@tjvr
Collaborator Author

tjvr commented Mar 28, 2017

Where else would that functionality be?

The responsibility for match() should lie with whatever is responsible for generating the token specifiers.

@tjvr
Collaborator Author

tjvr commented Mar 28, 2017

That sounds expensive memory-wise.

Nah. Particularly compared to all the Column/Item data keepHistory is already, um, keeping.

@tjvr tjvr self-assigned this Mar 29, 2017
@tjvr
Collaborator Author

tjvr commented Aug 9, 2017

This is totally resolved by now!
