Built-in Tokenizer #166
Hello. I've created a feature-rich and fast tokenizer as part of another parsing toolkit. Example JSON lexer definition (only look at lines 9-23). Features:
|
Ideally our tokenizer, or a wrapper for it, should support the same API as This is primarily motivated by the |
I'm assuming feed(x) means to tokenize the input x. What does rewind do? What use cases is it for, particularly in the context of a lexer? |
feed() gives the parser a chunk of source code. rewind() resets the parser back to a previous position. These are both seriously useful for applications such as syntax highlighting, because you can cache previous parses (note: I'm talking about syntax highlighting that requires the use of a parser, which is fairly rare — see tosh for an example language that does this) |
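To make the feed()/rewind() contract concrete, here is a minimal sketch of a chunk-fed lexer with rewind support. The class name, the rule-list shape, and the internals are all illustrative assumptions, not nearley's actual implementation:

```javascript
// Illustrative sketch only: a chunk-fed lexer where rewind(n) discards
// tokens[n..] and moves the scan position back to where tokens[n] began.
class StreamingLexer {
  constructor(rules) {
    this.rules = rules;   // [[name, regex], ...] tried in order
    this.buffer = '';
    this.tokens = [];
    this.offset = 0;      // how far into `buffer` we have tokenized
  }
  // Hand the lexer another chunk of source; tokenize as far as possible.
  feed(chunk) {
    this.buffer += chunk;
    outer: while (this.offset < this.buffer.length) {
      for (const [name, re] of this.rules) {
        const sticky = new RegExp(re.source, 'y'); // anchor match at offset
        sticky.lastIndex = this.offset;
        const m = sticky.exec(this.buffer);
        if (m) {
          this.tokens.push({ type: name, value: m[0], offset: this.offset });
          this.offset = sticky.lastIndex;
          continue outer;
        }
      }
      break; // no rule matched: wait for more input (or report an error)
    }
    return this.tokens;
  }
  // Reset back to a previous position, keeping only the first n tokens.
  rewind(n) {
    if (n >= this.tokens.length) return;
    this.offset = this.tokens[n].offset;
    this.tokens.length = n;
  }
}
```

A caching syntax highlighter could then feed edited chunks and rewind to the last token unaffected by the edit instead of re-lexing from scratch.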
Alright. For a lexer, isn't rewinding to a previous position much simpler than for a parser, given that it's normally a flat structure? If the lexer input is a token vector, can't the previous positions be computed using the offset of the |
@Hardmath123 I've been working on adding our new standalone tokenizer library, moo. I've found some issues with your proposed scheme, and I want your thoughts... So your idea was to define the lexer inside the .ne file. @{%
var Lexer = (typeof module === 'object' && module.exports) ? require('./lexer') : nearley.Lexer;
var l = Lexer([
['string', /"((?:[^\\"\n]|\\["\\/bfnrt]|\\u[0-9a-fA-F]{4})*?)"/],
['whitraw', /(?:\s|\n)+/],
['comment', /#[^\n]*?$/],
['word', /[\w\?\+]+/],
])
%}
|
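For reference, a rule list like the one above is tried in order at each position, first match wins. A minimal, self-contained driver (illustrative only, not the Lexer() from the snippet; the comment rule is simplified to drop the `$` anchor) behaves like:

```javascript
// Illustrative driver for an ordered [name, regex] rule list, adapted
// from the JSON rules above; not the actual Lexer() implementation.
const rules = [
  ['string',  /"((?:[^\\"\n]|\\["\\/bfnrt]|\\u[0-9a-fA-F]{4})*?)"/],
  ['space',   /\s+/],
  ['comment', /#[^\n]*/],  // simplified: matches to end of line
  ['word',    /[\w?+]+/],
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const match = rules
      .map(([type, re]) => {
        const sticky = new RegExp(re.source, 'y'); // match exactly at pos
        sticky.lastIndex = pos;
        const m = sticky.exec(input);
        return m && { type, value: m[0] };
      })
      .find(Boolean); // first rule that matches wins
    if (!match) throw new Error('no rule matches at offset ' + pos);
    tokens.push(match);
    pos += match.value.length;
  }
  return tokens;
}
```

Because earlier rules shadow later ones, rule order matters: `string` must come before `word`, or a quote would never start a string token.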
I think |
How do you import postprocessors? |
Just like how you'd import anything else: |
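For illustration, a postprocessor can be pulled in through the same `@{% %}` prologue mechanism used for the lexer above; the module path and the `unwrap` name here are hypothetical:

```
@{%
const { unwrap } = require('./postprocessors'); // hypothetical module, e.g. unwrap = d => d[1]
%}

parenthesized -> "(" expr ")" {% unwrap %}
```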
Just published nearley-moo - a simple-ish wrapper for moo and nearley that automagically feeds tokens to the grammar. Here's an example. @Hardmath123 is it possible to define variables within the grammar.js scope through another file without polluting |
Yeah, periods in %token names are probably unavoidable. I'll go remind myself what #198 was all about. |
We need to be able to define a lexer inside a I propose we add some sort of |
How important do you think it is for lexers to be defined within the |
Very. I want to keep all the language-related stuff in one place. |
Okay, and you think it should be included as raw JS instead of some sort of moo-optimized lexer DSL? I guess that makes sense if we want to support non-moo lexers as well. In that case, an |
Yeah, I'm thinking you do
The argument to |
Surely |
Err, isn't it? I was hoping we could standardize the interface around nearley<=>moo, and then other lexers can write simple wrappers to comply with that interface. |
moo also supports stateful lexers, so the interface needs to be a bit more complicated than that if we want rewind() to keep working. :-) |
Uhhh… yes. Yikes. How would that even… work? :-/ So, another option is to define the lexer in a standard |
Have a look at https://github.com/tjvr/moo#states and https://github.com/tjvr/moo#reset — calling save() on a lexer gives you back an object describing the lexer state. We can do this after every token and store the result on Column; then rewind() can use that state to reset() the lexer. Standardising on some interface does actually seem reasonable; there has to be some interface, and it doesn't really matter if it's moo's interface or a special nearley-specific one. So I think that plan of yours is best. (Ideally the rewind() API might look a bit more like moo's save/reset API; I might do that sometime in the future. I think I could re-implement rewind() in terms of |
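The save()-after-every-token scheme described above can be sketched as follows. The StatefulLexer here is a toy stand-in, not moo; only the save()/reset() shape is modeled on moo's API, and the snapshot bookkeeping (stored per token, as it would be on each Column) is an assumption:

```javascript
// Toy stateful lexer: save() snapshots its state, reset() restores it.
// Modeled loosely on moo's save/reset; the details are illustrative.
class StatefulLexer {
  constructor()  { this.state = 'main'; this.index = 0; }
  save()         { return { state: this.state, index: this.index }; }
  reset(info)    { this.state = info.state; this.index = info.index; }
  next(tok)      {                       // consume one token, maybe switch state
    this.index += 1;
    if (tok === '{') this.state = 'object';
    return tok;
  }
}

// Parser side: snapshot the lexer after every token, so rewind(n) can
// restore the exact lexer state that existed before token n was consumed.
function run(lexer, tokens) {
  const snapshots = [lexer.save()];      // snapshots[i] = state before token i
  for (const t of tokens) {
    lexer.next(t);
    snapshots.push(lexer.save());
  }
  return snapshots;
}

function rewind(lexer, snapshots, n) {
  lexer.reset(snapshots[n]);             // back to just before token n
}
```

The memory cost is one small state object per token, which is what the "expensive memory-wise" concern below is about.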
FWIW, I don't think BTW, would it be okay to rewrite nearley-bootstrapped to use moo? It's probably the easiest way for me to test this, and would conveniently solve #188 :-) |
Here are the things I think we need:
The necessity of the |
That sounds expensive memory-wise. How important is rewinding-with-lexers, anyway?
Where else would that functionality be? Someone needs to know how to check whether a token is of a particular type; it makes sense for the lexer to be the one to know.
I think the conclusion here isn't "let's replace nearleyc", but rather "let's be more deliberate about the scope of nearleyc". |
The responsibility for match() should be the same thing which is responsible for generating the token specifiers. |
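One way to read that: the lexer both produces token specifiers for the grammar and implements the match() check, so the parser never inspects token internals. A minimal sketch, with all names (SpecLexer, has, token, match) invented for illustration:

```javascript
// Illustrative only: one object owns both specifier creation and matching,
// keeping the parser agnostic of what a token actually looks like.
class SpecLexer {
  constructor(types) { this.types = types; }
  has(name)          { return this.types.includes(name); }      // specifier exists?
  token(name)        { return { test: t => t.type === name }; } // specifier for the grammar
  match(spec, tok)   { return spec.test(tok); }                 // the parser calls this
}
```

If a different lexer represents tokens differently, it only has to supply its own token()/match() pair; the parser code is unchanged.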
Nah. Particularly compared to all the Column/Item data |
This is totally resolved by now! |
We should consider including a tokenizer in nearley.