Adds option to pass token end locations #201
…end location with end positions
Hello! Thanks for working on this :-) Unfortunately I don't think we can merge it—it's quite invasive, and passing the options down to individual States makes me sad and really isn't something we want to do! Ideally, you would combine location information with your characters/tokens before passing them to nearley, something like this:
Unfortunately it then becomes harder to match them in nearley, since the "tokens" are no longer equal. Custom tokens are actually something we've been thinking about very recently -- @Hardmath123 I think this is another argument for allowing user-defined matchers as per #198! This seems like a case where adding a custom test() function to each one of your literals is clearly less convenient than supplying a custom match function.
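To make the trade-off above concrete, here is a minimal sketch of pairing each character with its location before parsing. It is not the snippet from the original comment; `withLocations` and `literal` are hypothetical helper names, and the code only assumes that a literal can be described by an object with a `test()` function, as mentioned above.

```javascript
// Hypothetical helper: wrap every character with its start and end offsets
// (end is exclusive) so the location travels with the token.
function withLocations(input) {
  const tokens = [];
  for (let i = 0; i < input.length; i++) {
    tokens.push({ value: input[i], start: i, end: i + 1 });
  }
  return tokens;
}

// Hypothetical matcher: because tokens are now objects, a plain "+" literal
// no longer compares equal, so each literal needs its own test() function.
function literal(ch) {
  return { test: (token) => token.value === ch };
}

const locatedTokens = withLocations("a+b");
const plus = literal("+");
console.log(locatedTokens[1], plus.test(locatedTokens[1]));
```

Having to write a `literal(...)` wrapper for every terminal is the inconvenience the comment describes; a single user-supplied match function would remove that repetition.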
Returning full token location information in each AST node - which gives the ability to easily do semantic highlighting of various node types in the original input string. As far as passing options through, the easiest (read HACKY) way to achieve this functionality was just to thread the location through and pass it as a 4th argument to the postprocessor functions. However, the resulting signature:
Leaves a bit to be desired. Have you guys discussed any ideas for passing a global context through to the postprocessors? It seems like a very useful thing to support w/o having to resort to arbitrary globals. That was my motivation for trying to add options references, but usually I'd see that sort of thing achieved by a closure context instead of explicitly assigning a reference to each object instance.
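The closure approach mentioned here could look roughly like the sketch below. It assumes only nearley's documented postprocessor signature `(data, location, reject)`; the factory `makePostprocessors`, the `identifier` postprocessor, and the `variables` set are hypothetical names used for illustration, not part of nearley or of this PR.

```javascript
// Hypothetical factory: every postprocessor it returns closes over `context`,
// so no global and no per-instance options reference is required.
function makePostprocessors(context) {
  return {
    // Same shape as a nearley postprocessor: (data, location, reject).
    identifier(data, location, reject) {
      const name = data.join("");
      return {
        type: "identifier",
        name,
        start: location,
        // Context-sensitive information comes from the closure.
        known: context.variables.has(name),
      };
    },
  };
}

const post = makePostprocessors({ variables: new Set(["x"]) });
console.log(post.identifier(["x"], 0, {}));
console.log(post.identifier(["y"], 4, {}));
```

This keeps the context out of the parser internals, but it does not by itself supply an end position; that still has to come from the parser or from the tokens.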
Smells like #63. That might be something we should revisit.
Hold on; I'd forgotten postprocessors already take
It's not possible, afaics, to calculate the end index at each node without requiring that you return that information from every parsed terminal, which is a major performance hit and adds grammar complexity when dealing with discarded whitespace terminals.
Have you considered using a tokenizer?
In stricter languages I certainly prefer to, but token boundary classification would be fairly challenging for this particular DSL, unfortunately.
@Hardmath123 How about in postprocessors binding It would trivially solve @EricMCornelius's problem; you could just read It might even make the people who want
As an aside, the PEG.js semantic predicates provide the full location information for rule applications, which is what I've used in the past and am most familiar with (https://github.com/pegjs/pegjs#--predicate-). It certainly does make quality error messages significantly easier to generate when you want to do (lazy) context-sensitive error generation during parsing (undefined variable names, etc.)
Nearley doesn't allow giving a rejection reason, since many postprocessors may run, or parsing may even continue, after a rejection; see #195.
I'd argue that valid production matches with invalid context are great times to generate quality errors. Undefined variable names are one of the canonical examples. That being said, it's off the main topic, which is that there's definitely value in having the end positions of production matches available, if only for applications like syntax highlighting and making semantic modifications to original input strings.
I might be misunderstanding, but can't you get the "end" offset by subtracting 1 from the "start" offset of the next item in your AST?
I don't want to clutter up the AST by generating nodes for whitespace and other punctuation in this particular case (operators, parentheses, etc.), and whitespace could be arbitrarily long. This makes it impossible to determine which section of the input was actually matched by a given production rule in a general manner.
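A small illustration of why the "next start minus 1" estimate breaks down once whitespace and operators are discarded. The input string and node shapes below are made up for this sketch and are not taken from the DSL under discussion.

```javascript
// Hypothetical input: two operands separated by whitespace and an operator
// that the grammar discards, so only the operands appear in the AST.
const sampleInput = "foo   && bar";
const startOnlyNodes = [
  { name: "foo", start: 0 },
  { name: "bar", start: 9 },
];

// The suggested estimate: a node ends one character before the next starts.
function guessEnd(nodes, i) {
  return nodes[i + 1].start - 1;
}

// Inclusive end, so add 1 for slice().
const guessedText = sampleInput.slice(
  startOnlyNodes[0].start,
  guessEnd(startOnlyNodes, 0) + 1
);

// The guess swallows the trailing whitespace and the operator:
console.log(JSON.stringify(guessedText)); // "foo   && "
```

The estimate is only correct when every character between two nodes belongs to the first node, which is exactly what stops being true when separators are dropped from the tree.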
Drastically simplified example:
Here I'd want to have an AST representing the boolean expression tree, with leaf nodes representing comparison operations. I don't need location information specifically about the parentheses or the arbitrarily structured whitespace components. Only the fields, values, and boolean expression boundaries are necessary, but there's no straightforward way to get the full extent w/o major amounts of explicit bookkeeping when bubbling up data.
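For readers following along, the kind of explicit bookkeeping described here might look like the sketch below. The expression syntax, the node constructors `comparison` and `and`, and the hard-coded offsets are all hypothetical; in a real grammar each leaf would have to obtain its own end offset somehow, which is the missing piece under discussion.

```javascript
// Hypothetical input for a boolean expression DSL.
const exprInput = "(a = 1 AND b = 2)";

// Leaf node: must already know both its start and its (exclusive) end.
function comparison(field, value, start, end) {
  return { type: "comparison", field, value, start, end };
}

// Inner node: derives its extent from its children by hand.
function and(left, right) {
  return {
    type: "and",
    left,
    right,
    start: Math.min(left.start, right.start),
    end: Math.max(left.end, right.end),
  };
}

const tree = and(comparison("a", "1", 1, 6), comparison("b", "2", 11, 16));
console.log(exprInput.slice(tree.start, tree.end));
```

Note that the derived span covers only the children, so the enclosing parentheses at offsets 0 and 16 fall outside it; capturing them would require yet more manual threading of positions through the rules.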
I'm closing this since we're not going to merge it; however you're welcome to continue the discussion here or on my proposed solution here: #203
Yep.
Hi there,
I've been working on an expression parsing language with syntax highlighting and thus found a need for the end locations for each rule application to be passed to the postprocessor functions. Not sure if there's a better alternative here, but it seemed sensible to thread the parser options reference down to the Column and State objects for future configuration flexibility.
Not sure if this is the best approach, as I only started familiarizing myself with this project yesterday, so please let me know if there is a preferable way to achieve this.